• isnormal and non-FP values: possible defect

    From Vincent Lefevre@21:1/5 to All on Wed Jul 28 11:25:03 2021
    The latest C2x draft N2596 says for isnormal:

    The isnormal macro determines whether its argument value is normal
    (neither zero, subnormal, infinite, nor NaN). [...]

    (like in C99). But there may be values that are neither normal, zero, subnormal, infinite, nor NaN, e.g. for long double on PowerPC, where
    the double-double format is used. This is allowed by 5.2.4.2.2p4:
    "and values that are not floating-point numbers, such as infinities
    and NaNs" ("such as", not limited to). Note that these additional
    values may be in the normal range, or outside (with an absolute value
    less than the minimum positive normal number or greater than the
    maximum normal number).

    What should the behavior be for these values, in particular when
    they are in the normal range, i.e. with an absolute value between
    the minimum positive normal number and the maximum normal number?

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From James Kuyper@21:1/5 to Vincent Lefevre on Wed Jul 28 08:32:43 2021
    On 7/28/21 7:25 AM, Vincent Lefevre wrote:
    The latest C2x draft N2596 says for isnormal:

    The isnormal macro determines whether its argument value is normal
    (neither zero, subnormal, infinite, nor NaN). [...]

    For reference: that's 7.12.3.5p2.

    (like in C99). But there may be values that are neither normal, zero, subnormal, infinite, nor NaN, e.g. for long double on PowerPC, where
    the double-double format is used. This is allowed by 5.2.4.2.2p4:
    "and values that are not floating-point numbers, such as infinities
    and NaNs" ("such as", not limited to). Note that these additional
    values may be in the normal range, or outside (with an absolute value
    less than the minimum positive normal number or greater than the
    maximum normal number).

    What should the behavior be for these values, in particular when
    they are in the normal range, i.e. with an absolute value between
    the minimum positive normal number and the maximum normal number?

    7.12p12, describing the number classification macros, allows that
    "Additional implementation-defined floating-point classifications, with
    macro definitions beginning with FP_ and an uppercase letter, may also
    be specified by the implementation."

    7.12.3.1p2 says
    "The fpclassify macro classifies its argument value as NaN, infinite,
    normal, subnormal, zero, or into another implementation-defined category."

    7.12.3.5p3 says
    "The isnormal macro returns a nonzero value if and only if its argument
    has a normal value."

    Since the standard explicitly allows that there may be other implementation-defined categories, 7.12.3.5p2 and 7.12.3.5p3 conflict on
    an implementation where other categories are supported. I would
    recommend that this discrepancy be resolved in favor of 7.12.3.5p3.

    This conflict was not introduced in C2x; all of the relevant wording
    was the same in C99.

  • From Vincent Lefevre@21:1/5 to James Kuyper on Thu Jul 29 08:38:27 2021
    In article <sdripc$n9b$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    7.12p12, describing the number classification macros, allows that
    "Additional implementation-defined floating-point classifications, with
    macro definitions beginning with FP_ and an uppercase letter, may also
    be specified by the implementation."

    7.12.3.1p2 says
    "The fpclassify macro classifies its argument value as NaN, infinite,
    normal, subnormal, zero, or into another implementation-defined category."

    7.12.3.5p3 says
    "The isnormal macro returns a nonzero value if and only if its argument
    has a normal value."

    Since the standard explicitly allows that there may be other implementation-defined categories, 7.12.3.5p2 and 7.12.3.5p3 conflict on
    an implementation where other categories are supported. I would
    recommend that this discrepancy be resolved in favor of 7.12.3.5p3.

    Now I'm wondering about the practical consequences. The fact that there
    may exist non-FP numbers between the minimum positive normal number
    and the maximum one may have been overlooked by everyone, and users
    might use isnormal() to check whether the value is finite and larger
    than the minimum positive normal number in absolute value.

    GCC assumes the following definition in gcc/builtins.c:

    /* isnormal(x) -> isgreaterequal(fabs(x),DBL_MIN) &
    islessequal(fabs(x),DBL_MAX). */

    which is probably what the users expect, and a more useful definition
    in practice. Thoughts?

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From James Kuyper@21:1/5 to Vincent Lefevre on Thu Jul 29 13:05:06 2021
    On 7/29/21 4:38 AM, Vincent Lefevre wrote:
    In article <sdripc$n9b$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    7.12p12, describing the number classification macros, allows that
    "Additional implementation-defined floating-point classifications, with
    macro definitions beginning with FP_ and an uppercase letter, may also
    be specified by the implementation."

    7.12.3.1p2 says
    "The fpclassify macro classifies its argument value as NaN, infinite,
    normal, subnormal, zero, or into another implementation-defined category."

    7.12.3.5p3 says
    "The isnormal macro returns a nonzero value if and only if its argument
    has a normal value."

    Since the standard explicitly allows that there may be other
    implementation-defined categories, 7.12.3.5p2 and 7.12.3.5p3 conflict on
    an implementation where other categories are supported. I would
    recommend that this discrepancy be resolved in favor of 7.12.3.5p3.

    Now I'm wondering about the practical consequences. The fact that there
    may exist non-FP numbers between the minimum positive normal number
    and the maximum one may have been overlooked by everyone, and users
    might use isnormal() to check whether the value is finite and larger
    than the minimum positive normal number in absolute value.

    GCC assumes the following definition in gcc/builtins.c:

    /* isnormal(x) -> isgreaterequal(fabs(x),DBL_MIN) &
    islessequal(fabs(x),DBL_MAX). */

    which is probably what the users expect, and a more useful definition
    in practice. Thoughts?


    I think it would be highly inappropriate for isnormal(x) to produce a
    different result than (fpclassify(x)==FP_NORMAL) for any value of x.

    I was not familiar with the term "double-double". A check on Wikipedia
    led me to <https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic>,
    but it doesn't describe the format in sufficient detail to clarify why there
    might be a classification problem. Could you give an example of a value
    that cannot be classified as either infinite, NaN, normal, subnormal, or
    zero? In particular, I'm not sure what you mean by a "non-FP number".

  • From Vincent Lefevre@21:1/5 to Vincent Lefevre on Fri Jul 30 15:36:54 2021
    In article <20210730150844$4673@zira.vinc17.org>,
    Vincent Lefevre <vincent-news@vinc17.net> wrote:

    In article <sdun42$9co$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    [...] In particular, I'm not sure what you mean by a "non-FP number".

    [...]

    So x is a number with more than the 106-bit precision of normal
    numbers, and both fpclassify(x) and isnormal(x) regard it as a
    "normal number", which should actually be interpreted as being
    in the range of the normal numbers.

    Note: LDBL_MANT_DIG = 106 because with this format, there are
    107-bit numbers that cannot be represented exactly (near the
    overflow threshold).

    Note about this point: in the ISO C model defined in 5.2.4.2.2,
    the sum goes from k = 1 to p. Thus this model does not allow
    floating-point numbers to have more than a p-digit precision,
    where p = LDBL_MANT_DIG for long double (see the *_MANT_DIG
    definitions). Increasing the value of p here would mean that
    some floating-point numbers would not be exactly representable
    (actually many of them), which is forbidden (the text from the
    ISO C standard is not very clear, but this seems rather obvious,
    otherwise this could invalidate common error analysis).
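
    For reference, the floating-point model in 5.2.4.2.2p2 is (in LaTeX
    notation)

        x = s \cdot b^e \sum_{k=1}^{p} f_k \, b^{-k}

    where s = +-1 is the sign, b the base, e an integer exponent between
    e_min and e_max, and each digit f_k a nonnegative integer less than b;
    note the upper limit p of the sum.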

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From Vincent Lefevre@21:1/5 to James Kuyper on Fri Jul 30 15:24:22 2021
    In article <sdun42$9co$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    I think it would be highly inappropriate for isnormal(x) to produce a different result than (fpclassify(x)==FP_NORMAL) for any value of x.

    I agree. This should go with fpclassify.

    I was not familiar with the term "double-double". A check on Wikipedia
    led me to <https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic>,
    but it doesn't describe the format in sufficient detail to clarify why there
    might be a classification problem. Could you give an example of a value
    that cannot be classified as either infinite, NaN, normal, subnormal, or zero? In particular, I'm not sure what you mean by a "non-FP number".

    Here's a test program for long double, which shows the issue
    on PowerPC. Here, 1 + 2^(-120) is exactly representable as a
    double-double number (the exact sum of 1 and 2^(-120), which
    are both double-precision numbers). This is verified by the
    output of x - 1, which gives 2^(-120) instead of 0.

    ------------------------------------------------------------------
    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main (void)
    {
      volatile long double x = 1 + 0x1.p-120L;
      int c;

      printf ("LDBL_MANT_DIG = %d\n", (int) LDBL_MANT_DIG);
      printf ("x - 1 = %La\n", x - 1);

      c = fpclassify (x);
      printf ("fpclassify(x) = %s\n",
              c == FP_NAN ? "FP_NAN" :
              c == FP_INFINITE ? "FP_INFINITE" :
              c == FP_ZERO ? "FP_ZERO" :
              c == FP_SUBNORMAL ? "FP_SUBNORMAL" :
              c == FP_NORMAL ? "FP_NORMAL" : "unknown");

      printf ("isfinite/isnormal/isnan/isinf(x) = %d/%d/%d/%d\n",
              isfinite (x), isnormal (x), isnan (x), isinf (x));

      return 0;
    }
    ------------------------------------------------------------------

    I get the following result:

    LDBL_MANT_DIG = 106
    x - 1 = 0x1p-120
    fpclassify(x) = FP_NORMAL
    isfinite/isnormal/isnan/isinf(x) = 1/1/0/0

    So x is a number with more than the 106-bit precision of normal
    numbers, and both fpclassify(x) and isnormal(x) regard it as a
    "normal number", which should actually be interpreted as being
    in the range of the normal numbers.

    Note: LDBL_MANT_DIG = 106 because with this format, there are
    107-bit numbers that cannot be represented exactly (near the
    overflow threshold).
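
    To make the representation explicit, here is a minimal sketch of how
    such a value is stored (the struct and helper are hypothetical and for
    illustration only; the real ABI works directly on the long double
    object):

    ------------------------------------------------------------------
    /* A double-double value is the unevaluated sum of two doubles,
       hi + lo, with |lo| at most about half an ulp of hi.  For the
       example above, x = 1 + 2^(-120) is stored exactly as the pair
       (1.0, 0x1p-120). */
    struct dd { double hi, lo; };

    static long double dd_value (struct dd v)
    {
      /* On a double-double long double, this sum is exact. */
      return (long double) v.hi + v.lo;
    }
    ------------------------------------------------------------------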

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From James Kuyper@21:1/5 to Vincent Lefevre on Fri Jul 30 18:40:35 2021
    On 7/30/21 11:36 AM, Vincent Lefevre wrote:
    In article <20210730150844$4673@zira.vinc17.org>,
    Vincent Lefevre <vincent-news@vinc17.net> wrote:

    In article <sdun42$9co$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    [...] In particular, I'm not sure what you mean by a "non-FP number".

    [...]

    So x is a number with more than the 106-bit precision of normal
    numbers, and both fpclassify(x) and isnormal(x) regard it as a
    "normal number", which should actually be interpreted as being
    in the range of the normal numbers.

    Note: LDBL_MANT_DIG = 106 because with this format, there are
    107-bit numbers that cannot be represented exactly (near the
    overflow threshold).

    Note about this point: in the ISO C model defined in 5.2.4.2.2,
    the sum goes from k = 1 to p. Thus this model does not allow
    floating-point numbers to have more than a p-digit precision,
    where p = LDBL_MANT_DIG for long double (see the *_MANT_DIG
    definitions). ...

    OK - I now have a better understanding of the point you're making. A key
    thing to understand about that model is in footnote 21: "The
    floating-point model is intended to clarify the description of each floating-point characteristic and does not require the floating-point arithmetic of the implementation to be identical."

    For every statement that the C standard makes in terms of the model, it
    should have another statement that is NOT in terms of the model. The
    statement that is in terms of the model should be understood as a
    non-normative clarification of the other one.

    For example, 5.2.4.2.2p13 says that DBL_EPSILON is "the difference
    between 1 and the least value greater than 1 that is representable in
    the given floating point type, b^(1−p)" (I use ^ to indicate that the last
    part of that expression is in superscript). The part before the comma is
    the normative definition of DBL_EPSILON. The part after the comma is a non-normative mathematical expression that would, if the floating point
    format fits the model, give the correct value for DBL_EPSILON.

    However, what I'd forgotten is that despite the fact that the standard
    "does not require the floating-point arithmetic ... to be identical", 5.2.4.2.2p2 constitutes the official definition of what a "floating
    point number" is.

    ... Increasing the value of p here would mean that
    some floating-point numbers would not be exactly representable
    (actually many of them), which is forbidden (the text from the
    ISO C standard is not very clear, but this seems rather obvious,
    otherwise this could invalidate common error analysis).

    The standard frequently applies the adjective "representable" to
    floating point numbers - that would be redundant if all floating point
    numbers were required to be representable. I think the format you
    describe should be considered to have p (and therefore, LDBL_MANT_DIG)
    large enough to include all representable numbers, even if that would
    mean that not all floating point numbers are representable.

    "In addition to normalized floating-point numbers ( f_1 > 0 if x ≠ 0), floating types may be able to contain other kinds of floating-point
    numbers, such as subnormal floating-point numbers (x ≠ 0, e = e_min,
    f_1 = 0) and unnormalized floating-point numbers (x ≠ 0, e > e_min, f_1
    = 0), and values that are not floating-point numbers, such as infinities
    and NaNs." (5.2.4.2.2p3)

    (I've used underscores to indicate subscripts in the original text).

    The phrases "subnormal floating-point numbers" and "unnormalized
    floating-point numbers" are italicized, an ISO convention indicating
    that the containing sentence is the official definition of those terms.
    Oddly enough, "normalized floating-point numbers" is not italicized,
    despite being followed by a similar description. Normalized, subnormal,
    and unnormalized floating-point numbers are all defined/described in
    terms of whether f_1, the leading base-b digit, is zero. The lower-order
    base-b digits have no role to play in any of those
    definitions/descriptions. It doesn't matter how many of those other
    digits there are, and it therefore shouldn't matter if that number is
    variable.

    Therefore, I would guess that a double-double value of the form a+b,
    where fabs(a) > fabs(b), should be classified as normal iff a is normal,
    and as subnormal iff a is subnormal - which would fit the behavior you
    describe for the implementation you were using.
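
    A minimal sketch of that classification rule, assuming direct access
    to the two double components (hypothetical helper, not part of any
    standard interface):

    ------------------------------------------------------------------
    #include <math.h>

    /* Classify the double-double value a+b (with fabs(a) > fabs(b)) by
       its high part alone: FP_NORMAL iff a is normal, FP_SUBNORMAL iff
       a is subnormal, and so on.  The low part does not affect the
       class. */
    static int dd_fpclassify (double a, double b)
    {
      (void) b;
      return fpclassify (a);
    }
    ------------------------------------------------------------------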

  • From Fred J. Tydeman@21:1/5 to Vincent Lefevre on Sat Jul 31 16:34:37 2021
    On Fri, 30 Jul 2021 15:24:22 UTC, Vincent Lefevre <vincent-news@vinc17.net> wrote:

    So x is a number with more than the 106-bit precision of normal
    numbers, and both fpclassify(x) and isnormal(x) regard it as a
    "normal number", which should actually be interpreted as being
    in the range of the normal numbers.

    isnormal() is not a test of whether the number is normalized.
    Your program results match the C standard.

  • From Vincent Lefevre@21:1/5 to James Kuyper on Sun Aug 1 11:08:27 2021
    In article <se1v53$47u$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    The standard frequently applies the adjective "representable" to
    floating point numbers - that would be redundant if all floating point numbers were required to be representable. I think the format you
    describe should be considered to have p (and therefore, LDBL_MANT_DIG)
    large enough to include all representable numbers, even if that would
    mean that not all floating point numbers are representable.

    I think that this would be rather useless in practice (completely
    unusable for error analysis). And if an implementation chooses to
    represent pi exactly (with a special encoding, as part of the
    "values that are not floating-point numbers")?

    Until now, *_MANT_DIG has always meant that all FP numbers from the
    model are representable, AFAIK. That's probably why for long double
    on PowerPC (double-double format), LDBL_MANT_DIG is 106 and not 107,
    while almost all 107-bit FP numbers are representable (this fails
    only near the overflow threshold).

    "In addition to normalized floating-point numbers ( f_1 > 0 if x ≠ 0), floating types may be able to contain other kinds of floating-point
    numbers, such as subnormal floating-point numbers (x ≠ 0, e = e_min,
    f_1 = 0) and unnormalized floating-point numbers (x ≠ 0, e > e_min, f_1
    = 0), and values that are not floating-point numbers, such as infinities
    and NaNs." (5.2.4.2.2p3)

    (I've used underscores to indicate subscripts in the original text).

    The phrases "subnormal floating-point numbers" and "unnormalized
    floating-point numbers" are italicized, an ISO convention indicating
    that the containing sentence is the official definition of those terms.
    Oddly enough, "normalized floating-point numbers" is not italicized,
    despite being followed by a similar description. Normalized, subnormal,
    and unnormalized floating-point numbers are all defined/described in
    terms of whether f_1, the leading base-b digit, is zero. The lower-order
    base-b digits have no role to play in any of those
    definitions/descriptions. It doesn't matter how many of those other
    digits there are, and it therefore shouldn't matter if that number is
    variable.

    But since f_1 is defined by the formula in 5.2.4.2.2p2, this means that
    it is defined only with no more than p digits, where p = LDBL_MANT_DIG
    for long double.

    Therefore, I would guess that a double-double value of the form a+b,
    where fabs(a) > fabs(b), should be classified as normal iff a is normal,
    and as subnormal iff a is subnormal - which would fit the behavior you describe for the implementation you were using.

    The standard would still need to extend the definition of f_1 to
    a number of digits larger than p (possibly infinite).

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From Vincent Lefevre@21:1/5 to Fred J. Tydeman on Sun Aug 1 11:23:02 2021
    In article <Jd11RFbi8Eoq-pn2-eUKoqCC1dXdb@ECS16501512.domain>,
    Fred J. Tydeman <tydeman.consulting@sbcglobal.net> wrote:

    On Fri, 30 Jul 2021 15:24:22 UTC, Vincent Lefevre <vincent-news@vinc17.net> wrote:

    So x is a number with more than the 106-bit precision of normal
    numbers, and both fpclassify(x) and isnormal(x) regard it as a
    "normal number", which should actually be interpreted as being
    in the range of the normal numbers.

    isnormal() is not a test of whether the number is normalized.
    Your program results match the C standard.

    The standard says: "The isnormal macro determines whether its argument
    value is normal (neither zero, subnormal, infinite, nor NaN)."

    If you mean that "normal" (which is not defined) means "neither zero, subnormal, infinite, nor NaN", then this would exclude implementations
    with non-FP values less than the minimum normalized value, or make
    them very awkward, as such values would be classified as normal.
    And it would not be equivalent to fpclassify(x) == FP_NORMAL, as
    7.12.3.1 says:

    The fpclassify macro classifies its argument value as NaN, infinite,
    normal, subnormal, zero, or into another implementation-defined
    category.

    Note the "or into another implementation-defined category", which fits
    the "neither zero, subnormal, infinite, nor NaN". Therefore, one would
    have fpclassify(x) != FP_NORMAL, but isnormal(x) would be true. IMHO,
    this is wrong.
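
    To make the inconsistency concrete, suppose an implementation defined
    an extra category for such values (FP_UNNORMAL is a hypothetical macro
    name, used only for this sketch):

    ------------------------------------------------------------------
    #include <math.h>

    /* Hypothetical: x is in the normal range but is not a p-digit
       floating-point number, and the implementation classifies it into
       its own category FP_UNNORMAL (not a standard macro). */
    void illustrate (long double x)
    {
      int c = fpclassify (x);  /* == FP_UNNORMAL, i.e. != FP_NORMAL */
      int n = isnormal (x);    /* nonzero under the 7.12.3.5p2 wording,
                                  since x is neither zero, subnormal,
                                  infinite, nor NaN */
      (void) c; (void) n;      /* the two answers disagree */
    }
    ------------------------------------------------------------------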

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From antispam@math.uni.wroc.pl@21:1/5 to Vincent Lefevre on Sun Aug 1 20:56:04 2021
    Vincent Lefevre <vincent-news@vinc17.net> wrote:
    In article <se1v53$47u$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    The standard frequently applies the adjective "representable" to
    floating point numbers - that would be redundant if all floating point numbers were required to be representable. I think the format you
    describe should be considered to have p (and therefore, LDBL_MANT_DIG) large enough to include all representable numbers, even if that would
    mean that not all floating point numbers are representable.

    I think that this would be rather useless in practice (completely
    unusable for error analysis). And if an implementation chooses to
    represent pi exactly (with a special encoding, as part of the
    "values that are not floating-point numbers")?

    Until now, *_MANT_DIG has always meant that all FP numbers from the
    model are representable, AFAIK. That's probably why for long double
    on PowerPC (double-double format), LDBL_MANT_DIG is 106 and not 107,
    while almost all 107-bit FP numbers are representable (this fails
    only near the overflow threshold).
    <snip>
    But since f_1 is defined by the formula in 5.2.4.2.2p2, this means that
    it is defined only with no more than p digits, where p = LDBL_MANT_DIG
    for long double.

    Therefore, I would guess that a double-double value of the form a+b,
    where fabs(a) > fabs(b), should be classified as normal iff a is normal, and as subnormal iff a is subnormal - which would fit the behavior you describe for the implementation you were using.

    The standard would still need to extend the definition of f_1 to
    a number of digits larger than p (possibly infinite).

    AFAICS double-double is a poor fit to the standard. Your implementation
    made a sane choice. From the point of view of standard purity, your
    implementation probably should have a conformance statement explaining
    what LDBL_MANT_DIG means and that double-double does not fit the
    standard model.

    To put it differently: make sane choices and document in a conformance
    statement any discrepancies between those choices and the letter of the
    standard. Attempting to claim that your implementation satisfies the
    letter of the standard by inventing values that are not FP numbers,
    but which otherwise behave in arithmetic like FP numbers, is insane.

    --
    Waldek Hebisch

  • From James Kuyper@21:1/5 to Vincent Lefevre on Sun Aug 1 19:15:27 2021
    On 8/1/21 7:08 AM, Vincent Lefevre wrote:
    In article <se1v53$47u$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    The standard frequently applies the adjective "representable" to
    floating point numbers - that would be redundant if all floating point
    numbers were required to be representable. I think the format you
    describe should be considered to have p (and therefore, LDBL_MANT_DIG)
    large enough to include all representable numbers, even if that would
    mean that not all floating point numbers are representable.

    I think that this would be rather useless in practice (completely
    unusable for error analysis). And if an implementation chooses to
    represent pi exactly (with a special encoding, as part of the
    "values that are not floating-point numbers")?

    Until now, *_MANT_DIG has always meant that all FP numbers from the
    model are representable, AFAIK. ...


    When you use a floating-point format that is not a good fit to the
    standard's model, you're inherently going to be breaking some
    assumptions that are based upon that model - the only question is, which
    ones.
    Error analysis for double-double will necessarily be quite different
    from the error analysis that applies to a format that fits the
    standard's model, regardless of what value you choose for LDBL_MANT_DIG.
    You can do excessively conservative error analysis based upon ignoring
    those differences and deliberately choosing an inaccurate value for LDBL_MANT_DIG that is sufficiently small, but the cost of doing so is
    that any double-double value which has more than LDBL_MANT_DIG base-b
    digits of precision no longer qualifies as a floating-point number.

    Note: when searching for "floating-point number", remember to include
    the '-'.

    I had thought that this would be disastrous, but when I looked more
    carefully I was reminded that the most important clauses I was
    worried about (such as 3.9 and 6.4.4.2p3) don't depend in any way upon
    the term "floating-point number".

    The consequences of this decision are therefore relatively limited
    outside of 5.2.4.2.2. Many standard library functions that take floating
    point arguments have defined behavior only when passed a floating-point
    number. In some cases, the behavior is also defined if they are passed a
    NaN or an infinity. The behavior is undefined, by omission of any
    definition, when the number doesn't fall into any of those categories.
    This gives such an implementation the freedom to produce precisely
    whatever result users of such an implementation would prefer it to
    produce - so this is not as severe a consequence as I had thought.

    The phrases "subnormal floating-point numbers" and "unnormalized
    floating-point numbers" are italicized, an ISO convention indicating
    that the containing sentence is the official definition of those terms.
    Oddly enough, "normalized floating-point numbers" is not italicized,
    despite being followed by a similar description. Normalized, subnormal,
    and unnormalized floating-point numbers are all defined/described in
    terms of whether f_1, the leading base-b digit, is zero. The lower-order
    base-b digits have no role to play in any of those
    definitions/descriptions. It doesn't matter how many of those other
    digits there are, and it therefore shouldn't matter if that number is
    variable.

    But since f_1 is defined by the formula in 5.2.4.2.2p2, this means that
    it is defined only with no more than p digits, where p = LDBL_MANT_DIG
    for long double.

    No, f_1 can trivially be identified as the leading base-b digit,
    regardless of whether or not there are too many base-b digits for the representation to qualify as a floating-point number.

    Therefore, I would guess that a double-double value of the form a+b,
    where fabs(a) > fabs(b), should be classified as normal iff a is normal,
    and as subnormal iff a is subnormal - which would fit the behavior you
    describe for the implementation you were using.

    The standard would still need to extend the definition of f_1 to
    a number of digits larger than p (possibly infinite).

    Why would it be infinite? My understanding is that the largest number of mantissa digits required by a double-double value would be for the value represented by a+b, where a is DBL_MAX and b is DBL_MIN. That only
    requires a finite number of digits (though it is ridiculously large).
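
    As a rough check (assuming IEEE binary64 components): DBL_MAX has its
    leading bit at 2^1023 and DBL_MIN is 2^-1022, so representing their
    exact sum needs a significand of

        p >= 1023 - (-1022) + 1 = 2046 bits,

    and if the low part could be subnormal (down to 2^-1074), this grows
    to 1023 + 1074 + 1 = 2098 bits.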

  • From Vincent Lefevre@21:1/5 to antispam@math.uni.wroc.pl on Mon Aug 2 18:33:45 2021
    In article <se71p4$7a2$1@z-news.wcss.wroc.pl>,
    antispam@math.uni.wroc.pl wrote:

    AFAICS double-double is a poor fit to the standard. Your implementation
    made a sane choice. From the point of view of standard purity, your
    implementation probably should have a conformance statement explaining
    what LDBL_MANT_DIG means and that double-double does not fit the
    standard model.

    Double-double fits the standard model, with a subset being
    floating-point numbers. The LDBL_MANT_DIG value is conforming.
    I don't see anything wrong there. It is not me who suggested
    changing the meaning of LDBL_MANT_DIG in the standard.

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From Vincent Lefevre@21:1/5 to James Kuyper on Mon Aug 2 19:05:09 2021
    In article <se79uh$ejo$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 8/1/21 7:08 AM, Vincent Lefevre wrote:
    In article <se1v53$47u$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    The standard frequently applies the adjective "representable" to
    floating point numbers - that would be redundant if all floating point
    numbers were required to be representable. I think the format you
    describe should be considered to have p (and therefore, LDBL_MANT_DIG)
    large enough to include all representable numbers, even if that would
    mean that not all floating point numbers are representable.

    I think that this would be rather useless in practice (completely
    unusable for error analysis). And if an implementation chooses to
    represent pi exactly (with a special encoding, as part of the
    "values that are not floating-point numbers")?

    Until now, *_MANT_DIG has always meant that all FP numbers from the
    model are representable, AFAIK. ...

    When you use a floating-point format that is not a good fit to the
    standard's model, you're inherently going to be breaking some
    assumptions that are based upon that model - the only question is, which ones.

    This is FUD. Double-double is a good fit to the standard's model.
    There are additional numbers (as allowed by the standard), so that
    this won't follow the same behavior as IEEE FP, but so does
    contraction of FP expressions.

    Error analysis for double-double will necessarily be quite different
    from the error analysis that applies to a format that fits the
    standard's model, regardless of what value you choose for LDBL_MANT_DIG.

    Assuming some error bound in ulp (which must be done whatever the
    format), the error analysis will be the same.

    The consequences of this decision are therefore relatively limited
    outside of 5.2.4.2.2. Many standard library functions that take floating point arguments have defined behavior only when passed a floating-point number.

    The main ones, like expl, have a defined behavior for other real
    values: "The exp functions compute the base-e exponential of x.
    A range error occurs if the magnitude of x is too large."
    Note that x is *not* required to be a floating-point number.

    However, the accuracy is not defined, even if x is a floating-point
    number.

    In some cases, the behavior is also defined if they are passed a
    NaN or an infinity.

    Only in Annex F, AFAIK.

    The phrases "subnormal floating-point numbers" and "unnormalized
    floating-point numbers" are italicized, an ISO convention indicating
    that the containing sentence is the official definition of those terms.
    Oddly enough, "normalized floating-point numbers" is not italicized,
    despite being followed by a similar description. Normalized, subnormal,
    and unnormalized floating-point numbers are all defined/described in
    terms of whether f_1, the leading base-b digit, is zero. The lower-order
    base-b digits have no role to play in any of those
    definitions/descriptions. It doesn't matter how many of those other
    digits there are, and it therefore shouldn't matter if that number is
    variable.

    But since f_1 is defined by the formula in 5.2.4.2.2p2, this means that
    it is defined only with no more than p digits, where p = LDBL_MANT_DIG
    for long double.

    No, f_1 can trivially be identified as the leading base-b digit,
    regardless of whether or not there are too many base-b digits for the representation to qualify as a floating-point number.

    It can, but this is not what the standard says. This can be solved
    if p is replaced by the infinity over the sum symbol in the formula.

    Therefore, I would guess that a double-double value of the form a+b,
    where fabs(a) > fabs(b), should be classified as normal iff a is normal,
    and as subnormal iff a is subnormal - which would fit the behavior you
    describe for the implementation you were using.

    The standard would still need to extend the definition of f_1 to
    a number of digits larger than p (possibly infinite).

    Why would it be infinite?

    So that the definition remains valid with a format that chooses to
    make pi exactly representable, for instance.

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From James Kuyper@21:1/5 to Vincent Lefevre on Mon Aug 2 19:19:32 2021
    On 8/2/21 3:05 PM, Vincent Lefevre wrote:
    In article <se79uh$ejo$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 8/1/21 7:08 AM, Vincent Lefevre wrote:
    ...
    Until now, *_MANT_DIG has always meant that all FP numbers from the
    model are representable, AFAIK. ...

    When you use a floating-point format that is not a good fit to the
    standard's model, you're inherently going to be breaking some
    assumptions that are based upon that model - the only question is, which
    ones.

    This is FUD. Double-double is a good fit to the standard's model.
    There are additional numbers (as allowed by the standard), so that
    this won't follow the same behavior as IEEE FP, but so does
    contraction of FP expressions.

    We're clearly using "good fit" in different senses. I would consider a
    floating point format to be a "perfect fit" to the C standard's model if
    every number representable in that format qualifies as a floating-point
    number, and every number that qualifies as a floating point number is representable in that format. "Goodness of fit" would depend upon how
    closely any given format approaches that ideal. There's no single value
    of the model parameters that makes both of those statements even come
    close to being true for double-double format.

    Error analysis for double-double will necessarily be quite different
    from the error analysis that applies to a format that fits the
    standard's model, regardless of what value you choose for LDBL_MANT_DIG.

    Assuming some error bound in ulp (which must be done whatever the
    format), the error analysis will be the same.

    Given a number in double-double format represented by a+b, where
    fabs(a) > fabs(b) && fabs(b) > 0, with x being the largest power of FLT_RADIX
    that is no larger than b, 1 ulp will be DBL_EPSILON*x. That expression
    will vary over a very large dynamic range while the number being
    represented by a+b changes only negligibly. That is very different from conventional formats, where 1 ulp is determined by the magnitude of the
    entire number being represented, and remains constant while that number
    varies by a factor of FLT_EPSILON. The only way I can see to avoid
    complicating your error analysis to deal with that fact is to ignore it, despite it being relevant.
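
    A minimal sketch of that ulp notion (hypothetical helper; assumes
    FLT_RADIX == 2 and fabs(a) > fabs(b) > 0):

    ------------------------------------------------------------------
    #include <math.h>
    #include <float.h>

    /* One ulp of the double-double value a+b, in the sense described
       above: DBL_EPSILON times the largest power of 2 not exceeding
       fabs(b). */
    static double dd_ulp (double a, double b)
    {
      (void) a;                           /* only the low part matters */
      double x = ldexp (1.0, ilogb (b));  /* largest power of 2 <= fabs(b) */
      return DBL_EPSILON * x;
    }
    ------------------------------------------------------------------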

    The consequences of this decision are therefore relatively limited
    outside of 5.2.4.2.2. Many standard library functions that take floating
    point arguments have defined behavior only when passed a floating-point
    number.

    The main ones, like expl, have a defined behavior for other real
    values: "The exp functions compute the base-e exponential of x.
    A range error occurs if the magnitude of x is too large."
    Note that x is *not* required to be a floating-point number.

    The frexp(), ldexp(), and fabs() functions have specified behavior only
    when the double argument qualifies as a floating point number.

    The printf() family of functions have specified behavior only for floating-point numbers, NaN's and infinities, when using f,F,e,E,g,G,a,
    or A formats.

    The ceil() and floor() functions have a return value which is required
    to be a floating-point number.

    The scanf() family of functions (when using f,F,e,E,g,G,a, or A formats)
    and strtod() are required to return floating-point numbers, NaN's, or infinities.

    and, of course, this also applies to the float, long double, and complex versions of all of those functions, and to the wide-character version of
    the text functions.

    In some cases, the behavior is also defined if they are passed a
    NaN or an infinity.

    Only in Annex F, AFAIK.

    No, the descriptions of printf() and scanf() families of functions also
    allow for NaNs and infinities.

    ...
    But since f_1 is defined by the formula in 5.2.4.2.2p2, this means that
    it is defined only with no more than p digits, where p = LDBL_MANT_DIG
    for long double.

    No, f_1 can trivially be identified as the leading base-b digit,
    regardless of whether or not there are too many base-b digits for the
    representation to qualify as a floating-point number.

    It can, but this is not what the standard says. This can be solved
    if p is replaced by the infinity over the sum symbol in the formula.

    The C standard defines what normalized, subnormal, and unnormalized floating-point numbers are; it does not define what those terms mean for
    things that could qualify as floating-point numbers, but only if
    LDBL_MANT_DIG were larger. However, the extension of those concepts to
    such representations is trivial. I think that increasing the value of LDBL_MANT_DIG to allow them to qualify as floating-point numbers is the
    more appropriate approach.

    If p is replaced by infinity, it no longer defines a floating point
    format. Such formats are defined, in part, by their finite maximum value
    of p.

    Therefore, I would guess that a double-double value of the form a+b,
    where fabs(a) > fabs(b), should be classified as normal iff a is normal,
    and as subnormal iff a is subnormal - which would fit the behavior you
    describe for the implementation you were using.

    The standard would still need to extend the definition of f_1 to
    a number of digits larger than p (possibly infinite).

    Why would it be infinite?

    So that the definition remains valid with a format that chooses to
    make pi exactly representable, for instance.

    I see no significant value in changing the standard to allow such a
    format. I thought we were only discussing what's needed to modify C's
    model so it fits double-double format.

  • From Vincent Lefevre@21:1/5 to James Kuyper on Tue Aug 3 17:01:41 2021
    In article <se9ui6$m1q$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 8/2/21 3:05 PM, Vincent Lefevre wrote:
    In article <se79uh$ejo$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 8/1/21 7:08 AM, Vincent Lefevre wrote:
    ...
    Until now, *_MANT_DIG has always meant that all FP numbers from the
    model are representable, AFAIK. ...

    When you use a floating-point format that is not a good fit to the
    standard's model, you're inherently going to be breaking some
    assumptions that are based upon that model - the only question is, which ones.

    This is FUD. Double-double is a good fit to the standard's model.
    There are additional numbers (as allowed by the standard), so that
    this won't follow the same behavior as IEEE FP, but so does
    contraction of FP expressions.

    We're clearly using "good fit" in different senses. I would consider a floating point format to be a "perfect fit" to the C standard's model if every number representable in that format qualifies as a floating-point number, and every number that qualifies as a floating point number is representable in that format. "Goodness of fit" would depend upon how
    closely any given format approaches that ideal. There's no single value
    of the model parameters that makes both of those statements even come
    close to being true for double-double format.

    The C standard does not have a notion of "goodness of fit", and it is
    certainly not in the rationale. Strict IEEE floating point is nice as
    it allows one to use specific algorithms based on its semantics. One
    goal is to provide more precision and more accuracy; this is exactly
    what double-double does, providing at least 106-bit precision, with
    accurate algorithms based on the IEEE 754 semantics of double-precision
    numbers. But for that, the IEEE semantics must be strictly followed;
    anything that helps in improving the precision (though it is a good
    thing in some contexts) will make the algorithms fail.

    Otherwise, the goal is not to fit the floating-point model, but to
    get as much accuracy as possible and bounds with a conventional error
    analysis model. For instance, the goal of double-double is to provide
    a 106-bit precision at least (107-bit precision almost everywhere),
    following the FP model. There are additional representable numbers,
    but they are not a problem: for the error analysis, they will typically
    be ignored (though in some cases, one can take advantage of them),
    i.e. they won't have much influence on error bounds (if any); for
    the actual result, you get these additional values for free compared
    to a strict 106-bit FP arithmetic.

    The choice of double-double as the "long double" format on PowerPC
    is better than plain double precision (which the C standard would
    allow) in terms of precision. And it is faster than a software
    implementation of binary128 (a.k.a. quadruple precision). These were
    regarded as better criteria than "goodness of fit".

    Error analysis for double-double will necessarily be quite different
    from the error analysis that applies to a format that fits the
    standard's model, regardless of what value you choose for LDBL_MANT_DIG.

    Assuming some error bound in ulp (which must be done whatever the
    format), the error analysis will be the same.

    Given a number in double-double format represented by a+b, where
    fabs(a) > fabs(b) && fabs(b) > 0, with x being the largest power of
    FLT_RADIX that is no larger than b, 1 ulp will be DBL_EPSILON*x.
    That expression will vary over a very large dynamic range while the
    number being represented by a+b changes only negligibly.
    [...]

    The notion of ulp is defined (and usable in practice) only with a
    floating-point format. The best way to apply it to double-double
    is to define it with respect to a floating-point format that is a
    subset of the double-double format. If double has a 53-bit precision,
    this gives a 106-bit precision, hence the choice of LDBL_MANT_DIG = 106.
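
    A sketch of that alternative (hypothetical helper; ulp taken with
    respect to the p = LDBL_MANT_DIG model format, for finite nonzero x):

    ------------------------------------------------------------------
    #include <math.h>
    #include <float.h>

    /* ulp of x in the 106-bit model format: 2^(E - (p - 1)) where
       2^E <= |x| < 2^(E+1) and p = LDBL_MANT_DIG. */
    static long double dd_model_ulp (long double x)
    {
      return ldexpl (1.0L, ilogbl (x) - (LDBL_MANT_DIG - 1));
    }
    ------------------------------------------------------------------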

    The consequences of this decision are therefore relatively limited
    outside of 5.2.4.2.2. Many standard library functions that take floating
    point arguments have defined behavior only when passed a floating-point
    number.

    The main ones, like expl, have a defined behavior for other real
    values: "The exp functions compute the base-e exponential of x.
    A range error occurs if the magnitude of x is too large."
    Note that x is *not* required to be a floating-point number.

    The frexp(), ldexp(), and fabs() functions have specified behavior only
    when the double argument qualifies as a floating point number.

    Well, in the particular case of double-double, they can easily be
    generalized to non-floating-point numbers, with some drawbacks:
    frexp and ldexp may introduce rounding. In practice, they can just
    be defined by the implementation. This is not worse than the fact
    that the standard doesn't define the accuracy of the floating-point
    operations and functions.

    The printf() family of functions have specified behavior only for floating-point numbers, NaN's and infinities, when using f,F,e,E,g,G,a,
    or A formats.

    The ceil() and floor() functions have a return value which is required
    to be a floating-point number.

    The scanf() family of functions (when using f,F,e,E,g,G,a, or A formats)
    and strtod() are required to return floating-point numbers, NaN's, or infinities.

    and, of course, this also applies to the float, long double, and complex versions of all of those functions, and to the wide-character version of
    the text functions.

    Ditto, they can be defined by the implementation.

    In some cases, the behavior is also defined if they are passed a
    NaN or an infinity.

    Only in Annex F, AFAIK.

    No, the descriptions of printf() and scanf() families of functions also
    allow for NaNs and infinities.

    OK, but this is not very useful, as you need Annex F or definitions
    by the implementation to know how infinities and NaNs are handled
    by most operations, starting with the basic arithmetic operations.

    ...
    But since f_1 is defined by the formula in 5.2.4.2.2p2, this means that
    it is defined only with no more than p digits, where p = LDBL_MANT_DIG
    for long double.

    No, f_1 can trivially be identified as the leading base-b digit,
    regardless of whether or not there are too many base-b digits for the
    representation to qualify as a floating-point number.

    It can, but this is not what the standard says. This can be solved
    if p is replaced by the infinity over the sum symbol in the formula.

    The C standard defines what normalized, subnormal, and unnormalized floating-point numbers are; it does not define what those terms mean for things that could qualify as floating-point numbers, but only if LDBL_MANT_DIG were larger. However, the extension of those concepts to
    such representations is trivial. I think that increasing the value of LDBL_MANT_DIG to allow them to qualify as floating-point numbers is the
    more appropriate approach.

    If p is replaced by infinity, it no longer defines a floating point
    format. Such formats are defined, in part, by their finite maximum value
    of p.

    So, you mean that to follow the definition of "normal" used with
    double-double on PowerPC, LDBL_MANT_DIG needs to be increased to
    a large value (something like e_max - e_min + 53), even though
    not all floating-point values would be representable?

    But then, various specifications would be incorrect, such as:

    * LDBL_MAX, as (1 - b^(-p)) b^emax would not be representable
    (p would be too large).

    * LDBL_EPSILON would no longer make any sense and would not be
    representable, as b^(1-p) would be too small.

    * frexp, because the condition "value equals x times 2^(*exp)" could
    not always be satisfied (that's probably why the standard says "If
    value is not a floating-point number [...]", which is OK if all
    floating-point numbers are assumed to be exactly representable).

    In summary, you're just replacing a problem by several problems.

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

  • From James Kuyper@21:1/5 to Vincent Lefevre on Tue Aug 3 18:55:24 2021
    On 8/3/21 1:01 PM, Vincent Lefevre wrote:
    In article <se9ui6$m1q$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 8/2/21 3:05 PM, Vincent Lefevre wrote:...
    This is FUD. Double-double is a good fit to the standard's model.
    There are additional numbers (as allowed by the standard), so that
    this won't follow the same behavior as IEEE FP, but so does
    contraction of FP expressions.

    We're clearly using "good fit" in different senses. I would consider a
    floating point format to be a "perfect fit" to the C standard's model if
    every number representable in that format qualifies as a floating-point
    number, and every number that qualifies as a floating point number is
    representable in that format. "Goodness of fit" would depend upon how
    closely any given format approaches that ideal. There's no single value
    of the model parameters that makes both of those statements even come
    close to being true for double-double format.

    The C standard does not have a notion of "goodness of fit",

    Agreed - this is simply a matter of ordinary English usage, not standard-defined usage. When trying to use the standard's model to
    describe a particular format runs into problems (see below), that format
    is a bad fit to the model.

    ...
    following the FP model. There are additional representable numbers,
    but they are not a problem: for the error analysis, they will typically
    be ignored (though in some cases, one can take advantage of them),

    As I said - one way to make the error analysis easier is to
    ignore the complications that would have to be dealt with to do it
    properly. That doesn't make the simpler analysis proper.

    ...
    Assuming some error bound in ulp (which must be done whatever the
    format), the error analysis will be the same.

    Given a number in double-double format represented by a+b, where
    fabs(a) > fabs(b) && fabs(b) > 0, with x being the largest power of
    FLT_RADIX that is no larger than b, 1 ulp will be DBL_EPSILON*x.
    That expression will vary over a very large dynamic range while the
    number being represented by a+b changes only negligibly.
    [...]

    The notion of ulp is defined (and usable in practice) only with a floating-point format.

    It's perfectly well-defined as the difference between two consecutive representable numbers. That definition ties directly into the way that
    the term is used. If you want to use a number larger than that one, you shouldn't call it ulp.

    ...
    The frexp(), ldexp(), and fabs() functions have specified behavior only
    when the double argument qualifies as a floating point number.

    Well, in the particular case of double-double, they can easily be
    generalized to non-floating-point numbers, with some drawbacks:
    frexp and ldexp may introduce rounding. In practice, they can just
    be defined by the implementation. This is not worse than the fact
    that the standard doesn't define the accuracy of the floating-point operations and functions.

    It's trivial to implement frexpl() for double-double format in terms of
    frexp() and ldexp() applied to the double components of the number, and
    similarly for ldexpl(). It's not the difficulty of implementation that
    bothers me - it's the fact that your approach gratuitously gives most
    calls to the long-double versions of those functions undefined behavior.
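
    A minimal sketch of that idea, assuming direct access to the two
    double components (hypothetical helper; a real frexpl() would operate
    on the long double object itself):

    ------------------------------------------------------------------
    #include <math.h>

    /* frexpl() for a double-double value given as (hi, lo) with
       |hi| >= |lo|: scale both parts by the exponent of hi.  The scaled
       low part can underflow, which is where rounding may creep in. */
    static long double dd_frexpl (double hi, double lo, int *exp)
    {
      double h = frexp (hi, exp);    /* hi == h * 2^(*exp), 0.5 <= |h| < 1 */
      double l = ldexp (lo, -*exp);  /* may round/underflow if lo is tiny */
      return (long double) h + l;
    }
    ------------------------------------------------------------------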

    The printf() family of functions have specified behavior only for
    floating-point numbers, NaN's and infinities, when using f,F,e,E,g,G,a,
    or A formats.

    The ceil() and floor() functions have a return value which is required
    to be a floating-point number.

    The scanf() family of functions (when using f,F,e,E,g,G,a, or A formats)
    and strtod() are required to return floating-point numbers, NaN's, or
    infinities.

    and, of course, this also applies to the float, long double, and complex
    versions of all of those functions, and to the wide-character version of
    the text functions.

    Ditto, they can be defined by the implementation.

    There would be no need for the behavior to be implementation-defined if
    the proper choice had been made for LDBL_MANT_DIG.

    ...
    No, the descriptions of printf() and scanf() families of functions also
    allow for NaNs and infinities.

    OK, but this is not much useful, as you need Annex F or definitions
    by the implementation to know how infinities and NaNs are handled
    by most operations, starting with the basic arithmetic operations.

    The only reason I brought that up was because the simplest statement of
    my point "only defined if passed a floating point number", did not apply
    to those functions. The point is still that those functions have
    gratuitously unspecified behavior just because you don't want to choose
    a more accurate value for LDBL_MANT_DIG.

    ...
    The C standard defines what normalized, subnormal, and unnormalized
    floating-point numbers are; it does not define what those terms mean for
    things that could qualify as floating-point numbers, but only if
    LDBL_MANT_DIG were larger. However, the extension of those concepts to
    such representations is trivial. I think that increasing the value of
    LDBL_MANT_DIG to allow them to qualify as floating-point numbers is the
    more appropriate approach.

    If p is replaced by infinity, it no longer defines a floating point
    format. Such formats are defined, in part, by their finite maximum value
    of p.

    So, you mean that to follow the definition of "normal" used with double-double on PowerPC, LDBL_MANT_DIG needs to be increased to
    a large value (something like e_max - e_min + 53), even though
    not all floating-point values would be representable?

    But then, various specifications would be incorrect, such as:

    * LDBL_MAX, as (1 - b^(-p)) b^emax would not be representable
    (p would be too large).

    As I said earlier, the standard's textual descriptions are the ones that
    are normative - the formulas merely show what the correct value
    would be when using a floating-point format that is a good fit to the
    standard's model.

    LDBL_MAX is, by definition, the "maximum representable finite
    floating-point number". By definition, it must be representable. If I
    understand the double-double format correctly, the maximum representable
    value is represented by a+b when a and b are both DBL_MAX, so LDBL_MAX
    should be exactly twice DBL_MAX. The fact that the formula gives the
    wrong result is an indication that this format is not a good fit.

    * LDBL_EPSILON would no longer make any sense and would not be
    representable, as b^(1-p) would be too small.

    LDBL_EPSILON is "the difference between 1 and the least value greater
    than 1 that is representable in the given floating point type". If I
    understand the double-double format correctly, the least value greater
    than 1 that is representable as a double-double is 1 + DBL_MIN, so
    LDBL_EPSILON should be DBL_MIN. Again, the formula you cite would give
    the wrong result, which is an indication that double-double is a poor
    fit to the standard's model.
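
    A small inspection sketch for reference: it just prints the relevant
    <float.h> parameters so they can be compared by hand with the model
    formulas (1 - b^(-p)) b^emax for *_MAX and b^(1-p) for *_EPSILON; how
    well they match on a double-double target is the point in dispute here.

        #include <float.h>
        #include <stdio.h>

        int main(void)
        {
            printf("FLT_RADIX     = %d\n", FLT_RADIX);
            printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);
            printf("LDBL_MAX_EXP  = %d\n", LDBL_MAX_EXP);
            printf("LDBL_MAX      = %La\n", LDBL_MAX);
            printf("DBL_MAX       = %a\n",  DBL_MAX);
            printf("LDBL_EPSILON  = %La\n", LDBL_EPSILON);
            printf("DBL_EPSILON   = %a\n",  DBL_EPSILON);
            printf("DBL_MIN       = %a\n",  DBL_MIN);
            return 0;
        }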

    * frexp, because the condition "value equals x times 2^(*exp)" could
    not always be satisfied (that's probably why the standard says "If
    value is not a floating-point number [...]", which is OK if all
    floating-point numbers are assumed to be exactly representable).

    Again, that's merely a sign that double-double is not a good fit to the standard's model. Given a finite x that isn't a NaN,

    y = frexpl(x, &exp);
    z = ldexpl(y, exp);

    z should compare equal to x, and that would be trivial to arrange for a floating point format that was a good fit to C's model, but that's not
    possible for all values in double-double format.
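
    A minimal test harness for that round trip, assuming the implementation
    provides frexpl() and ldexpl(); what it reports for double-double values
    that are not floating-point numbers in the standard's sense is precisely
    the open question:

        #include <float.h>
        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            long double samples[] = { 1.0L, 0.1L, 1.0L + LDBL_EPSILON,
                                      LDBL_MIN, LDBL_MAX };
            int n = sizeof samples / sizeof samples[0];

            for (int i = 0; i < n; i++) {
                int e;
                long double y = frexpl(samples[i], &e); /* y in [1/2, 1), ideally */
                long double z = ldexpl(y, e);           /* should reconstruct x    */
                if (z != samples[i])
                    printf("mismatch for sample %d: %La became %La\n",
                           i, samples[i], z);
            }
            return 0;
        }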

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vincent Lefevre@21:1/5 to James Kuyper on Wed Aug 4 00:57:20 2021
    In article <691aadf4-74c7-e739-d93a-5907d00f6bb5@alumni.caltech.edu>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 8/3/21 1:01 PM, Vincent Lefevre wrote:
    In article <se9ui6$m1q$1@dont-email.me>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 8/2/21 3:05 PM, Vincent Lefevre wrote:...
    This is FUD. Double-double is a good fit to the standard's model.
    There are additional numbers (as allowed by the standard), so the
    behavior won't be the same as with IEEE FP; but the same is true of
    contraction of FP expressions.

    We're clearly using "good fit" in different senses. I would consider a
    floating point format to be a "perfect fit" to the C standard's model if
    every number representable in that format qualifies as a floating-point
    number, and every number that qualifies as a floating point number is
    representable in that format. "Goodness of fit" would depend upon how
    closely any given format approaches that ideal. There's no single value
    of the model parameters that makes both of those statements even come
    close to being true for double-double format.

    The C standard does not have a notion of "goodness of fit",

    Agreed - this is simply a matter of ordinary English usage, not
    standard-defined usage. When trying to describe a particular format with
    the standard's model runs into problems (see below), that format is a
    bad fit to the model.

    There is no ordinary English usage. Either the format is conforming, in
    which case everything is fine (possibly with some drawbacks for the
    user), or it is non-conforming.

    ...
    Assuming some error bound in ulp (which must be done whatever the
    format), the error analysis will be the same.

    Given a number in double-double format represented by a+b, where
    fabs(a) > fabs(b) && fabs(b) > 0, with x being the largest power of
    FLT_RADIX that is no larger than fabs(b), 1 ulp will be DBL_EPSILON*x.
    That expression will vary over a very large dynamic range while the
    number being represented by a+b changes only negligibly.
    [...]
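
    A sketch of the spacing described in that paragraph, assuming the value
    is given by its double components a and b with fabs(a) > fabs(b) > 0
    (dd_ulp is an illustrative name, not an existing function):

        #include <float.h>
        #include <math.h>

        double dd_ulp(double a, double b)
        {
            (void)a;                 /* the spacing is governed by the low part */
            int e;
            frexp(fabs(b), &e);      /* fabs(b) == m * 2^e with m in [0.5, 1) */
            double x = ldexp(1.0, e - 1);  /* largest power of 2 <= fabs(b) */
            return DBL_EPSILON * x;  /* one ulp of the low component */
        }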

    The notion of ulp is defined (and usable in practice) only with a floating-point format.

    It's perfectly well-defined as the difference between two consecutive
    representable numbers. That definition ties directly into the way that
    the term is used. If you want to use a number larger than that one, you
    shouldn't call it ulp.

    The ulp is defined only for floating-point formats (or with a
    reference floating-point format). If you have a source defining
    the ulp for other kinds of formats, I would like to know...

    [...]
    Ditto, they can be defined by the implementation.

    There would be no need for the behavior to be implementation-defined if
    the proper choice had been made for LDBL_MANT_DIG.

    As I've said below, this would lead to other issues (more important ones).

    The only reason I brought that up was because the simplest statement of
    my point, "only defined if passed a floating-point number", did not
    apply to those functions. The point is still that those functions have
    gratuitously unspecified behavior just because you don't want to choose
    a more accurate value for LDBL_MANT_DIG.

    This is very silly. I did *not* choose the value of LDBL_MANT_DIG.
    It was probably chosen a long time ago by the initial vendor
    of the compiler for PowerPC (IBM?), and this value has been used
    by various compilers, including GCC, for a long time. And AFAIK,
    no one has found any issue with it.

    Choosing a larger value could potentially break many programs that use
    LDBL_MANT_DIG. It would also defeat the purpose of the minimum values
    for *_MANT_DIG required by the standard. For instance, the standard says
    that DBL_MANT_DIG must be at least 53, so that the implementer is not
    tempted to define a low-precision double type. But if the implementer is
    allowed to choose arbitrarily large values for DBL_MANT_DIG, not
    reflecting the real precision, what's the point?

    [if a large value were chosen for LDBL_MANT_DIG]
    But then, various specifications would be incorrect, such as:

    * LDBL_MAX, as (1 - b^(-p)) b^emax would not be representable
    (p would be too large).

    As I said earlier, the standard's textual descriptions are the ones that
    are normative - the formulas merely show what the correct value would be
    when using a floating-point format that is a good fit to the standard's
    model.

    This is not what the standard says. Some parts have even been corrected
    in C2x (currently N2596) to take the double-double format into account.
    See DR 467 and N2092[*], which introduced FLT_NORM_MAX, DBL_NORM_MAX,
    and LDBL_NORM_MAX.

    [*] N2092 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2092.htm>
    says at the end:

    Existing practice

    For most implementations, these three macros will be the same as the
    corresponding *_MAX macros. The only known case where that is not true
    is those where long double is implemented as a pair of doubles (and
    then only LDBL_MAX will differ from LDBL_NORM_MAX).

    BTW, setting LDBL_MANT_DIG to a large value would make LDBL_MAX and
    LDBL_NORM_MAX necessarily equal. This is clearly not the intent.
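
    For reference, a tiny probe (assuming a <float.h> recent enough to
    provide LDBL_NORM_MAX, as in the C2x drafts) shows whether the two
    macros differ on a given target, which per the "Existing practice" note
    above is expected only where long double is a pair of doubles:

        #include <float.h>
        #include <stdio.h>

        int main(void)
        {
        #ifdef LDBL_NORM_MAX
            printf("LDBL_NORM_MAX = %La\n", LDBL_NORM_MAX);
            printf("LDBL_MAX      = %La\n", LDBL_MAX);
            printf("equal: %d\n", LDBL_NORM_MAX == LDBL_MAX);
        #else
            puts("LDBL_NORM_MAX is not defined by this <float.h>");
        #endif
            return 0;
        }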

    LDBL_MAX is, by definition, "maximum representable finite floating-point number". By definition, it must be representable.

    But it would make the sentence "if that number is normalized, its
    value is (1 - b^(-p)) b^emax" from C2x incorrect.

    If I understand double-double format correctly, the maximum
    representable value is represented by a+b when a and b are both
    DBL_MAX,
    [...]

    No, a and b must not overlap. And apparently, there must be at least
    one zero bit between a and b (this is natural from the algorithms, based
    on rounding to nearest, but this can no longer be followed near the
    overflow threshold). This is still an issue with the definition in
    GCC:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61399

    * LDBL_EPSILON would no longer make any sense and would not be
    representable, as b^(1-p) would be too small.

    LDBL_EPSILON is "the difference between 1 and the least value greater
    than 1 that is representable in the given floating point type".
    [...]

    This was corrected in C2x: "the difference between 1 and the least
    normalized value greater than 1 that is representable in the given floating-point type, b^(1-p)"

    So, with LDBL_MANT_DIG = 106, this is fine. But if you want it to be
    e_max - e_min + 53, then b^(1-p) would be too small to be representable.

    Note: I think that this correction was done with double-double in
    mind, so that LDBL_MANT_DIG = 106 was assumed to be the correct
    value (like with DR 467 and N2092).

    * frexp, because the condition "value equals x times 2^(*exp)" could
    not always be satisfied (that's probably why the standard says "If
    value is not a floating-point number [...]", which is OK if all
    floating-point numbers are assumed to be exactly representable).

    Again, that's merely a sign that double-double is not a good fit to the standard's model.

    Again, this is not a notion of the standard. What you suggest would
    make the spec in the standard invalid, which makes your suggestion
    impossible.

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vincent Lefevre@21:1/5 to Vincent Lefevre on Wed Aug 4 01:51:09 2021
    In article <20210803233337$5946@zira.vinc17.org>,
    Vincent Lefevre <vincent-news@vinc17.net> wrote:

    In article <691aadf4-74c7-e739-d93a-5907d00f6bb5@alumni.caltech.edu>,
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    The only reason I brought that up was because the simplest statement of
    my point, "only defined if passed a floating-point number", did not
    apply to those functions. The point is still that those functions have
    gratuitously unspecified behavior just because you don't want to choose
    a more accurate value for LDBL_MANT_DIG.

    This is very silly. I did *not* choose the value of LDBL_MANT_DIG.
    It was probably chosen a long time ago by the initial vendor
    of the compiler for PowerPC (IBM?), and this value has been used
    by various compilers, including GCC, for a long time. And AFAIK,
    no one has found any issue with it.

    Choosing a larger value could potentially break many programs that use
    LDBL_MANT_DIG. It would also defeat the purpose of the minimum values
    for *_MANT_DIG required by the standard. For instance, the standard says
    that DBL_MANT_DIG must be at least 53, so that the implementer is not
    tempted to define a low-precision double type. But if the implementer is
    allowed to choose arbitrarily large values for DBL_MANT_DIG, not
    reflecting the real precision, what's the point?

    BTW, the current C2x draft N2596 says in 5.2.4.2.2p4:

    Floating types shall be able to represent zero (all f_k == 0) and
    all /normalized floating-point numbers/ (f_1 > 0 and all possible
    k digits and e exponents result in values representable in the
    type).

    (I suppose that "k" in "k digits" is a typo; it should be "p".)

    There are 2 changes compared to C17:
    * "normalized floating-point numbers" is now in italic, which
    makes it a definition as expected;
    * it is now explicit that *all* normalized floating-point numbers
    are required to be representable.

    With these 2 changes, it is now clear that LDBL_MANT_DIG cannot
    be larger than 107, as there are numbers with 108 digits that
    are not exactly representable (for any exponent). 106 is a valid
    value. Whether 107 is valid or not depends on some choices made
    for the exponent DBL_MAX_EXP: if the values with e = DBL_MAX_EXP
    are regarded as normalized floating-point numbers, then 107 is
    not possible.

    --
    Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
    100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
    Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)