// numeric_limits<long double>::digits=64.
//
typedef long double FType;
FType x=numeric_limits<FType>::max();
int iexp;
int64_t mint;
x=::frexpl(x,&iexp);
x=::ldexpl(x,numeric_limits<FType>::digits);
mint= static_cast<int64_t>(x);
Result (mint) is a negative number, something not right!!!
On Fri, 1 Oct 2021 08:37:21 -0700 (PDT)
wij <wyniijj@gmail.com> wrote:
> // numeric_limits<long double>::digits=64.
> //
> typedef long double FType;
> FType x=numeric_limits<FType>::max();
> int iexp;
> int64_t mint;
> x=::frexpl(x,&iexp);
> x=::ldexpl(x,numeric_limits<FType>::digits);
> mint= static_cast<int64_t>(x);
> Result (mint) is a negative number, something not right!!!
It might help if you set iexp to some value.
On 01.10.2021 at 17:37, wij wrote:
> // numeric_limits<long double>::digits=64.
> //
> typedef long double FType;
> FType x=numeric_limits<FType>::max();
> int iexp;
> int64_t mint;
> x=::frexpl(x,&iexp);
> x=::ldexpl(x,numeric_limits<FType>::digits);
> mint= static_cast<int64_t>(x);
> Result (mint) is a negative number, something not right!!!
Take this:
#include <iostream>
#include <cstdint>
#include <limits>
#include <utility>
#include <iomanip>
using namespace std;
using mantissa_pair = pair<bool, uint64_t>;
mantissa_pair getMantissa( double value );
int main()
{
    double v = numeric_limits<double>::min();
    do
    {
       mantissa_pair mp = getMantissa( v );
       cout << "value: " << v;
       if( mp.first )
           cout << " mantissa: " << hex << mp.second << endl;
       else
           cout << " invalid mantissa (Inf, S(NaN))" << endl;
    } while( (v *= 2.0) != numeric_limits<double>::infinity() );
}
static_assert(numeric_limits<double>::is_iec559, "must be standard fp");
mantissa_pair getMantissa( double value )
{
    union
    {
       uint64_t binary;
       double value;
    } u;
    u.value = value;
    unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;
    if( exponent == 0 )
       return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu | 0x10000000000000u );
    if( exponent == 0x7FF )
       return pair<bool, uint64_t>( false, 0 );
    return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu | 0x10000000000000u );
}
> mantissa_pair getMantissa( double value )
> {
>      union
>      {
>         uint64_t binary;
>         double value;
>      } u;
>      u.value = value;
>      unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;
>      if( exponent == 0 )
>         return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu | 0x10000000000000u );
Should be:
         return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu );
>      if( exponent == 0x7FF )
>         return pair<bool, uint64_t>( false, 0 );
>      return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu | 0x10000000000000u );
> }
mantissa_pair getMantissa( double value )
{
    union
    {
       uint64_t binary;
       double value;
    } u;
    u.value = value;
    unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;
    if( exponent == 0x7FF )
       return pair<bool, uint64_t>( false, 0 );
    uint64_t hiBit = (uint64_t)(exponent != 0) << 52;
    return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu | hiBit );
}
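For comparison, the same extraction can be sketched with std::memcpy instead of reading the inactive union member. This is only an illustration, assuming IEEE 754 binary64 `double`; the name `getMantissaMemcpy` is made up here and does not appear in the thread.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#include <utility>

static_assert(std::numeric_limits<double>::is_iec559, "must be standard fp");
static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");

// Hypothetical name. Same contract as getMantissa above: first == false for
// Inf/NaN; otherwise second holds the significand, with the implicit leading
// bit set for normal values and clear for denormals and zero.
std::pair<bool, std::uint64_t> getMantissaMemcpy(double value)
{
    std::uint64_t binary;
    std::memcpy(&binary, &value, sizeof binary);   // byte-wise copy of the representation
    unsigned exponent = static_cast<unsigned>(binary >> 52) & 0x7FF;
    if (exponent == 0x7FF)
        return std::pair<bool, std::uint64_t>(false, 0);
    std::uint64_t hiBit = static_cast<std::uint64_t>(exponent != 0) << 52;
    return std::pair<bool, std::uint64_t>(true, (binary & 0xFFFFFFFFFFFFFu) | hiBit);
}
```

Compilers typically turn the memcpy into a single register move, so this costs nothing over the union version.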
On 01/10/2021 18:18, Bonita Montero wrote:
> On 01.10.2021 at 17:37, wij wrote:
>> typedef long double FType;
>> FType x=numeric_limits<FType>::max();
>> [...]
> Take this:
> [...]
> mantissa_pair getMantissa( double value );
What happened to *long* double?
> // numeric_limits<long double>::digits=64.
> //
> typedef long double FType;
> FType x=numeric_limits<FType>::max();
> int iexp;
> int64_t mint;
> x=::frexpl(x,&iexp);
> x=::ldexpl(x,numeric_limits<FType>::digits);
> mint= static_cast<int64_t>(x);
> Result (mint) is a negative number, something not right!!!
long double max = numeric_limits<long double>::max();
long mantissa = max; // implicit conversion
On 10/1/21 12:09 PM, Radica...@theburrow.co.uk wrote:
> On Fri, 1 Oct 2021 08:37:21 -0700 (PDT)
> wij <wyn...@gmail.com> wrote:
>> // numeric_limits<long double>::digits=64.
>> //
>> typedef long double FType;
>> FType x=numeric_limits<FType>::max();
>> int iexp;
>> int64_t mint;
>> x=::frexpl(x,&iexp);
>> x=::ldexpl(x,numeric_limits<FType>::digits);
>> mint= static_cast<int64_t>(x);
>> Result (mint) is a negative number, something not right!!!
I can't duplicate this problem: I get mint:9223372036854775807.
> It might help if you set iexp to some value.
frexpl() is from the part of the C++ standard library that is copied
from the C standard library.
Section 7.12.6.7p1 of the C standard says:
    long double frexpl(long double value, int *p);
The following paragraph says:
    The frexp functions break a floating-point number into a normalized
    fraction and an integer exponent. They store the integer in the int
    object pointed to by p.
So iexp should be set. When I ran the code, it got set to a value of
16384. That's not the problem.
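The point that frexp writes the exponent through its pointer argument can be checked with a small sketch. The struct `Parts` and the function `split` are names invented for this illustration; the exponent is deliberately left uninitialized before the call.

```cpp
#include <cmath>

// Hypothetical helper type: the two results of frexp bundled together.
struct Parts
{
    long double fraction;  // magnitude in [1/2, 1), or 0 for a zero input
    int exponent;          // value == fraction * 2^exponent
};

Parts split(long double value)
{
    int iexp;                                        // no initial value needed
    long double frac = std::frexp(value, &iexp);     // frexp stores into iexp
    Parts p = { frac, iexp };
    return p;
}
```

For the largest finite value the stored exponent equals numeric_limits<long double>::max_exponent, which is 16384 on implementations that use the x87 extended format and so matches the value reported above.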
On 10/1/21 2:53 PM, Branimir Maksimovic wrote:
> It is not undefined as long is larger than mantissa part.
...
> long double max = numeric_limits<long double>::max();
> long mantissa = max; // implicit conversion
"A prvalue of a floating-point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part is
discarded. The behavior is undefined if the truncated value cannot be
represented in the destination type." (7.3.10p1).
While it's not required to be the case, on most implementations
std::numeric_limits<long double>::max() is WAY too large to be
represented by a long, so the behavior of such code is undefined. The
minimum value of LDBL_MAX (set by the C standard, inherited by the C++
standard) is 1e37, which would require long to have at least 123 bits in
order for that conversion to have defined behavior.
And even when it has defined behavior, I can't imagine how you would
reach the conclusion that this conversion should be the value of the
mantissa.
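One way to stay clear of the undefined conversion described above is to test the truncated value against the destination range before casting. A minimal sketch, assuming a binary floating-point `long double`; `toInt64Checked` is a hypothetical helper, not something from the thread.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: converts only when the truncated value is
// representable in int64_t, and reports failure otherwise (including NaN).
bool toInt64Checked(long double x, std::int64_t& out)
{
    const long double limit = std::ldexp(1.0L, 63);   // 2^63, exact in binary fp
    const long double t = std::trunc(x);              // what the cast would keep
    if (!(t >= -limit && t < limit))                   // false for NaN as well
        return false;
    out = static_cast<std::int64_t>(t);
    return true;
}
```

With this check, numeric_limits<long double>::max() is rejected instead of producing whatever the hardware happens to return.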
On 2021-10-02, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
> On 10/1/21 2:53 PM, Branimir Maksimovic wrote:
>> long double max = numeric_limits<long double>::max();
>> long mantissa = max; // implicit conversion
> [...]
> While it's not required to be the case, on most implementations
> std::numeric_limits<long double>::max() is WAY too large to be
> represented by a long, so the behavior of such code is undefined.
> [...]
ok correct is long long :P
...
On 02.10.2021., at 07:27, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
> On 10/1/21 10:16 PM, Branimir Maksimovic wrote:
>> ok correct is long long :P
> On my system, changing it to long long doesn't make any difference - the
> maximum value representable by long long is still 9223372036854775807,
> the same as the maximum value for long; it's still far too small to
> represent 1.18973e+4932, so the behavior of the conversion is undefined.
> The actual behavior on my system appears to be saturating at LLONG_MAX
> == 9223372036854775807. If I change the second line to
>     long long mantissa = 0.75*max;
> 0.75*max is 8.92299e+4931, which should certainly not have the same
> mantissa as max itself, but the value loaded into "mantissa" is still
> 9223372036854775807.
Problem is that neither long double nor long is defined how small
or large can be…
so it can fit or not…
but question is how to extract mantissa which was answer :P
The C++ standard cross-references the C standard for such purposes, and
I accidentally sent this message first to Branimir by e-mail, and he
> Because mantissa is whole part of floating point number?
No, that is not the answer. If max did have a value small enough to make
the conversion to long long have defined behavior, the result of that
conversion would be the truncated value itself (7.3.10p1), NOT the
mantissa of the truncated value. What makes you think otherwise?
On Saturday, 2 October 2021 at 02:05:56 UTC+8, james...@alumni.caltech.edu wrote:
> On 10/1/21 12:09 PM, Radica...@theburrow.co.uk wrote:
>> On Fri, 1 Oct 2021 08:37:21 -0700 (PDT)
>> wij <wyn...@gmail.com> wrote:
>>> x=::frexpl(x,&iexp);
>>> x=::ldexpl(x,numeric_limits<FType>::digits);
>>> mint= static_cast<int64_t>(x);
>>> Result (mint) is a negative number, something not right!!!
> I can't duplicate this problem: I get mint:9223372036854775807.
> <snip>
// ----- file t.cpp -----
#include <math.h>
#include <cstdint>
#include <limits>
#include <iostream>
using namespace std;
#define ENDL endl
template<typename T>
int64_t get_mant(T x) {
 int iexp;
 x=frexp(x,&iexp);
 x=ldexp(x,numeric_limits<T>::digits);
 return static_cast<int64_t>(x);
}
int main()
{
 cout << dec << get_mant(numeric_limits<float>::max()) << ", "
      << hex << get_mant(numeric_limits<float>::max()) << ENDL;
 cout << dec << get_mant(numeric_limits<double>::max()) << ", "
      << hex << get_mant(numeric_limits<double>::max()) << ENDL;
 cout << dec << get_mant(numeric_limits<long double>::max()) << ", "
      << hex << get_mant(numeric_limits<long double>::max()) << ENDL;
 return 0;
}
// end file t.cpp -----
$ g++ t.cpp
]$ ./a.out
16777215, ffffff
9007199254740991, 1fffffffffffff
-9223372036854775808, 8000000000000000
On 10/2/2021 2:23 AM, wij wrote:
> template<typename T>
> int64_t get_mant(T x) {
>  int iexp;
>  x=frexp(x,&iexp);
>  x=ldexp(x,numeric_limits<T>::digits);
>  return static_cast<int64_t>(x);
> };
> [...]
> $ g++ t.cpp
> ]$ ./a.out
> 16777215, ffffff
> 9007199254740991, 1fffffffffffff
> -9223372036854775808, 8000000000000000
There are two problems:
1)
The templated function uses 'frexp' and 'ldexp', which both take double
arguments (not *long* double), hence UB occurs at those calls for the
'long double' type whenever this type is actually larger than 'double'.
2)
On my Linux box numeric_limits<*long* double>::digits is 64
(numeric_limits<double>::digits is 53), so a full 64-bit mantissa does
not fit in the 63 value bits of int64_t and the static_cast<int64_t>(x)
yields UB again.
=========
#include <cmath>
#include <cstdint>
#include <limits>
#include <iostream>
using namespace std;
#define ENDL endl
// One function per type, so the frexp/ldexp overload matching the argument
// type is selected; uint64_t has room for a full 64-bit mantissa.
uint64_t get_mantf(float x)
{
 int iexp;
 x=frexp(x,&iexp);
 x=ldexp(x,numeric_limits<float>::digits);
 return static_cast<uint64_t>(x);
}
uint64_t get_mant(double x)
{
 int iexp;
 x=frexp(x,&iexp);
 x=ldexp(x,numeric_limits<double>::digits);
 return static_cast<uint64_t>(x);
}
uint64_t get_mantl(long double x)
{
 int iexp;
 x=frexp(x,&iexp);
 x=ldexp(x,numeric_limits<long double>::digits);
 return static_cast<uint64_t>(x);
}
int main()
{
 cout << dec << numeric_limits<float>::digits << ", " << get_mantf(numeric_limits<float>::max()) << ", "
   << hex << get_mantf(numeric_limits<float>::max()) << ENDL;
 cout << dec << numeric_limits<double>::digits << ", " << get_mant(numeric_limits<double>::max()) << ", "
   << hex << get_mant(numeric_limits<double>::max()) << ENDL;
 cout << dec << numeric_limits<long double>::digits << ", " << get_mantl(numeric_limits<long double>::max()) << ", "
   << hex << get_mantl(numeric_limits<long double>::max()) << ENDL;
 return 0;
}
===================
$ c++ -std=c++11 -O2 -Wall mant.cc && ./a.out
24, 16777215, ffffff
53, 9007199254740991, 1fffffffffffff
64, 18446744073709551615, ffffffffffffffff
On 1 Oct 2021 17:37, wij wrote:
> // numeric_limits<long double>::digits=64.
> //
> typedef long double FType;
> FType x=numeric_limits<FType>::max();
> int iexp;
> int64_t mint;
> x=::frexpl(x,&iexp);
> x=::ldexpl(x,numeric_limits<FType>::digits);
> mint= static_cast<int64_t>(x);
> Result (mint) is a negative number, something not right!!!
Apparently you're trying to obtain the bits of the mantissa of a `long
double` number, represented as an `int64_t` value.
A `long double` is in practice either IEEE 754 64-bit, or IEEE 754
80-bit. In Windows that choice depends on the compiler. With Visual C++
(and hence probably also Intel) it's 64-bit, same as type `double`,
while with MinGW g++ (and hence probably also clang) it's 80-bit,
originally the x86-family's math coprocessor's extended format. For
80-bit IEEE 754 the mantissa part is 64 bits.
With a 64-bit mantissa there is a high chance of setting the sign bit of
an `int64_t` to 1, resulting in a negative value. I believe that that
will only /not/ happen for a denormal value, but, I'm tired and might be
wrong about that. Anyway, instead use unsigned types for bit handling.
For example, in this case, use `uint64_t`.
However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
use `memcpy`.
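A rough sketch of what the memcpy route could look like for the 80-bit case follows. It assumes the x87 extended format stored little-endian (as with g++ and clang on x86 and x86-64), where the low 8 bytes of the object hold the 64-bit significand including its explicit integer bit. The function name `significandX87` is invented for this illustration, and the byte order itself is an assumption the code cannot verify.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper. Returns false when long double does not look like the
// x87 80-bit extended format; little-endian storage is assumed, not checked.
bool significandX87(long double value, std::uint64_t& out)
{
    typedef std::numeric_limits<long double> lim;
    if (!(lim::is_iec559 && lim::digits == 64 && lim::max_exponent == 16384
          && sizeof(long double) >= 10))
        return false;
    // Low 8 bytes: the 64-bit significand, explicit leading bit included.
    std::memcpy(&out, &value, sizeof out);
    return true;
}
```

On a platform where `long double` is the same as `double`, or is a 128-bit format, the function simply reports failure rather than guessing at the layout.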
On 02.10.2021., at 20:51, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
> int main(void)
> {
>    typedef long double FType;
>    FType max= std::numeric_limits<FType>::max();
>    FType large = 0xA.BCDEF0123456789p+16380L;
>    FType middling = 0xABCDEF01.23456789p0L;
>    long long maxll = max;
>    long long largell = large;
>    long long middlingll = middling;
>    std::cout << std::fixed << "max: " << max << std::endl;
>    std::cout << "large: " << large << std::endl;
>    std::cout << "middling: " << middling << std::endl;
>    std::cout << "maxll: " << maxll << std::endl;
>    std::cout << "largell: " << largell << std::endl;
>    std::cout << "middlingll: " << middlingll << std::endl;
>    std::cout << std::hexfloat << std::showbase << std::showpoint <<
>        std::hex << "max: " << max << std::endl;
>    std::cout << "large: " << large << std::endl;
>    std::cout << "middling: " << middling << std::endl;
>    std::cout << "maxll: " << maxll << std::endl;
>    std::cout << "largell: " << largell << std::endl;
>    std::cout << "middlingll: " << middlingll << std::endl;
> }
mantissa.cpp:7:18: warning: magnitude of floating-point constant too
large for type 'long double'; maximum is 1.7976931348623157E+308 [-Wliteral-range]
Sorry your program is not correct..
This is output on my system:
bmaxa@Branimirs-Air projects % ./a.out
max: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
large: inf
middling: 2882400001.137778
maxll: 9223372036854775807
largell: 9223372036854775807
middlingll: 2882400001
max: 0x1.fffffffffffffp+1023
large: inf
middling: 0x1.579bde02468adp+31
maxll: 0x7fffffffffffffff
largell: 0x7fffffffffffffff
middlingll: 0xabcdef01
Greetings, Branimir.
On 10/2/21 3:00 PM, Alf P. Steinbach wrote:
> On 1 Oct 2021 17:37, wij wrote:
>> x=::frexpl(x,&iexp);
>> x=::ldexpl(x,numeric_limits<FType>::digits);
>> mint= static_cast<int64_t>(x);
>> Result (mint) is a negative number, something not right!!!
> [...]
> With 64-bits mantissa there is a high chance of setting the sign bit of
> an `int64_t` to 1, resulting in a negative value. I believe that that
> will only /not/ happen for a denormal value, but, I'm tired and might be
> wrong about that. Anyway, instead use unsigned types for bit handling.
> For example, in this case, use `uint64_t`.
It would be safer and more portable to use FType; there's no portable
guarantee that any integer type is large enough to hold the mantissa,
but FType is.
> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
> use `memcpy`.
The advantage of the code as written is that (if you change mint to have
FType) it will give the correct result even if your assumption about
IEEE 754 is false; that's not the case with memcpy().
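The suggestion to keep the result in FType can be sketched as below. This is an illustration only, assuming a radix-2 floating-point type; `mantissaAsFType` is a name made up here. The scaled mantissa is an integer value, but it is stored in the floating-point type, so no integer conversion (and hence no overflow) is involved.

```cpp
#include <cmath>
#include <limits>

typedef long double FType;

// Hypothetical helper: the mantissa as an integer-valued FType.
FType mantissaAsFType(FType x)
{
    int iexp;
    FType frac = std::frexp(x, &iexp);   // magnitude in [1/2, 1), or 0
    // Scaling by 2^digits only changes the exponent, so it is exact.
    return std::ldexp(frac, std::numeric_limits<FType>::digits);
}
```

For numeric_limits<FType>::max() the result is 2^digits - 1, whatever the value of digits happens to be on the implementation.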
On 2 Oct 2021 21:12, James Kuyper wrote:
> On 10/2/21 3:00 PM, Alf P. Steinbach wrote:
>> [...]
>> With 64-bits mantissa there is a high chance of setting the sign bit of
>> an `int64_t` to 1, resulting in a negative value. I believe that that
>> will only /not/ happen for a denormal value, but, I'm tired and might be
>> wrong about that. Anyway, instead use unsigned types for bit handling.
>> For example, in this case, use `uint64_t`.
> It would be safer and more portable to use FType; there's no portable
> guarantee that any integer type is large enough to hold the mantissa,
> but FType is.
I believe you intended to write `uintptr_t`, not `FType`.
If so, agreed.
It's late in the day for me, sorry.
>> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
>> use `memcpy`.
> The advantage of the code as written is that (if you change mint to have
> FType) it will give the correct result even if your assumption about
> IEEE 754 is false; that's not the case with memcpy().
Uhm, I'd rather assert IEEE 754 representation
(numeric_limits::is_iec559). Dealing with the bits of just about any
representation seems to me a hopelessly daunting task. :-o
- Alf
On 2 Oct 2021 21:23, Alf P. Steinbach wrote:
> On 2 Oct 2021 21:12, James Kuyper wrote:
>> On 10/2/21 3:00 PM, Alf P. Steinbach wrote:
>>> [...]
>>> Anyway, instead use unsigned types for bit handling.
>>> For example, in this case, use `uint64_t`.
>> It would be safer and more portable to use FType; there's no portable
>> guarantee that any integer type is large enough to hold the mantissa,
>> but FType is.
> I believe you intended to write `uintptr_t`, not `FType`.
> If so, agreed.
> It's late in the day for me, sorry.
It's /very/ late.
There AFAIK is no suitable type name for the integer type with
sufficient bits to represent the mantissa, or generally >N bits.
Unfortunately the standard library doesn't provide a mapping from number
of bits as a value, to integer type with that many bits. It can be done,
on the assumption that all types in `<stdint.h>` are present. And
perhaps one can then define a name like `FType` in terms of that mapping.
>>> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
>>> use `memcpy`.
>> The advantage of the code as written is that (if you change mint to have
>> FType) it will give the correct result even if your assumption about
>> IEEE 754 is false; that's not the case with memcpy().
> Uhm, I'd rather assert IEEE 754 representation
> (numeric_limits::is_iec559). Dealing with the bits of just about any
> representation seems to me a hopelessly daunting task. :-o
- Alf
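A mapping from a bit count to an integer type, of the kind described in this post, might be sketched as follows. The trait name `uint_for_bits` is hypothetical, and this version uses the `uint_leastN_t` types (which every implementation must provide) rather than the exact-width ones. It stops at 64 bits, so it cannot cover a format with a wider mantissa.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical trait: smallest least-width unsigned type with >= N bits.
template <unsigned N>
struct uint_for_bits
{
    static_assert(N <= 64, "no standard integer type is guaranteed beyond 64 bits");
    typedef typename std::conditional<(N <= 8),  std::uint_least8_t,
            typename std::conditional<(N <= 16), std::uint_least16_t,
            typename std::conditional<(N <= 32), std::uint_least32_t,
                                                 std::uint_least64_t
            >::type>::type>::type type;
};

// Example use: an integer type wide enough for the mantissa of double.
typedef uint_for_bits<std::numeric_limits<double>::digits>::type DoubleMantissa;
```

Instantiating the trait with numeric_limits<long double>::digits compiles only where that value is at most 64, which turns the portability limit discussed in the thread into a compile-time error.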
On 10/2/21 2:59 PM, Branimir Maksimovic wrote:
>> On 02.10.2021., at 20:51, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
>>    FType large = 0xA.BCDEF0123456789p+16380L;
>> [...]
> mantissa.cpp:7:18: warning: magnitude of floating-point constant too
> large for type 'long double'; maximum is 1.7976931348623157E+308
> [-Wliteral-range]
> Sorry your program is not correct..
The value of "large" was chosen to be almost as large as "max" on my
machine. It's apparently larger than LDBL_MAX on your system. A more
portable initializer would be
   FType large = 0x0.ABCDEF0123456789p0L * max;
Depending upon the value of LDBL_EPSILON on the target implementation,
that definition might result in the mantissa of "large" having fewer
significant digits than it has on mine, but that's not particularly
important.
On 1 Oct 2021 17:37, wij wrote:
> // numeric_limits<long double>::digits=64.
> //
> typedef long double FType;
> FType x=numeric_limits<FType>::max();
> int iexp;
> int64_t mint;
> x=::frexpl(x,&iexp);
> x=::ldexpl(x,numeric_limits<FType>::digits);
> mint= static_cast<int64_t>(x);
> Result (mint) is a negative number, something not right!!!
[...]
However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
use `memcpy`.
Due to the silliness in gcc regarding the standard's "strict aliasing"
rule, I'd not use a reinterpretation pointer cast.
- Alf
Jesus. And I think this still doesn't do an actual long double!
So, this should be the most elegant code:
#pragma once
#include <limits>
#include <cstdint>
#include <cassert>
struct dbl_parts
{
    static_assert(std::numeric_limits<double>::is_iec559, "must be standard fp");
    dbl_parts( double d );
    dbl_parts &operator =( double d );
    dbl_parts() = default;
    operator double();
    bool getSign();
    std::uint16_t getBiasedExponent();
    std::int16_t getExponent();
    std::uint64_t getMantissa();
    void setSign( bool sign );
    void setBiasedExponent( std::uint16_t exp );
    void setExponent( std::int16_t exp );
    void setMantissa( std::uint64_t mantissa );
private:
    static unsigned const
       MANTISSA_BITS = 52;
    using i64 = std::int64_t;
    using ui64 = std::uint64_t;
    static ui64 const
       SIGN_MASK = (ui64)1 << 63,
       EXP_MASK = (ui64)0x7FF << MANTISSA_BITS,
       MANTISSA_MASK = ~-((i64)1 << MANTISSA_BITS),
       MANTISSA_MAX = ((ui64)1 << MANTISSA_BITS) | MANTISSA_MASK;
    using ui16 = std::uint16_t;
    using i16 = std::int16_t;
    static ui16 const
       BEXP_DENORMAL = 0,
       BEXP_BASE = 0x3FF,
       BEXP_MAX = 0x7FF;
    static i16 const
       EXP_MIN = 0 - BEXP_BASE,
       EXP_MAX = BEXP_MAX - BEXP_BASE;
    union
    {
       double value;
       ui64 binary;
    };
};
inline
dbl_parts::dbl_parts( double d ) :
    value( d )
{
}
inline
dbl_parts &dbl_parts::operator =( double d )
{
    value = d;
    return *this;
}
inline
dbl_parts::operator double()
{
    return value;
}
inline
bool dbl_parts::getSign()
{
    return (i64)binary < 0;
}
inline
std::uint16_t dbl_parts::getBiasedExponent()
{
    return (ui16)(binary >> MANTISSA_BITS) & BEXP_MAX;
}
inline
std::int16_t dbl_parts::getExponent()
{
    return (i16)(getBiasedExponent() - BEXP_BASE);
}
inline
std::uint64_t dbl_parts::getMantissa()
{
    ui16 bExp = getBiasedExponent();
    ui64 hiBit = (ui64)(bExp && bExp != BEXP_MAX) << MANTISSA_BITS;
    return binary & MANTISSA_MASK | hiBit;
}
inline
void dbl_parts::setSign( bool sign )
{
    binary = binary & ~SIGN_MASK | (ui64)sign << 63;
}
inline
void dbl_parts::setBiasedExponent( std::uint16_t exp )
{
    assert(exp <= BEXP_MAX);
    binary = binary & (SIGN_MASK | MANTISSA_MASK) | (ui64)exp << MANTISSA_BITS;
}
inline
void dbl_parts::setExponent( std::int16_t exp )
{
    exp += BEXP_BASE;
    setBiasedExponent( exp );
}
inline
void dbl_parts::setMantissa( std::uint64_t mantissa )
{
#if !defined(NDEBUG)
    ui64 mantissaMax = MANTISSA_MASK | (ui64)(getBiasedExponent() != BEXP_DENORMAL && getBiasedExponent() != BEXP_MAX) << MANTISSA_BITS;
    assert(mantissa <= mantissaMax);
#endif
    binary = binary & (SIGN_MASK | EXP_MASK) | mantissa & MANTISSA_MASK;
}
On 03.10.2021 at 12:15, Bart wrote:
> Jesus. And I think this still doesn't do an actual long double!
long double isn't supported by many compilers for x86-64.
long double should be avoided when possible because loads
and stores are slow with long double.
I've not been following the details of the thread but it seems to have strayed from the original question (as happens of course).
wij <wyn...@gmail.com> writes:
// numeric_limits<long double>::digits=64.
//
typedef long double FType;
FType x=numeric_limits<FType>::max();
int iexp;
int64_t mint;
x=::frexpl(x,&iexp);
x=::ldexpl(x,numeric_limits<FType>::digits);
mint= static_cast<int64_t>(x);
Result (mint) is a negative number, something not right!!!
frexp (et al.) return a signed result with a magnitude in the interval
[1/2, 1). If you want the 64 most significant digits of the mantissa of
a long double (it may have many more digits than that) then I'd go for
something like this:
uint_least64_t ms64bits(long double d)
{
    int exp;
    return (uint_least64_t)std::scalbn(std::fabs(std::frexp(d, &exp)), 64);
}
If, as your code sketch suggests, you want all of them, then you are out
of luck as far as portable code goes because there may not be an integer
type wide enough.
But why do you want an integer value? For most mathematical uses the
result of std::frexp is what you really want.
--
Ben.
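The original snippet goes negative because, with 64 significand digits, the scaled value can reach 2^64 - 1, which is larger than INT64_MAX; converting an out-of-range floating-point value to an integer type is undefined behaviour in C++. A minimal self-contained sketch of the unsigned approach described above follows. The includes, the finiteness assertion and the comments are additions for illustration, not part of the quoted post.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Top 64 significand bits of d via frexp/scalbn, as in the function quoted above.
// Assumption of this sketch: d is finite. The sign is dropped by fabs.
inline std::uint_least64_t ms64bits(long double d)
{
    assert(std::isfinite(d));
    int exp = 0;
    const long double frac = std::fabs(std::frexp(d, &exp)); // 0, or in [1/2, 1)
    // frac * 2^64 lies in [2^63, 2^64) for non-zero d, so it fits an unsigned
    // 64-bit type but can exceed INT64_MAX.
    return static_cast<std::uint_least64_t>(std::scalbn(frac, 64));
}
```

For example, ms64bits(0.5L) gives 0x8000000000000000 and ms64bits(3.0L) gives 0xC000000000000000, because 3.0 is 0.75 * 2^2.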
On 2021-10-02, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
On 10/2/21 2:59 PM, Branimir Maksimovic wrote:
On 02.10.2021., at 20:51, James Kuyper
<jameskuyper@alumni.caltech.edu
<mailto:jameskuyper@alumni.caltech.edu>> wrote:
#include <iostream>
#include <limits>
int main(void)
{
   typedef long double FType;
   FType max= std::numeric_limits<FType>::max();
   FType large = 0xA.BCDEF0123456789p+16380L;
   FType middling = 0xABCDEF01.23456789p0L;
   long long maxll = max;
   long long largell = large;
   long long middlingll = middling;
   std::cout << std::fixed << "max: " << max << std::endl;
   std::cout << "large: " << large << std::endl;
   std::cout << "middling: " << middling << std::endl;
   std::cout << "maxll: " << maxll << std::endl;
   std::cout << "largell: " << largell << std::endl;
   std::cout << "middlingll: " << middlingll << std::endl;
   std::cout << std::hexfloat << std::showbase << std::showpoint <<
       std::hex << "max: " << max << std::endl;
   std::cout << "large: " << large << std::endl;
   std::cout << "middling: " << middling << std::endl;
   std::cout << "maxll: " << maxll << std::endl;
   std::cout << "largell: " << largell << std::endl;
   std::cout << "middlingll: " << middlingll << std::endl;
}
mantissa.cpp:7:18: warning: magnitude of floating-point constant too
large for type 'long double'; maximum is 1.7976931348623157E+308
[-Wliteral-range]
Sorry your program is not correct..
The value of "large" was chosen to be almost as large as "max" on my
machine. It's apparently larger than LDBL_MAX on your system. A more
portable initializer would be
   FType large = 0x0.ABCDEF0123456789p0L * max;
Depending upon the value of LDBL_EPSILON on the target implementation,
that definition might result in the mantissa of "large" having fewer
significant digits than it has on mine, but that's not particularly
important.
Now outputs:
bmaxa@Branimirs-Air projects % ./a.out
max: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
large: 120645172288001372126619919515947355653853135095820093742269387851763173210613194516902557021349374744036403756647462602101944733127725809392434319981526284828181273773403605513307431121904981243948320057805774787281290926848565000291282225047507294705767371139696722108291516984908752371556782562287095382016.000000
middling: 2882400001.137778
maxll: 9223372036854775807
largell: 9223372036854775807
middlingll: 2882400001
max: 0x1.fffffffffffffp+1023
large: 0x1.579bde02468acp+1023
middling: 0x1.579bde02468adp+31
maxll: 0x7fffffffffffffff
largell: 0x7fffffffffffffff
middlingll: 0xabcdef01
On 2021-10-02, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
On 10/2/21 3:00 PM, Alf P. Steinbach wrote:
On 1 Oct 2021 17:37, wij wrote:
// numeric_limits<long double>::digits=64.
//
typedef long double FType;
FType x=numeric_limits<FType>::max();
int iexp;
int64_t mint;
x=::frexpl(x,&iexp);
x=::ldexpl(x,numeric_limits<FType>::digits);
mint= static_cast<int64_t>(x);
Result (mint) is a negative number, something not right!!!
Apparently you're trying to obtain the bits of the mantissa of a `long
double` number, represented as an `int64_t` value.
A `long double` is in practice either IEEE 754 64-bit, or IEEE 754
80-bit. In Windows that choice depends on the compiler. With Visual C++
(and hence probably also Intel) it's 64-bit, same as type `double`,
while with MinGW g++ (and hence probably also clang) it's 80-bit,
originally the x86-family's math coprocessor's extended format. For
80-bit IEEE 754 the mantissa part is 64 bits.
With 64-bits mantissa there is a high chance of setting the sign bit of
an `int64_t` to 1, resulting in a negative value. I believe that that
will only /not/ happen for a denormal value, but, I'm tired and might be
wrong about that. Anyway, instead use unsigned types for bit handling.
For example, in this case, use `uint64_t`.
It would be safer and more portable to use FType; there's no portable
guarantee that any integer type is large enough to hold the mantissa,
but FType is.
How so if FType is just typedef?
b - base or radix of exponent representation (an integer > 1)
e - exponent (an integer between a minimum e_min and a maximum e_max )
p - precision (the number of base-b digits in the significand)
On 2 Oct 2021 21:12, James Kuyper wrote:...
It would be safer and more portable to use FType; there's no portable
guarantee that any integer type is large enough to hold the mantissa,
but FType is.
I believe you intended to write `uintptr_t`, not `FType`.
However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
use `memcpy`.
The advantage of the code as written is that (if you change mint to have
FType) it will give the correct result even if your assumption about
IEEE 754 is false; that's not the case with memcpy().
Uhm, I'd rather assert IEEE 754 representation
(numeric_limits::is_iec559). Dealing with the bits of just about any representation seems to me a hopelessly daunting task. :-o
Do you now concede that your approach is not guaranteed to result in a
long long value containing the mantissa?
of course, you convinced me.
Note that b, e_min, e_max, and p are constants for any specific floating point type.
Great!!!
In terms of that model, the value of the significand of a floating point value x, interpreted as an integer, can be represented in the same
floating point type by a number with exactly the same significand, and e
= p.
The key issue is whether such a representation is allowed, and it turns
out that there can be floating point representations which fit the C standard's model, for which e_max < p, preventing some significands from
being representable in such a type.
The macro LDBL_MAX (corresponding to std::numeric_limits<long
double>::max()) is defined as expanding to the value of
(1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for
example, e_max == p-1 and b=2, then this means that for such a type, p
must be at least 124.
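The bound of 124 can be checked by hand. The following worked step uses only the definition of LDBL_MAX given above, with the stated assumptions b = 2 and e_max = p - 1:

```latex
\begin{align*}
\mathrm{LDBL\_MAX} &= (1 - b^{-p})\, b^{e_{\max}}
                    = (1 - 2^{-p})\, 2^{p-1}
                    = 2^{p-1} - \tfrac{1}{2}, \\
2^{p-1} - \tfrac{1}{2} \ge 10^{37}
  &\iff p - 1 \ge 123
  \iff p \ge 124 .
\end{align*}
```

The middle step holds because 2^122 is about 5.3e36, which is below 1e37, while 2^123 is about 1.06e37, which is above it.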
Every floating point format I'm familiar with has an e_max value much
larger than p, so I think, as a practical matter, that it's safe to
assume that significands can be represented in the same floating point
type, but strictly speaking, it's not guaranteed.
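The condition e_max >= p discussed above can be tested at compile time. This is a minimal sketch that assumes std::numeric_limits<T>::max_exponent and ::digits correspond to e_max and p of the model; the function name significand_fits_in_type is hypothetical and not from the thread.

```cpp
#include <limits>

// True when an integer-valued significand (up to radix^digits - 1) still has
// an exponent in range for T, i.e. when e_max >= p in the model above.
template <typename T>
constexpr bool significand_fits_in_type()
{
    return std::numeric_limits<T>::max_exponent >= std::numeric_limits<T>::digits;
}
```

A static_assert on significand_fits_in_type<FType>() would turn the "safe in practice, not guaranteed" assumption into a build-time check.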
Am 03.10.2021 um 12:15 schrieb Bart:
Jesus. And I think this still doesn't do an actual long double!
long double isn't supported by many compilers for x86-64.
long double should be avoided when possible because loads
and stores are slow with long double.
What you probably mean is that on some targets, "long double" is the
same size as "double". That's true for most 32-bit (and smaller)
targets. Most 64-bit targets support larger "long double".
"long double" is supported by any C++ or C compiler, for any target,
that tries to come close to any current standards compliance.
On 03/10/2021 12:19, Bonita Montero wrote:
Am 03.10.2021 um 12:15 schrieb Bart:
Jesus. And I think this still doesn't do an actual long double!
long double isn't supported by many compilers for x86-64.
long double should be avoided when possible because loads
and stores are slow with long double.
"long double" is supported by any C++ or C compiler, for any target,
that tries to come close to any current standards compliance.
What you probably mean is that on some targets, "long double" is the
same size as "double". That's true for most 32-bit (and smaller)
targets. Most 64-bit targets support larger "long double".
As happens so often, there is /one/ major exception to the common
practices used by (AFAICS) every other OS, every other processor, every
other compiler manufacturer - on Windows, and with MSVC, "long double"
is 64-bit.
On 03/10/2021 12:19, Bonita Montero wrote:
"long double" should not be used where "double" will do, because it can
be a great deal slower on many platforms (the load and saves are
irrelevant). You also have to question whether "long double" does what
you want, on any given target. On smaller targets (or more limited
compilers, like MSVC), it gives you no benefits in accuracy or range
compared to "double". On some, such as x86-64 with decent tools, it
generally gives you 80-bit types. On others - almost any other 64-bit
system - it gives you 128-bit quad double, but it is likely to be
implemented in software rather than hardware.
ARMv8 has 128-bit floating point, in hardware.
FWIW, it also has 512 to 2048 bit floating point,
in hardware, when the SVE extension is implemented.
On 2021-10-04, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
...
Great!!!
In terms of that model, the value of the significand of a floating point
value x, interpreted as an integer, can be represented in the same
floating point type by a number with exactly the same significand, and e
= p.
The key issue is whether such a representation is allowed, and it turns
out that there can be floating point representations which fit the C
standard's model, for which e_max < p, preventing some significands from
being representable in such a type.
The macro LDBL_MAX (corresponding to std::numeric_limits<long
double>::max()) is defined as expanding to the value of
(1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for
example, e_max == p-1 and b=2, then this means that for such a type, p
must be at least 124.
Every floating point format I'm familiar with has an e_max value much
larger than p, so I think, as a practical matter, that it's safe to
assume that significands can be represented in the same floating point
type, but strictly speaking, it's not guaranteed.
So instead of int we use float type and truncate?
Am 04.10.2021 um 15:06 schrieb Bart:
  DMC          10
  lccwin       16
Irrelevant.
Irrelevant to whom?
You're half right. If you want the mantissa (== significand), you should
not truncate - that might lose you the parts of the significand that
represent the fractional part of the number. The OP had it right at the
second to last step in his original code:
x = std::ldexp(x, numeric_limits<FType>::digits);
At this point, x already contains the significand; there's no further
need to convert it to an integer type - in fact, in most contexts you'd
want it in floating point format for later steps in the processing.
Thanks for the explanation.
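The idea of leaving the significand in the floating-point type can be sketched as below. This assumes a binary format (radix 2) and a finite input; the function name significand_as_float and the adjusted-exponent out-parameter are hypothetical additions, not taken from the thread.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Integer-valued significand of x, kept in the same floating-point type.
// On return, x == result * 2^exponent. Assumes x is finite and radix == 2.
template <typename FType>
FType significand_as_float(FType x, int& exponent)
{
    static_assert(std::numeric_limits<FType>::radix == 2,
                  "sketch assumes a binary floating-point format");
    assert(std::isfinite(x));
    const FType frac = std::frexp(x, &exponent);    // |frac| is 0 or in [1/2, 1)
    exponent -= std::numeric_limits<FType>::digits; // compensate for the scaling below
    return std::ldexp(frac, std::numeric_limits<FType>::digits); // sign preserved
}
```

No integer type is involved, so the result is the same whether long double has 53, 64 or 113 significand digits, and std::ldexp(result, exponent) reproduces x.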
David Brown <david.brown@hesbynett.no> writes:
On 03/10/2021 12:19, Bonita Montero wrote:
"long double" should not be used where "double" will do, because it can
be a great deal slower on many platforms (the load and saves are
irrelevant). You also have to question whether "long double" does what
you want, on any given target. On smaller targets (or more limited
compilers, like MSVC), it gives you no benefits in accuracy or range
compared to "double". On some, such as x86-64 with decent tools, it
generally gives you 80-bit types. On others - almost any other 64-bit
system - it gives you 128-bit quad double, but it is likely to be
implemented in software rather than hardware.
ARMv8 has 128-bit floating point, in hardware.
                                ^ registers
Any code running on x86 can choose to use x87 instructions for floating point.
Am 04.10.2021 um 09:52 schrieb David Brown:
"long double" is supported by any C++ or C compiler, for any target,
that tries to come close to any current standards compliance.
Most compilers map long double to IEEE-754 double precision and not
extended precision. Even with Intel C++ you must supply a compiler
switch to get long double as extended precision.
What you probably mean is that on some targets, "long double" is the
same size as "double". That's true for most 32-bit (and smaller)
targets. Most 64-bit targets support larger "long double".
No, most don't.
scott@slp53.sl.home (Scott Lurndal) writes:
David Brown <david.brown@hesbynett.no> writes:
On 03/10/2021 12:19, Bonita Montero wrote:
"long double" should not be used where "double" will do, because it can
be a great deal slower on many platforms (the load and saves are
irrelevant). You also have to question whether "long double" does what
you want, on any given target. On smaller targets (or more limited
compilers, like MSVC), it gives you no benefits in accuracy or range
compared to "double". On some, such as x86-64 with decent tools, it
generally gives you 80-bit types. On others - almost any other 64-bit
system - it gives you 128-bit quad double, but it is likely to be
implemented in software rather than hardware.
ARMv8 has 128-bit floating point, in hardware.
                                ^ registers
It does not support 128-bit floating point types.
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad precision). For RISC-V, even 32-bit targets have 128-bit "long double".
Here's a quick survey of Windows C compilers:
  Compiler   sizeof(long double)
  MSVC        8 bytes
  gcc        16
  clang       8
  DMC        10
  lccwin     16
  tcc         8
  bcc         8
On 2021-10-04, Bart <bc@freeuk.com> wrote:
Any code running on x86 can choose to use x87 instructions for floating
point.
Only in assembler...
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
Bart <bc@freeuk.com> wrote:
Here's a quick survey of Windows C compilers:
  Compiler   sizeof(long double)
  MSVC        8 bytes
  gcc        16
  clang       8
  DMC        10
  lccwin     16
  tcc         8
  bcc         8
Note that sizeof() doesn't tell how large the floating point number is,
only how much storage space the compiler is reserving for it. Some
compilers may well over-reserve space for 80-bit floating point, for alignment reasons (the extra bytes will be unused).
sizeof() is telling only if it gives less than 10 for long double.
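To separate storage size from precision, one can compare sizeof with std::numeric_limits<long double>::digits. The sketch below uses a hypothetical struct and function name; on common targets digits is 53 when long double matches double, 64 for x87 extended precision and 113 for IEEE quad, whatever padding sizeof includes.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

struct LongDoubleLayout {
    std::size_t storage_bits; // CHAR_BIT * sizeof(long double), padding included
    int significand_digits;   // numeric_limits<long double>::digits
};

inline LongDoubleLayout describe_long_double()
{
    LongDoubleLayout l{CHAR_BIT * sizeof(long double),
                       std::numeric_limits<long double>::digits};
    // The significand can never need more bits than the object occupies.
    assert(static_cast<std::size_t>(l.significand_digits) <= l.storage_bits);
    return l;
}
```

A compiler that reports 16 bytes with 64 digits is padding an 80-bit value, while 16 bytes with 113 digits indicates a genuine quad-precision format.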
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
Bart <bc@freeuk.com> wrote:
Here's a quick survey of Windows C compilers:
  Compiler   sizeof(long double)
  MSVC        8 bytes
  gcc        16
  clang       8
  DMC        10
  lccwin     16
  tcc         8
  bcc         8
Note that sizeof() doesn't tell how large the floating point number is,
only how much storage space the compiler is reserving for it. Some
compilers may well over-reserve space for 80-bit floating point, for alignment reasons (the extra bytes will be unused).
8 then the bits stored inside the long double are 80.
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware floating point with floating point in general?
There are very few /hardware/ implementations of quad precision floating point - I think perhaps Power is the only architecture that actually has
it in practice. (Some architectures, such as SPARC and RISC-V, have
defined them in the architecture but have no physical devices supporting them.)
On 05/10/2021 07:12, Bonita Montero wrote:
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware floating point with floating point in general?
On 05/10/2021 16:46, Bonita Montero wrote:
Am 05.10.2021 um 09:47 schrieb David Brown:
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?
There isn't even an ISA-spec. It's just an ABI.
And ... so what? For all your effort trying to move the goalposts,
you are still wrong - "long double" is supported, ...
David Brown <david.brown@hesbynett.no> writes:
On 05/10/2021 07:12, Bonita Montero wrote:
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?
$ uname -m
aarch64
$ cat c.c
#include <stdio.h>
#include <limits.h>
#include <float.h>
int main(void) {
    printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
           sizeof (long double),
           CHAR_BIT * sizeof (long double),
           LDBL_DIG);
}
$ gcc c.c -o c && ./c
sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
$
Am 05.10.2021 um 09:47 schrieb David Brown:
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?
There isn't even an ISA-spec. It's just an ABI.
There are very few /hardware/ implementations of quad precision floating
point - I think perhaps Power is the only architecture that actually has
it in practice. (Some architectures, such as SPARC and RISC-V, have
defined them in the architecture but have no physical devices supporting
them.)
https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format
Am 05.10.2021 um 18:04 schrieb Keith Thompson:
David Brown <david.brown@hesbynett.no> writes:
On 05/10/2021 07:12, Bonita Montero wrote:
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?
    $ uname -m
    aarch64
    $ cat c.c
    #include <stdio.h>
    #include <limits.h>
    #include <float.h>
    int main(void) {
        printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
               sizeof (long double),
               CHAR_BIT * sizeof (long double),
               LDBL_DIG);
    }
    $ gcc c.c -o c && ./c
    sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
    $
That's pure software - your CPU doesn't natively support 128 bit FP.
Am 05.10.2021 um 20:21 schrieb Bart:
On 05/10/2021 18:14, Bonita Montero wrote:
That's pure software - your CPU doesn't natively support 128 bit FP.
Does it matter?
Yes, It's rare that you need 128 bit FP and it's even more rare that
the speed doesn't matter.
On 05/10/2021 18:14, Bonita Montero wrote:
Am 05.10.2021 um 18:04 schrieb Keith Thompson:
David Brown <david.brown@hesbynett.no> writes:
On 05/10/2021 07:12, Bonita Montero wrote:
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?
    $ uname -m
    aarch64
    $ cat c.c
    #include <stdio.h>
    #include <limits.h>
    #include <float.h>
    int main(void) {
        printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
               sizeof (long double),
               CHAR_BIT * sizeof (long double),
               LDBL_DIG);
    }
    $ gcc c.c -o c && ./c
    sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
    $
That's pure software - your CPU doesn't natively support 128 bit FP.
Does it matter?
Until a couple of days ago, you didn't think 128-bit floating point
existed. ...
On 06/10/2021 06:58, Bonita Montero wrote:
Am 05.10.2021 um 20:21 schrieb Bart:
On 05/10/2021 18:14, Bonita Montero wrote:
Am 05.10.2021 um 18:04 schrieb Keith Thompson:
David Brown <david.brown@hesbynett.no> writes:
On 05/10/2021 07:12, Bonita Montero wrote:
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?
    $ uname -m
    aarch64
    $ cat c.c
    #include <stdio.h>
    #include <limits.h>
    #include <float.h>
    int main(void) {
        printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
               sizeof (long double),
               CHAR_BIT * sizeof (long double),
               LDBL_DIG);
    }
    $ gcc c.c -o c && ./c
    sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
    $
That's pure software - your CPU doesn't natively support 128 bit FP.
Does it matter?
Yes, It's rare that you need 128 bit FP and it's even more rare that
the speed doesn't matter.
You keep avoiding my point
Suppose you HAVE 80-bit or 128-FP, and need to extract the mantissa, how would you do it?
It seems the most popular new floating point format today is 16-bit
floating point (bfloat16 in ARM) - important in machine learning applications.
On 06/10/2021 07:58, Bonita Montero wrote:
Am 05.10.2021 um 20:21 schrieb Bart:
On 05/10/2021 18:14, Bonita Montero wrote:
That's pure software - your CPU doesn't natively support 128 bit FP.
Does it matter?
Yes, It's rare that you need 128 bit FP and it's even more rare that
the speed doesn't matter.
Until a couple of days ago, you didn't think 128-bit floating point
existed. Now you think you know how it is used and what is important to
users?
union
{
     long double value;
     struct
     {
        uint64_t mantissa;
        uint16_t exponent : 15,
                 sign : 1;
     };
}
Also, I believe bitfield allocation is implementation defined.
The way that I did it is at least portable among x86 compilers.
On platforms where you have a different byte order this won't
work, but these platforms don't have an 80-bit FP type anyway.
As the high 1 bit is always included in the 64-bit mantissa of an 80-bit FP value, that's much easier, in contrast to smaller FP values, where the
1 bit is implicit except for denormal values or values with the maximum exponent.
union
{
    long double value;
    struct
    {
       uint64_t mantissa;
       uint16_t exponent : 15,
                sign : 1;
    };
}
Just store your value into value and extract the mantissa from mantissa.
On 10/6/2021 5:48 AM, Bonita Montero wrote:
As the high 1 bit is always included in the 64-bit mantissa of an 80-bit
FP value, that's much easier, in contrast to smaller FP values, where the
1 bit is implicit except for denormal values or values with the maximum
exponent.
union
{
     long double value;
     struct
     {
        uint64_t mantissa;
        uint16_t exponent : 15,
                 sign : 1;
     };
}
Just store your value into value and extract the mantissa from mantissa.
Does anyone know if type punning through a union is still undefined
behavior?
Also, I believe bitfield allocation is implementation defined.
On 06/10/2021 06:58, Bonita Montero wrote:
Am 05.10.2021 um 20:21 schrieb Bart:
On 05/10/2021 18:14, Bonita Montero wrote:
Am 05.10.2021 um 18:04 schrieb Keith Thompson:
David Brown <david.brown@hesbynett.no> writes:
On 05/10/2021 07:12, Bonita Montero wrote:
Am 04.10.2021 um 20:04 schrieb David Brown:
MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".
There isn't any ARM-implementation with 128 bit FP.
And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?
    $ uname -m
    aarch64
    $ cat c.c
    #include <stdio.h>
    #include <limits.h>
    #include <float.h>
    int main(void) {
        printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
               sizeof (long double),
               CHAR_BIT * sizeof (long double),
               LDBL_DIG);
    }
    $ gcc c.c -o c && ./c
    sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
    $
That's pure software - your CPU doesn't natively support 128 bit FP.
Does it matter?
Yes, It's rare that you need 128 bit FP and it's even more rare that
the speed doesn't matter.
You keep avoiding my point
Suppose you HAVE 80-bit or 128-FP, and need to extract the mantissa, how would you do it?
It's like someone asking how to work out the 5th root of a number, and
you give a method for the square root. Then when pressed, you say that
it is rare to need a 5th root!
red floyd <no.spam.here@its.invalid> wrote:
union
{
long double value;
struct
{
uint64_t mantissa;
uint16_t exponent : 15,
sign : 1;
};
}
Just store your value into value and extract the mantissa from mantissa.
Does anyone know if type punning through a union is still undefined
behavior?
Also, I believe bitfield allocation is implementation defined.
Also, the above union assumes that the long double value is stored in the same byte order as the members of the struct. It also assumes that the long double occupies exactly 10 bytes and isn't padded in the wrong end.
And, as you say, it's not defined whether that sign bit will be the lowest
or highest bit of that 16-bit bitfield. (I don't think the standard even guarantees that the 1-bit will be stored in the same bytes as the 15-bits.)
You could make some compile-time checks to see that the union will have
the proper internal structure (and refuse to compile otherwise), although IIRC there's no way to check endianness at compile time (unless C++20
added some new features to do that).
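One way to avoid both the union punning and the bit-field layout questions is to copy the bytes with memcpy and guard the format assumptions explicitly. The sketch below is hedged accordingly: X87Parts and split_x87 are hypothetical names, the format test (64 digits, max_exponent 16384, at least 10 bytes) is an assumption about what identifies x87 extended precision, and the byte-order probe runs at run time so the code stays pre-C++20 (std::endian would allow a compile-time check).

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <cstring>
#include <limits>

struct X87Parts {            // hypothetical name, for illustration only
    std::uint64_t mantissa;  // 64 bits, explicit integer bit in bit 63
    std::uint16_t exponent;  // 15-bit biased exponent
    bool sign;
};

// Returns false unless long double looks like little-endian x87 extended precision.
inline bool split_x87(long double value, X87Parts& out)
{
    typedef std::numeric_limits<long double> lim;
    static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");
    if (lim::radix != 2 || lim::digits != 64 || lim::max_exponent != 16384 ||
        sizeof(long double) < 10)
        return false;
    // Runtime byte-order probe: little-endian stores the low byte first.
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    if (first != 1)
        return false;

    // Buffer is at least 10 bytes so the reads below stay in bounds on any target.
    unsigned char bytes[sizeof(long double) < 10 ? 10 : sizeof(long double)] = {};
    std::memcpy(bytes, &value, sizeof(long double));
    std::uint16_t sign_exp = 0;
    std::memcpy(&out.mantissa, bytes, 8);
    std::memcpy(&sign_exp, bytes + 8, 2);
    out.exponent = static_cast<std::uint16_t>(sign_exp & 0x7FFF);
    out.sign = (sign_exp >> 15) != 0;
    // Valid x87 encodings with a non-zero exponent carry the explicit integer bit.
    assert(out.exponent == 0 || (out.mantissa >> 63) == 1);
    return true;
}
```

On a target where the checks fail (for example MSVC's 64-bit long double or AArch64's quad format) the function reports false instead of returning meaningless bits.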
On 07.10.2021 at 08:20, David Brown wrote:
C++20 added std::endian for convenience and standardisation of common
compiler features for endianness:
Why an endianness check here?
There are no big-endian machines with 80 bit FP.
On 07.10.2021 at 08:38, Bonita Montero wrote:
On 07.10.2021 at 08:20, David Brown wrote:
C++20 added std::endian for convenience and standardisation of common
compiler features for endianness:
Why an endianness check here?
There are no big-endian machines with 80 bit FP.
Oh, I just remembered the 68K FPUs. These have 80 bit FP.
But 68K is dead today.
On 07.10.2021 at 07:35, Juha Nieminen wrote:
red floyd <no.spam.here@its.invalid> wrote:
union
{
long double value;
struct
{
uint64_t mantissa;
uint16_t exponent : 15,
sign : 1;
};
}
Just store your value into value and extract the mantissa from mantissa.
Does anyone know if type punning through a union is still undefined
behavior?
Also, I believe bitfield allocation is implementation defined.
Also, the above union assumes that the long double value is stored in the
same byte order as the members of the struct. It also assumes that the long
double occupies exactly 10 bytes and isn't padded in the wrong end.
It actually fits for all x86-compilers.
(And if anyone tells you type-punning unions "work" in C++, then ask why std::bit_cast<> was added to the language.)
On 07.10.2021 at 07:35, Juha Nieminen wrote:
red floyd <no.spam.here@its.invalid> wrote:
union
{
long double value;
struct
{
uint64_t mantissa;
uint16_t exponent : 15,
sign : 1;
};
}
Just store your value into value and extract the mantissa from
mantissa.
Does anyone know if type punning through a union is still undefined
behavior?
Also, I believe bitfield allocation is implementation defined.
Also, the above union assumes that the long double value is stored in the
same byte order as the members of the struct. It also assumes that the
long
double occupies exactly 10 bytes and isn't padded in the wrong end.
It actually fits for all x86-compilers.
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 07.10.2021 at 07:35, Juha Nieminen wrote:
red floyd <no.spam.here@its.invalid> wrote:
union
{
long double value;
struct
{
uint64_t mantissa;
uint16_t exponent : 15,
sign : 1;
};
}
Just store your value into value and extract the mantissa from mantissa.
Does anyone know if type punning through a union is still undefined
behavior?
Also, I believe bitfield allocation is implementation defined.
Also, the above union assumes that the long double value is stored in the
same byte order as the members of the struct. It also assumes that the long
double occupies exactly 10 bytes and isn't padded in the wrong end.
It actually fits for all x86-compilers.
ARM is becoming more and more common.
David Brown <david.brown@hesbynett.no> wrote:
(And if anyone tells you type-punning unions "work" in C++, then ask why
std::bit_cast<> was added to the language.)
Why hasn't type-punning been standardized in C++, given that it's standardized in C?
A few compilers, such as gcc, actually documented that type-punning
unions worked - most did not, though people expected them to support
them. I've never been clear as to whether the C standards committee
changed their mind to make type-punning fully defined, or if they simply changed the wording of the standard to make it less ambiguous.
The 68k ISA lives on in the microcontroller world, as the ColdFire processors, though I don't believe any new ColdFire microcontrollers
have been released since NXP bought Freescale in 2015. Even at that
stage it was no longer a major development line for Freescale. A
decade or so before that, ColdFire was one of the most popular
architectures in small network devices such as SOHO NAT routers, ...
On 07.10.2021 at 10:23, David Brown wrote:
The 68k ISA lives on in the microcontroller world, as the ColdFire
processors, though I don't believe any new ColdFire microcontrollers
have been released since NXP bought Freescale in 2015. Even at that
stage it was no longer a major development line for Freescale. A
decade or so before that, ColdFire was one of the most popular
architectures in small network devices such as SOHO NAT routers, ...
MIPS was dominant in routers before it was replaced with ARM.
68K had a cleaner instruction set than x86, but the double-indirect
addressing modes introduced with the 68020 were totally brain-damaged.
MIPS was dominant in routers before it was replaced with ARM.
MIPS was dominant in high-end routers and fast switches (with PowerPC
being the main competitor). ...
On 09.10.2021 at 10:23, David Brown wrote:
MIPS was dominant in routers before it was replaced with ARM.
MIPS was dominant in high-end routers and fast switches (with PowerPC
being the main competitor). ...
I don't know about former high-end routers, but MIPS was by far
the most dominant architecture in SOHO routers before ARM. 68k
played almost no role then. For example, in Germany almost everyone uses
the Fritz!Box routers, and they were all MIPS-based before AVM
switched to ARM.
On 08/10/2021 20:23, Bonita Montero wrote:
68K had a cleaner instruction set than x86, but the double-indirect
addressing modes introduced with the 68020 were totally
brain-damaged.
68k was a better ISA than x86 in almost every way imaginable. But it
did get a few overly complicated addressing modes - some of these were dropped in later 68k devices. And the /implementation/ in the 68k
family was not as good - Motorola didn't have as many smart people and
as big budgets as Intel or even AMD. On the 68030, IIRC, someone
discovered that a software division routine worked faster than the
hardware division instruction.
When you look back at the original 68000 compared to the 8086, it is
clear that technically the 68000 ISA was a modern and forward-looking architecture while the 8086 was outdated and old-fashioned before the
first samples were made.
The story of the IBM PC shows how technical
brilliance is not enough to succeed - the worst cpu architecture around, combined with the worst OS ever hacked together, ended up dominant.
On 09/10/2021 09:23, David Brown wrote:
On 08/10/2021 20:23, Bonita Montero wrote:
68K had a cleaner instruction set than x86, but the double-indirect
addressing modes introduced with the 68020 were totally
brain-damaged.
68k was a better ISA than x86 in almost every way imaginable. But it
did get a few overly complicated addressing modes - some of these were
dropped in later 68k devices. And the /implementation/ in the 68k
family was not as good - Motorola didn't have as many smart people and
as big budgets as Intel or even AMD. On the 68030, IIRC, someone
discovered that a software division routine worked faster than the
hardware division instruction.
When you look back at the original 68000 compared to the 8086, it is
clear that technically the 68000 ISA was a modern and forward-looking
architecture while the 8086 was outdated and old-fashioned before the
first samples were made.
That was my initial impression when I first looked at 68000 in the 80s.
Until I had a closer look at the instruction set, with a view to
generating code for it from a compiler. Then it had almost as much lack
of orthogonality as the 8086.
The obvious one is in having two sets of integer registers, 8 Data
registers and 8 Address registers, instead of just 16 general registers,
so that you are constantly thinking about which register set your
operands and intermediate results should go in.
(I never got round to using the chip; my company had moved on to doing
stuff for the IBM PC, instead of developing our own hardware products.)
The story of the IBM PC shows how technical
brilliance is not enough to succeed - the worst cpu architecture around,
combined with the worst OS ever hacked together, ended up dominant.
I was looking forward to the Zilog Z80000, but unfortunately that never happened.
On 09/10/2021 09:23, David Brown wrote:
When you look back at the original 68000 compared to the 8086, it is
clear that technically the 68000 ISA was a modern and forward-looking
architecture while the 8086 was outdated and old-fashioned before the
first samples were made. The story of the IBM PC shows how technical
brilliance is not enough to succeed - the worst cpu architecture around,
combined with the worst OS ever hacked together, ended up dominant.
Very nicely put.
Of course one reason for that is the success of the 8080, and its
successors the 8085 and Zilog's Z80 (which had an even less regular instruction set).
AIUI Intel had a world leader on their hands, and felt that having some
sort of compatibility in the next generation was important.
While the 6800 had some success (and the 6809 was my favourite 8-bit
CPU) it was nowhere near that of the 8080.