• SSE "denormals are zeros"

    From Bonita Montero@21:1/5 to All on Fri Jun 17 18:11:28 2016
    Does anyone know what the "denormals are zeros" flag of the
    x86 MXCSR is good for?
    Or more precisely: I know what it does, but I don't know why
    it should make sense to consider denormal values as zeros.

    --
    http://facebook.com/bonita.montero/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Wessel@21:1/5 to Bonita.Montero@gmail.com on Fri Jun 17 12:33:32 2016
    On Fri, 17 Jun 2016 18:11:28 +0200, Bonita Montero
    <Bonita.Montero@gmail.com> wrote:

    > Does anyone know what the "denormals are zeros" flag of the
    > x86 MXCSR is good for?
    > Or more precisely: I know what it does, but I don't know why
    > it should make sense to consider denormal values as zeros.


    Mainly performance: denormals tend to be slow (although less so on
    recent x86s). Some codes converge to zero but pass through the
    denormal range on the way, and just skipping that range can
    sometimes be a considerable performance improvement. There are some
    downsides to disabling gradual underflow, but in practice, in many
    cases where you get denormals you're on your way to zero anyway, and
    in most cases the advantages of gradual underflow are very small.
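To make the "passing through the denormal range" point concrete, here is a minimal sketch in portable C++. It assumes IEEE 754 doubles, the default round-to-nearest mode, and that flush-to-zero is not enabled; the helper name `subnormal_steps_to_zero` is made up for illustration.

```cpp
#include <cmath>  // std::fpclassify, FP_SUBNORMAL

// Hypothetical helper: keep halving x until it reaches zero and count how
// many of the intermediate values were subnormal (denormal).
int subnormal_steps_to_zero(double x) {
    int steps = 0;
    volatile double v = x;  // volatile keeps the loop from being folded at compile time
    while (v != 0.0) {
        if (std::fpclassify(v) == FP_SUBNORMAL) {
            ++steps;        // this value lies in the gradual-underflow range
        }
        v = v * 0.5;
    }
    return steps;
}
```

Starting from 1.0, a double goes through 52 denormal values (2^-1023 down to 2^-1074) before it finally becomes zero. Those are the iterations that flushing to zero is meant to skip.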

  • From Robert Wessel@21:1/5 to robertwessel2@yahoo.com on Fri Jun 17 12:36:57 2016
    On Fri, 17 Jun 2016 12:33:32 -0500, Robert Wessel
    <robertwessel2@yahoo.com> wrote:

    > On Fri, 17 Jun 2016 18:11:28 +0200, Bonita Montero
    > <Bonita.Montero@gmail.com> wrote:
    >
    >> Does anyone know what the "denormals are zeros" flag of the
    >> x86 MXCSR is good for?
    >> Or more precisely: I know what it does, but I don't know why
    >> it should make sense to consider denormal values as zeros.
    >
    > Mainly performance: denormals tend to be slow (although less so on
    > recent x86s). Some codes converge to zero but pass through the
    > denormal range on the way, and just skipping that range can
    > sometimes be a considerable performance improvement. There are some
    > downsides to disabling gradual underflow, but in practice, in many
    > cases where you get denormals you're on your way to zero anyway, and
    > in most cases the advantages of gradual underflow are very small.


    And x86 implements two somewhat different options: flush-to-zero
    (FTZ), which replaces denormal results with zero, and
    denormals-are-zero (DAZ), which treats denormal inputs as zero. They
    also differ in how some exceptions and flags are handled.

  • From Terje Mathisen@21:1/5 to Bonita Montero on Sat Jun 18 08:07:05 2016
    Bonita Montero wrote:

    > I printed the sums only to prevent the compiler from optimizing away
    > the summation. The result is that on my Xeon E3-1240 (Skylake) each
    > iteration takes four clock cycles when "d" is non-denormal. When "d"
    > is a denormal, each iteration takes about 150 clock cycles! I'd never
    > have believed denormals could have such a huge performance impact if
    > I hadn't seen it myself.

    Ouch!!!

    That is really horrible. :-(

    I have worked on implementing fp for the Mill cpu; there is no way you
    should allow denormals (on input and/or output) to add more than a
    cycle or two to your processing time.

    To get to 150 cycles you effectively need a trap & fixup.

    According to Mitch Alsup you can handle denormals inline, in hw, with a
    total of 6 gate delays, which is a fraction of a cycle on any current process.

    > And what about GPUs? I suppose they don't support denormals.
    > Is this right?

    Usually so, yeah.

    The easiest is to treat denormals as zero, in which case you can do all
    your special-case handling with a very small lookup table based on the
    exponent field only:

    00.0 -> Zero
    00.1 to ff.e -> Normal
    ff.f -> Inf or NaN

    If you want/need to handle NaNs you still need to look at the mantissa
    for maximal exponents, but you can do that in parallel with the normal
    processing anyway, with plenty of time to spare.

    Handling denorms, however, requires a scan for the first non-zero
    mantissa bit, a shift to normalize, and an adjustment of the (internal)
    exponent, so this could easily take several cycles unless you are smart.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Bonita Montero@21:1/5 to All on Fri Jun 17 20:02:20 2016
    Am 17.06.2016 um 19:33 schrieb Robert Wessel:

    > Mainly performance - denormals tend to be slow (although less so on
    > recent x86s). ...

    I also asked about this on Stack Overflow, and someone gave me a link to
    an interesting Intel article on the subject:
    https://software.intel.com/en-us/node/513376

    So I wrote a little program to check the performance-impact of
    denormals. Here it is:

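    #include <windows.h>   // DWORDLONG
    #include <intrin.h>    // __rdtsc
    #include <iostream>

    using namespace std;

    // Lets the loop counter double as the bit pattern of a double.
    // Reading the other union member is not guaranteed by the C++
    // standard; this relies on the compiler (MSVC here) allowing it.
    union DBL
    {
        DWORDLONG dwlValue;
        double    value;
    };

    int main()
    {
        DWORDLONG dwlTicks;
        DBL       d;
        double    sum;

        // Pass 1: bit patterns 0 .. 99999999, i.e. zero and the smallest denormals.
        dwlTicks = __rdtsc();

        for( d.dwlValue = 0, sum = 0.0;
             d.dwlValue < 100000000; d.dwlValue++ )
            sum += d.value;

        dwlTicks = __rdtsc() - dwlTicks;
        cout << sum << endl;                     // printed so the loop is not optimized away
        cout << dwlTicks / 100000000.0 << endl;  // TSC ticks per iteration

        // Pass 2: the same number of values, starting at the smallest normal double.
        dwlTicks = __rdtsc();

        for( d.dwlValue = 0x0010000000000000u, sum = 0.0;
             d.dwlValue < (0x0010000000000000u + 100000000); d.dwlValue++ )
            sum += d.value;

        dwlTicks = __rdtsc() - dwlTicks;
        cout << sum << endl;
        cout << dwlTicks / 100000000.0 << endl;  // TSC ticks per iteration

        return 0;
    }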

    I printed the sums only to prevent the compiler from optimizing away
    the summation. The result is that on my Xeon E3-1240 (Skylake) each
    iteration takes four clock cycles when "d" is non-denormal. When "d"
    is a denormal, each iteration takes about 150 clock cycles! I'd never
    have believed denormals could have such a huge performance impact if
    I hadn't seen it myself.
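For anyone who wants to repeat the measurement outside Windows/MSVC, a portable variant could look roughly like the sketch below. It is an assumption-laden rewrite, not the program above: it uses `std::memcpy` instead of the union, and `std::chrono::steady_clock` instead of `__rdtsc`, so it reports nanoseconds rather than clock cycles. The function names are made up for illustration.

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>  // std::memcpy

// Hypothetical helper: sums `count` doubles whose bit patterns are
// first_bits, first_bits + 1, ..., first_bits + count - 1.
double sum_bit_patterns(std::uint64_t first_bits, std::uint64_t count) {
    double sum = 0.0;
    for (std::uint64_t bits = first_bits; bits < first_bits + count; ++bits) {
        double value;
        std::memcpy(&value, &bits, sizeof value);  // reinterpret the counter as a double
        sum += value;
    }
    return sum;
}

// Average wall-clock nanoseconds per iteration (not TSC ticks).
// The sum is handed back through sum_out so the loop cannot be discarded.
double nanoseconds_per_iteration(std::uint64_t first_bits, std::uint64_t count,
                                 double& sum_out) {
    const auto start = std::chrono::steady_clock::now();
    sum_out = sum_bit_patterns(first_bits, count);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(count);
}
```

Calling `nanoseconds_per_iteration(0, 100000000, sum)` and `nanoseconds_per_iteration(0x0010000000000000, 100000000, sum)` corresponds to the denormal and normal passes. How large the gap is will depend on the CPU and on whether FTZ/DAZ are enabled.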

    And what about GPUs? I suppose they don't support denormals.
    Is this right?

    --
    http://facebook.com/bonita.montero/
