Forum: >>> Magnum BBS <<<

SSE "denormals are zeros"

From Bonita Montero@21:1/5 to All on Fri Jun 17 18:11:28 2016

Does anyone know what the "denormals are zeros" flag of the
x86 MXCSR is good for?
Or more precisely: I know what it does, but I don't know why
it should make sense to consider denormal values as zeros.

--
http://facebook.com/bonita.montero/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Robert Wessel@21:1/5 to Bonita.Montero@gmail.com on Fri Jun 17 12:33:32 2016

On Fri, 17 Jun 2016 18:11:28 +0200, Bonita Montero
<Bonita.Montero@gmail.com> wrote:

Does anyone know what the "denormals are zeros" flag of the
x86 MXCSR is good for?
Or more precisely: I know what it does, but I don't know why
it should make sense to consider denormal values as zeros.

Mainly performance - denormals tend to be slow (although less so on
recent x86s). Some codes do things like converge to zero, but end up
passing through the denormal range first - just skipping that can
sometimes be a considerable performance improvement. There are some
downsides to disabling gradual underflow, but in practice, in many cases
where you get denormals you're on your way to zero anyway, and in most
cases the advantages of gradual underflow are very small.
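
The "converge to zero" case above can be sketched in standard C++. This is only an illustration: the function name `count_subnormal_steps` is made up for this example, and it assumes IEEE 754 doubles with the default floating-point environment (no FTZ/DAZ, no fast-math compiler flags).

```cpp
#include <cmath>

// Hypothetical helper: repeatedly halve `x` until it reaches zero and count
// how many of the intermediate values were subnormal (denormal).
int count_subnormal_steps(double x) {
    int subnormal_steps = 0;
    // The isfinite() check keeps Inf/NaN inputs from looping forever.
    while (std::isfinite(x) && x != 0.0) {
        x *= 0.5;
        if (std::fpclassify(x) == FP_SUBNORMAL) {
            ++subnormal_steps;
        }
    }
    return subnormal_steps;
}
```

Starting from 1.0, the value is subnormal for 52 consecutive halvings before it finally becomes zero. With flush-to-zero enabled, those steps would be skipped and the value would go straight from the smallest normal number to zero.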


From Robert Wessel@21:1/5 to robertwessel2@yahoo.com on Fri Jun 17 12:36:57 2016

On Fri, 17 Jun 2016 12:33:32 -0500, Robert Wessel
<robertwessel2@yahoo.com> wrote:

On Fri, 17 Jun 2016 18:11:28 +0200, Bonita Montero
<Bonita.Montero@gmail.com> wrote:

Does anyone know what the "denormals are zeros" flag of the
x86 MXCSR is good for?
Or more precisely: I know what it does, but I don't know why
it should make sense to consider denormal values as zeros.

Mainly performance - denormals tend to be slow (although less so on
recent x86s). Some codes do things like converge to zero, but end up
passing through the denormal range first - just skipping that can
sometimes be a considerable performance improvement. There are some
downsides to disabling gradual underflow, but in practice, in many cases
where you get denormals you're on your way to zero anyway, and in most
cases the advantages of gradual underflow are very small.

And x86 implements two somewhat different options: flush-to-zero (FTZ),
which replaces denormal results with zero, and denormals-are-zero (DAZ),
which treats denormal inputs as zero. They also differ in how some
exceptions and flags are handled.


From Terje Mathisen@21:1/5 to Bonita Montero on Sat Jun 18 08:07:05 2016

Bonita Montero wrote:

I printed the sums only to prevent the compiler from optimizing away
the summation. The result is that on my Xeon E3-1240 (Skylake) each
iteration takes four clock cycles when "d" is non-denormal. When "d"
is a denormal, each iteration takes about 150 clock cycles! I would
never have believed denormals could have such a huge performance
impact if I hadn't seen it myself.

Ouch!!!

That is really horrible. :-(

I have worked on implementing fp for the Mill cpu, and there is no way
you should allow denormals (on input and/or output) to add more than a
cycle or two to your processing time.

To get to 150 cycles you effectively need a trap & fixup.

According to Mitch Alsup you can handle denormals inline, in hw, with a
total of 6 gate delays, which is a fraction of a cycle on any current
process.

And what about GPUs? I suppose they don't support denormals.
Is this right?

Usually so, yeah.

The easiest is to treat denormals as zero, in which case you can do all
your special-case handling with a very small lookup table based on the
exponent field only:

00.0 -> Zero
00.1 to ff.e -> Normal
ff.f -> Inf or NaN

If you want/need to handle NaNs you still need to look at the mantissa
for maximal exponents, but you can do that in parallel with the normal
processing anyway, with plenty of time to spare.

Handling denorms, however, requires a scan for the first non-zero
mantissa bit, a shift to normalize and an adjustment of the (internal)
exponent, so this could easily take several cycles unless you are smart.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Bonita Montero@21:1/5 to All on Fri Jun 17 20:02:20 2016

On 17.06.2016 at 19:33, Robert Wessel wrote:

Mainly performance - denormals tend to be slow (although less so on
recent x86s). ...

I also asked about this on Stack Overflow, and someone gave me a link to
an interesting Intel article on it: https://software.intel.com/en-us/node/513376

So I wrote a little program to check the performance-impact of
denormals. Here it is:

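#include <windows.h>   // DWORDLONG
#include <intrin.h>    // __rdtsc (MSVC)
#include <iostream>

using namespace std;

// Overlays a 64-bit integer on a double so the loops can step through
// consecutive bit patterns. Reading the member that was not written last
// is formally undefined in ISO C++; the program relies on the compiler
// permitting it.
union DBL
{
    DWORDLONG dwlValue;
    double    value;
};

int main()
{
    DWORDLONG dwlTicks;
    DBL       d;
    double    sum;

    // Pass 1: bit patterns 0 .. 99999999, i.e. zero followed by denormals.
    dwlTicks = __rdtsc();

    for( d.dwlValue = 0, sum = 0.0;
         d.dwlValue < 100000000; d.dwlValue++ )
        sum += d.value;

    dwlTicks = __rdtsc() - dwlTicks;
    cout << sum << endl;                     // keeps the loop from being optimized away
    cout << dwlTicks / 100000000.0 << endl;  // TSC ticks per iteration

    // Pass 2: the same number of additions, starting at the smallest
    // normal double (bit pattern 0x0010000000000000).
    dwlTicks = __rdtsc();

    for( d.dwlValue = 0x0010000000000000u, sum = 0.0;
         d.dwlValue < (0x0010000000000000u + 100000000); d.dwlValue++ )
        sum += d.value;

    dwlTicks = __rdtsc() - dwlTicks;
    cout << sum << endl;
    cout << dwlTicks / 100000000.0 << endl;

    return 0;
}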

I printed the sums only to prevent the compiler from optimizing away
the summation. The result is that on my Xeon E3-1240 (Skylake) each
iteration takes four clock cycles when "d" is non-denormal. When "d"
is a denormal, each iteration takes about 150 clock cycles! I would
never have believed denormals could have such a huge performance
impact if I hadn't seen it myself.
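
The loop bounds in the program above can be checked with a small portable sketch, assuming IEEE 754 doubles. The helper names `double_from_bits` and `is_denormal_pattern` are made up for this example, and `std::memcpy` stands in for the union used in the original program.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 64-bit pattern as a double.
double double_from_bits(std::uint64_t bits) {
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// True when the bit pattern encodes a denormal (subnormal) double.
bool is_denormal_pattern(std::uint64_t bits) {
    return std::fpclassify(double_from_bits(bits)) == FP_SUBNORMAL;
}
```

Patterns 1 through 99999999 all have a zero exponent field and a non-zero mantissa, so every value added in the first loop after the initial zero is a denormal. Pattern 0x0010000000000000 is the smallest normal double, so the second loop only ever adds normal values.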

And what about GPUs? I suppose they don't support denormals.
Is this right?

--
http://facebook.com/bonita.montero/

