anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various
places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?
Most of the problems encountered when moving Unix (System V)
from 16-bit to 32-bit were more around missing typedefs for certain
data types (e.g. uids, gids, pids, etc.), so there was
a lot of code that declared these as shorts, but the 32-bit
kernels defined them as 32-bit (unsigned) values.
That's when uid_t, pid_t, gid_t were added.
Then there were the folks who used 'short [int]' instead of 'int'
since they were the same size on the PDP-11.
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads).
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were faster at 16-bit.
I know no implementation of a 64-bit architecture where ALU operations (except maybe division, where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
Guess 1: There was more software that depended on sizeof(int)==4 than software that depended on sizeof(int)==sizeof(char *).
Guess 2: When benchmarketing without adapting the source code (as is
usual), I32LP64 produced better numbers than ILP64 for some
benchmarks, because arrays and other data structures with int elements
are smaller and have better cache hit rates.
My guess is that it was a mixture of 1 and 2, with 2 being the
decisive factor. I have certainly seen a lot of writing about how
64-bit (pointers) hurt performance, and it even led to the x32
nonsense (which never went anywhere, not surprising to me). These
days support for 32-bit applications is eliminated from ARM cores,
another indication that the performance advantages of 32-bit pointers
are minor.
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
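To illustrate the point, a minimal sketch (example code, not from the
original post; the function and names are invented):

#include <stdint.h>

/* Full-width loop counter: the index is already as wide as a GPR, so
   neither AMD64's zero-extension nor RV64G's sign-extension of 32-bit
   results costs anything extra in the loop. */
long sum(const long *a, intptr_t n)
{
    long s = 0;
    for (intptr_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

With a 32-bit counter the compiler must either prove the extension
away or re-extend the index on every iteration.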
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason to use
something else for a loop counter.
For a type of which you store many in an array or other data
structure, you probably prefer int32_t rather than int_fast32_t if 32
bits is enough. So I don't see a reason for int_fast32_t etc.
These adapt suitably for different targets. If you want to force the
issue, then "int64_t" is IMHO clearer than "long long int" and does
not give a strange impression where you are using a type aimed at
pointer arithmetic for general integer arithmetic.
Why do you bring up "long long int"? As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
If you want fast local variables, use C's [u]int_fastN_t types.
That's what they are for.
I don't see a point in those types. What's wrong with (u)intptr_t IYO?
Don't use "-fwrapv" unless you actually need it - in most code, if
your arithmetic overflows, you have a mistake in your code, so letting
the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?
I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
and for good performance, I write the code appropriately (e.g., by
using intptr_t instead of int).
(And
it lets you check for overflow bugs using run-time sanitizers.)
If the compiler assumes that overflow does not happen, how do these "sanitizers" work?
Anyway, I certainly have code that relies on modulo arithmetic.
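As an illustrative sketch of such code (an invented example, not
taken from Gforth): with -fwrapv the signed overflow in the
multiply-accumulate below is defined to wrap, so the function gives
the same answer everywhere.

#include <stdint.h>

/* Classic multiplicative string hash; h * 31 + c overflows int32_t
   for longer strings and relies on wraparound (guaranteed here by
   -fwrapv, or by doing the arithmetic in unsigned). */
int32_t str_hash(const char *s)
{
    int32_t h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}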
- anton
scott@slp53.sl.home (Scott Lurndal) writes:
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
[...]
In any case, I32LP64 caused breakage for my code, and I expect that
there was more code around with the assumption
sizeof(int)==sizeof(int *) than with the assumption sizeof(int)==4.
Of course, we worked around this misfeature of the C compilers on
Digital OSF/1, but those who assumed sizeof(int)==4 would have
adapted their code if the decision had been for ILP64.
- anton
Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
Guess 1: There was more software that depended on sizeof(int)==4 than
software that depended on sizeof(int)==sizeof(char *).
There remains a lot of SW that is dependent on INT being smaller than LONG.
My ONLY argument is that int was SUPPOSED to be the fastest arithmetic.
It was in PDP-11 days and VAX days, but it is no longer.
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
The semantics of unsigned are better when you are twiddling bits than
when signed.
My programming practice is to use unsigned everywhere that
a negative number is unexpected.
Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?
There are more people willing to pay good money for a wrong answer fast
than there are willing to pay good money for a correct answer slow.
Anton Ertl wrote:
[...] These days support for 32-bit applications is eliminated from
ARM cores, another indication that the performance advantages of
32-bit pointers are minor.
Try programming array codes where the arrays are TB in size with P32.
David Brown <david.brown@hesbynett.no> writes:
On 21/02/2024 17:33, Michael S wrote:
If one wants top performance on 64-bit architectures then avoiding
'unsigned int' indices is a very good idea. Hoping that the compiler
will somehow figure out what you meant instead of doing what you
wrote is naive.
Indeed.
I hope you're not agreeing that unsigned array indices should be
avoided - they should instead be preferred.
On 21/02/2024 18:27, Scott Lurndal wrote:
[...]
I hope you're not agreeing that unsigned array indices should be
avoided - they should instead be preferred.
I am agreeing that "unsigned int" indices should be avoided on 64-bit
architectures (at least, when you are using them in loops) because the
compiler has to generate extra instructions to handle wrapping.
David Brown <david.brown@hesbynett.no> writes:
[...]
I am agreeing that "unsigned int" indices should be avoided on 64-bit
architectures (at least, when you are using them in loops) because the
compiler has to generate extra instructions to handle wrapping.
Does it?
A simple test case shows that using an unsigned long or an unsigned
int as a loop index generates exactly the same non-optimized code,
albeit the unsigned long version was slightly larger (4 bytes)
in code footprint. This was with GCC on 64-bit Fedora Core and
without optimization (with optimization (-O2) both generated
identical code).
unsigned long
test(void)
{
    unsigned int x;                   /* Alternate unsigned long x; */
    unsigned long vector[1000];
    unsigned long result;
    for(x = 0; x < 1000u; x++) {
        unsigned long a = vector[x];
        result += a;
    }
    return result;
}
unsigned int loop index:
Unsigned long loop index:
With -O2:
On 21/02/2024 19:40, Scott Lurndal wrote:
[...]
unsigned int loop index:
There's no point in talking about the most efficient generated code
without enabling optimisation! So this example is irrelevant.
Unsigned long loop index:
This one too.
With -O2:
There's no point in talking about generated code with a source that
uses uninitialised data. The compiler could just as well have turned
the code into "return 0;".
However, in code where the compiler can be sure there is no wrapping on
the index, and can turn the code into pointer arithmetic and stop when
the pointer gets to the end, then no extra code is needed for an
"unsigned int" index.
But if there /is/ a possibility of wrapping, such as in the "sext"
function posted by Anton, it is needed.
Or if we adapt your function to have a range:
unsigned long
test_uint(long unsigned int a, long unsigned int b)
{
   unsigned int x;                   /* Alternate unsigned long x; */
   extern unsigned long vector[1000];
   unsigned long result = 0;
   for(x = a; x <= b; x++) {
       unsigned long a = vector[x];
       result += a;
   }
   return result;
}
Using "unsigned int" instead of "int" or "unsigned long" means 5
instructions in the inner loop, instead of 4. Of course there are
other ways to re-arrange the function to get the optimal result. But
the point is, when the compiler does not know for sure that unsigned
int arithmetic will not wrap, it can lead to extra instructions.
<https://godbolt.org/z/a5en1j5xY>
On Wed, 21 Feb 2024 22:48:52 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
David Brown wrote:
[...]
I really hope you are not in the habit of reusing function
parameter names as local variables inside the function? Also the
various inconsistent integer names (long unsigned int, unsigned
long), are these supposed to be different types?
I would expect to see code very close to this for a sane setup:
  xor rax,rax        ; result variable
  .. setup code for beginning and end of the range, with check that a<=b
next:
  add rax,[rsi]
  lea rsi,[rsi+8]
  cmp esi,edi
  jb next
All four instructions could run in the same cycle, but since these
are unsigned variables we can easily parallelize a bit, using
rax,rbx,rcx,rdx as accumulators, and at this point the performance
will be limited by how many data items we can load per cycle. The
obvious workaround is of course to use SIMD, with a 16/32/64-byte
register as the wide accumulator.
[...]
I see that the code is different: Can you actually measure a
performance difference?
Terje
In that particular case the speed difference is likely hard to
measure on all but the smallest cores.
Below is an example where I would expect a significant difference even
on cores that were state of the art just 4-5 years ago, like Intel
Skylake or Arm Cortex-A76:
double foo(const double* src, unsigned len)
{
    double acc42 = 0;
    double acc52 = 0;
    double acc62 = 0;
    double acc72 = 0;
    for (unsigned i = 0; i < len; ++i) {
        acc42 += src[i+42];
        acc52 += src[i+52];
        acc62 += src[i+62];
        acc72 += src[i+72];
    }
    return acc42*acc52*acc62*acc72;
}
Godbolt links:
x86-64: https://godbolt.org/z/jErxbfshM
ARM64: https://godbolt.org/z/xzr9xMTnW
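For comparison, an illustrative rewrite (not posted in the thread)
that sidesteps the problem with a full-width index: with size_t,
i+42 and friends cannot wrap at 32 bits, so the compiler may fold the
offsets into the addressing mode and vectorize more readily.

#include <stddef.h>

double foo_sz(const double* src, size_t len)
{
    double acc42 = 0;
    double acc52 = 0;
    double acc62 = 0;
    double acc72 = 0;
    /* size_t index: no 32-bit wraparound semantics to preserve */
    for (size_t i = 0; i < len; ++i) {
        acc42 += src[i+42];
        acc52 += src[i+52];
        acc62 += src[i+62];
        acc72 += src[i+72];
    }
    return acc42*acc52*acc62*acc72;
}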
David Brown wrote:
[...]
I really hope you are not in the habit of reusing function parameter
names as local variables inside the function?
[...]
I would expect to see code very close to this for a sane setup:
[...]
I see that the code is different: Can you actually measure a
performance difference?
I was showing what compilers /do/, not what they /could/ do. It is
certainly possible for the code to check the relationship between "a"
and "b" (and possibly check for awkward end conditions - I haven't
thought through the details) so that the loop can be shorter in the
case when "a <= b", and longer for the case "a > b".
David Brown wrote:
[...]
I was showing what compilers /do/, not what they /could/ do. It is
certainly possible for the code to check the relationship between "a"
and "b" (and possibly check for awkward end conditions - I haven't
thought through the details) so that the loop can be shorter in the
case when "a <= b", and longer for the case "a > b".
How is that possible?
a, b and x are all unsigned, which means that afaik there is no possible
way for x = a; x <= b; x++ to run any iterations, if a is greater than b?
Or am I misattributing the range-based loop to you?
On 22/02/2024 10:44, Terje Mathisen wrote:
[...]
How is that possible?
a, b and x are all unsigned, which means that afaik there is no
possible way for x = a; x <= b; x++ to run any iterations, if a is
greater than b?
Or am I misattributing the range-based loop to you?
Sorry, I was not thinking straight at all. The complication here is
if b is greater than or equal to UINT_MAX. Then there is no end
point, and the loop is infinite. (The C standard actually gives the
implementation leave to assume this cannot happen - it can assume
there are no infinite loops unless the controlling expression is
constant, or the loop contains observable behaviour. But it seems gcc
does not optimise based on this.)
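To make the trap concrete, a minimal sketch (invented function, not
from the thread):

/* If b >= UINT_MAX, then x <= b holds for every value a 32-bit
   unsigned int can take: x wraps from UINT_MAX back to 0 and the
   loop never terminates. */
unsigned long count_iters(unsigned long a, unsigned long b)
{
    unsigned long n = 0;
    for (unsigned int x = a; x <= b; x++)
        n++;
    return n;
}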
David Brown wrote:
[...]
OK, I think I understand your point.
The take-home learning should be that in a range-based loop, the range
endpoints and the loop variable all need to be of the same type.
I.e. one of the many C traps that Rust simply will not allow you to
write or fall into.
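A sketch of the loop repaired under that rule (an illustrative
rewrite of test_uint, not from the thread):

/* Index type matches the endpoint type: no 32-bit wrap is possible,
   and the loop terminates for every b except b == ULONG_MAX, where
   the same trap reappears at the wider width. */
unsigned long
test_ulong(unsigned long a, unsigned long b)
{
    extern unsigned long vector[1000];
    unsigned long result = 0;
    for (unsigned long x = a; x <= b; x++)
        result += vector[x];
    return result;
}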
On 22/02/2024 13:42, Terje Mathisen wrote:
I.e. one of the many C traps that Rust simply will not allow you to
write or fall into.
C is a very flexible language - you can write good code, and bad
code. I think Rust limits you more, which makes it harder to make a
number of errors. But it is quite possible to write C code without
mixing signedness of integers, or getting buffer overflows and memory
leaks, and many of the other things that Rust is designed to prevent.
You need to write your C code in an appropriate way, and you need to
use your tools well (such as lots of static warnings).
For my use, I have yet to see the point of Rust over C++. Rust might
make it slightly easier to avoid some mistakes than C, but it is
lacking the features in C++ that let you have better control. Again,
it is easier to write /bad/ code in C++ - you can write some truly
horrendous stuff. But used well, C++ can protect you against all
kinds of errors.
I haven't tried Rust for more than a "Hello, world" program, though I
have read a bit more about it. It seems to have some nice features
and a more modern outlook - modern C++ is encumbered by having to
stay mostly compatible with C and old C++. But it seems to me to be
lacking features I like in C++, such as constructors in classes,
user-defined conversions (with control of explicit or implicit
conversions), function overloading, public/private distinctions, and
many other features. It's possible that this is because I am
thinking in C++ terms and not thinking in Rust terms, but I have
trouble seeing what Rust would give me that I don't have in C++,
while I can easily see things from C++ that I would miss.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I know no implementation of a 64-bit architecture where ALU operations
(except maybe division, where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
A guess: people did not want sizeof(int) != sizeof(float). float
is certainly faster than double.
It would also have broken Fortran, where storage association rules
mean that both REAL and INTEGER have to have the same size, and DOUBLE
PRECISION twice that. Breaking that would have invalidated just
about every large scientific program at the time.
Cray got away with 64-bit REAL and 128-bit DOUBLE PRECISION because
they were the fastest anyway, but anybody else making that choice
would have been laughed right out of the market.
So, backward compatibility, your favorite topic.
MitchAlsup1 wrote:[...]
The semantics of unsigned are better when you are twiddling bits than
when signed. My programming practice is to use unsigned everywhere that
a negative number is unexpected.
Ditto.
10-4.
It always surprises me to encounter people who think signed types
should be the default, with almost no exceptions. It's one thing
to be somewhat cavalier about shooting yourself in the foot, it's
quite another to deliberately choose a large caliber weapon ahead
of time.
Tim Rentsch wrote:
[...]
I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it
needs to hold a negative number.
With C-promotions, unsigned is safer when mixing signed and unsigned.
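An illustration of the promotion hazard that consistent unsignedness
avoids (a standard example, not from the thread):

#include <stdio.h>

int main(void)
{
    int i = -1;
    unsigned int u = 1;
    /* The usual arithmetic conversions turn i into UINT_MAX, so the
       comparison is false and this prints "greater". */
    if (i < u)
        printf("less\n");
    else
        printf("greater\n");
    return 0;
}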
Thomas Koenig wrote:
[...]
A guess: people did not want sizeof(int) != sizeof(float). float
is certainly faster than double.
Now, only in cache footprint. All your std operations take the same amount
of cycles DP vs. SP. Excepting for cache footprint latency:: perf(DP) == perf(SP)
Terje Mathisen <terje.mathisen@tmsw.no> writes:
The take-home learning should be that in a range-based loop, the range
endpoints and the loop variable all need to be of the same type.
The sext/zext examples <2024Feb20.130029@mips.complang.tuwien.ac.at>
use the same type for the loop variable and the endpoints, and the
unsigned case is still bigger than the int -fwrapv case, and that is
bigger than the (u)intptr_t case.
The actual takeaway lesson is to always use a full-width type, which
avoids the need for extending a sub-width type to full width.
Of course, if you are a compiler supremacist like David Brown, you use
a type like int (without -fwrapv or the like) that gives few
guarantees, and hope that the compiler manages to convert the lack of guarantees into code that is as efficient as what you get when using
the full-width type; in the sext example that works, but I would not
rely on it.
On 24/02/2024 19:48, MitchAlsup1 wrote:
Tim Rentsch wrote:
[...]
I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it
needs to hold a negative number.
I am not convinced there should be a default type at all.
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Another possible reason is that it is very useful to have integer types
with sizes 1, 2, 4 and 8 bytes. C doesn't have many standard integer
types, so if "int" is 64-bit, you have "short" as either 16-bit and have
no 32-bit type, or "short" is 32-bit and you have no 16-bit type. With
32-bit "int", it's easy to have each size without having to add extended
integer types or add new standard integer types (like "short short int"
for 16-bit and "short int" for 32-bit).
short had been 16-bit on 16-bit machines and on 32-bit machines, so
the right choice for 64-bit machines is to make it 16-bit, too. As
for a 32-bit type, that would then obviously be "long short". The C
compiler people had no qualms about adding "long long" when they
wanted something bigger than 32 bits on 32-bit systems, so what should
keep them from adding "long short"?
I saw benchmarks showing x32 being measurably faster,
Sure, it's measurably faster. That's obviously not sufficient for
incurring the cost of x32.
but it's not
unlikely that the differences got less with more modern x86-64
processors (with bigger caches)
Doubtful. The L1 caches have not become bigger since the days of the
K7 (1999) with its 64KB D and I caches (and the K7 actually did not
support AMD64). There has been some growth in L2+L3 combined in
recent years, but x32 already flopped earlier.
and it's simply not worth the effort
having another set of libraries and compiler targets just to make some
kinds of code marginally faster.
Exactly.
And support for 32-bit has /not/ been "eliminated from ARM cores".
Of course it has. E.g., the Cortex-X1 supports A32/T32, and its
descendants Cortex-X2, X3, X4 don't. The Cortex-A710 supports
A32/T32, its successors A715 and A720 do not. Cortex-A510 supports
A32/T32, A520 doesn't.
It
may have been eliminated from the latest AArch64 cores - I don't keep
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
and embedded systems.
Irrelevant for the question at hand: Are the performance benefits of
32-bit applications sufficient to pay for the cost of maintaining a
32-bit software infrastructure on an otherwise 64-bit system? The
answer is no.
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
You need to pick the appropriate size for your data, as I said.
In general-purpose computing, you usually don't know that size. E.g.,
for a sort routine, do you use int_fast32_t, int_fast8_t,
int_fast16_t, or int_fast64_t for the array size?
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason to use
something else for a loop counter.
I like my types to say what they mean. "uintptr_t" says "this object
holds addresses for converting pointer values back and forth with an
integer type".
Exactly. It's the unnamed type of BCPL and B, and the int of Unix C
before the I32LP64 mistake.
Nonetheless, it is good design to use appropriate type names for
appropriate usage. This makes the code clearer, and increases portability.
On the contrary, in the Gforth project we have had many more
portability problems with C code with its integer type zoo than in the
Forth code which just has single-cell (a machine word), double cell,
and char as integer types. Likewise, Forth code from others tends to
be pretty portable between 32-bit and 64-bit systems, even if the code
has only been tested on one kind of system.
As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
And that is why it makes sense to use the "fast" types. If you need a
16-bit range, use "int_fast16_t". It will be 64-bit on 64-bit systems,
32-bit on 32-bit systems, and 16-bit on 16-bit and 8-bit systems -
always supporting the range you need, as fast as possible.
That makes no sense. E.g., an in-memory sorting routine might well
have to sort 100G elements or more on a suitably large (64-bit)
machine. So according to you one should use int_fast64_t. But that
would be slow and unnecessarily large on a 32-bit system where you
cannot hold and sort that many items anyway.
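A sketch of the usual resolution (an invented example; size_t rather
than any int_fastN_t), where the count type scales with what the
platform can actually hold in memory:

#include <stddef.h>

/* Insertion sort with size_t counts: wide enough for any in-memory
   array on a 64-bit machine, without forcing 64-bit arithmetic on a
   32-bit machine. Note the j > 0 guard, which keeps the unsigned
   downward loop from wrapping. */
void isort(long *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        long key = a[i];
        size_t j = i;
        while (j > 0 && a[j-1] > key) {
            a[j] = a[j-1];
            j--;
        }
        a[j] = key;
    }
}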
Don't use "-fwrapv" unless you actually need it - in most code, if
your arithmetic overflows, you have a mistake in your code, so letting
the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Sorry, that comment went over my head - I don't know what
"demonstration" you are referring to.
I wrote in <2024Feb20.091522@mips.complang.tuwien.ac.at>:
|Those ideas are that integer overflows do not happen and that a
|competent programmer proactively prevents them from happening, by
|sizing the types accordingly, and checking the inputs.
Scott Lurndal replied <SW2BN.153110$taff.74839@fx41.iad>:
|Can't say that I've known a programmer who thought that way.
For comparison to C, look at the Zig language. (It's not a language I
have used or know in detail, but I know a little about it.) Unsigned
integer arithmetic overflow is UB in Zig, just like signed integer
arithmetic overflow. There are standard options and block settings
(roughly equivalent to pragmas) to control whether these give a run-time
error, or are assumed never to happen (for optimisation purposes). And
if you want to add "x" and "y" with wrapping, you use "x +% y" as an
explicit choice of operator.
That seems to me to be the right approach for an efficient high-level
language.
I don't know about Zig, but for a language like C, I would prefer
types that ask for trap-on-overflow arithmetic, or for modulo
arithmetic instead of introducing an additional operator.
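GCC and clang already expose the trap-on-overflow building block as a
builtin; a minimal sketch (add_checked is an invented name):

#include <stdlib.h>

/* __builtin_add_overflow returns true if x + y overflowed int;
   abort() stands in for whatever trap such a type would mandate. */
int add_checked(int x, int y)
{
    int z;
    if (__builtin_add_overflow(x, y, &z))
        abort();
    return z;
}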
The undefined option is just a bad idea: Wang et al <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf> found
that it gave a measurable speedup over -fwrapv in only one loop in SPECint2006, and that speedup is only due to the I32LP64 mistake.
I.e., a competently designed dialect would not have that mistake, and
there would be no measurable speedup from the undefined option, just
the possibility of the compiler doing something unexpected.
If you want "int z = x + y;" with wrapping, write :
int z = (unsigned) x + (unsigned) y;
I'll leave that to you, not just for the following reason:
Once upon a time someone like you suggested using some casting
approach for getting x-1>=x (with signed x) to work as intended. I
tried it, and the result was that gcc-3.something still "optimized" it
to false.
And even then, I always put it in a
pragma so that the code works even if someone uses different compiler flags.
Yes, announcing such things in the source code is a good idea in
principle. In practice newer versions of gcc tend to need more sanity
flags (the default of newer gccs is more insanity), and the older
versions do not understand the new flags and fail if you pass them.
So we check every sanity flag in autoconf and use those that the gcc
version used accepts. Doing that through pragmas filled in by
configure does not seem to be any better than using flags through the Makefile.
On 24/02/2024 21:57, Anton Ertl wrote:
The C
compiler people had no qualms about adding "long long" when they
wanted something bigger than 32 bits on 32-bit systems, so what should
keep them from adding "long short"?
They could have done that, yes.
I saw benchmarks showing x32 being measurably faster,
Sure, it's measurably faster. That's obviously not sufficient for
incurring the cost of x32.
Agreed.
but it's not
unlikely that the differences got less with more modern x86-64
processors (with bigger caches)
Doubtful. The L1 caches have not become bigger since the days of the
K7 (1999) with its 64KB D and I caches (and the K7 actually did not
support AMD64). There has been some growth in L2+L3 combined in
recent years, but x32 already flopped earlier.
But other caches have got bigger, and buses between the caches have got
wider (have they not?).
and it's simply not worth the effort
having another set of libraries and compiler targets just to make some
kinds of code marginally faster.
Exactly.
For ARM, you are more likely to get away with saying "you need to
re-compile your old code to run on this new system". In the x86
world, especially for Windows, that's a non-starter. An x86-64
processor that couldn't run x86-32 binaries would not sell well.
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
You need to pick the appropriate size for your data, as I said.
In general-purpose computing, you usually don't know that size. E.g.,
for a sort routine, do you use int_fast32_t, int_fast8_t,
int_fast16_t, or int_fast64_t for the array size?
Oh, I think you often have a good idea about the size. However, I quite agree that if you don't know, you have to pick the biggest practical size.
(And either way, you will want to do some checking and sanitizing of
your data when it comes in.)
David Brown wrote:
On 24/02/2024 21:57, Anton Ertl wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
You need to pick the appropriate size for your data, as I said.
In general-purpose computing, you usually don't know that size. E.g.,
for a sort routine, do you use int_fast32_t, int_fast8_t,
int_fast16_t, or int_fast64_t for the array size?
How about size_t ??
Oh, I think you often have a good idea about the size. However, I
quite agree that if you don't know, you have to pick the biggest
practical size.
We are in the era (now) where the size of individual datums is
irrelevant; only the size of agglomerations of data (arrays, structs)
is relevant. int i; signed long j; are irrelevant in terms of memory
bandwidth after caching.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
[...]
Now, only in cache footprint. All your std operations take the same
amount of cycles DP vs. SP. Excepting for cache footprint latency::
perf(DP) == perf(SP)
Latency, I of course believe you.
Throughput... For 32 vs. 64 bit reals, I would assume that the
area for a unit is about twice as big for the 64-bit version.
You can then calculate twice the number of results for a similar effort.
Hm. AVX-512 can do calculations either on 32 16-bit, 16 32-bit or 8
64-bit reals. Does anybody know how they implemented their floating
point units - are they separate for each precision, or do they use
some clever (or horrible) way to use a single unit for all three?
Thomas Koenig wrote:
[...]
Now, only in cache footprint. All your std operations take the same
amount of cycles DP vs. SP. Excepting for cache footprint latency::
perf(DP) == perf(SP)
FDIV float vs double is still quite often slightly faster, but not close
to 2 X faster even though the mantissa is only 24/53 as large.
Terje
On 23/02/2024 20:55, MitchAlsup1 wrote:
Thomas Koenig wrote:
[...]
Now, only in cache footprint. All your std operations take the same
amount of cycles DP vs. SP. Excepting for cache footprint latency::
perf(DP) == perf(SP)
That's true - except when it is not.
It is not true when you are using vector instructions, and you can do
twice as many SP instructions as DP instructions in the same register
and instruction.
It is not true when you are using accelerators of various kinds, such as graphics card processors.
And it is not true on smaller processors, such as in the embedded world.
On microcontrollers with floating point hardware for single and double precision, SP can be up to twice as fast as DP. And for many of the
more popular microcontrollers, you can have hardware SP but DP is done
in software - the difference there is clearly massive.
But for big processors doing non-vector adds and multiplies, DP and SP
are usually equal in clock cycles (other than memory and cache effects).
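To make the SIMD case concrete, an AVX2 sketch (illustrative, not
from the thread):

#include <immintrin.h>

/* One 256-bit add processes 8 floats but only 4 doubles, so SP
   throughput per instruction is twice DP throughput. */
__m256  add8_sp(__m256 a, __m256 b)   { return _mm256_add_ps(a, b); }
__m256d add4_dp(__m256d a, __m256d b) { return _mm256_add_pd(a, b); }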
David Brown wrote:
On 23/02/2024 20:55, MitchAlsup1 wrote:
[...]
That's true - except when it is not.
It is not true when you are using vector instructions, and you can do
twice as many SP instructions as DP instructions in the same register
and instruction.
You use the word vector where you mean SIMD.
A CRAY-YMP doing single
would not be twice as fast because it was designed for 1 FADD + 1 FMUL +
2 LD + 1 ST per cycle continuously. CRAY-YMP is the epitome of a
Vector machine.
The alternative word would be short-vector instead of SIMD.
On 11/03/2024 20:56, MitchAlsup1 wrote:
[...]
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was talking
in general terms. Perhaps I should have been more precise. I know
this thread talked about "Cray style vectors", but I thought this
branch had diverged - I don't know anywhere near enough about the
details of Cray machines to talk much about them.
On Tue, 12 Mar 2024 11:14:47 +0100
David Brown <david.brown@hesbynett.no> wrote:
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was talking
in general terms. Perhaps I should have been more precise. I know
this thread talked about "Cray style vectors", but I thought this
branch had diverged - I don't know anywhere near enough about the
details of Cray machines to talk much about them.
Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily imagine
vector processors of similar style that have 2 or even 3 times higher
throughput for SP than for DP.
I personally never encountered such a machine, but I would be surprised
if one was never built and sold by one or another of the usual suspects
(maybe Fujitsu?) back in the days when designers liked Cray's style.
Which, of course, leaves the question of what property makes a vector
processor Cray-style. Just having an ALU/FPU several times narrower
than the VRs is, IMHO, not enough to be considered Cray-style.
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operation has higher
throughput (and hopefully, but not necessarily, lower latency) than
full vector operations of the same type.
Michael S wrote:
On Tue, 12 Mar 2024 11:14:47 +0100
David Brown <david.brown@hesbynett.no> wrote:
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was
talking in general terms. Perhaps I should have been more
precise. I know this thread talked about "Cray style vectors",
but I thought this branch had diverged - I don't know anywhere
near enough about the details of Cray machines to talk much about
them.
Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily
imagine vector processors of similar style that have 2 or even 3
times higher throughput for SP than for DP.
I personally never encountered such a machine, but I would be
surprised if one was never built and sold by one or another of the
usual suspects (maybe Fujitsu?) back in the days when designers
liked Cray's style.
While theoretically possible, they did not do this because both halves
of a 2×SP would not necessarily arrive from memory simultaneously.
{Consider a gather load: you need a vector of addresses 2× as long
for pairs of SP values going into a single vector register element.}
Which, of course, leaves the question of what property makes a vector
processor Cray-style. Just having an ALU/FPU several times narrower
than the VRs is, IMHO, not enough to be considered Cray-style.
That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operation has higher
throughput (and hopefully, but not necessarily, lower latency) than
full vector operations of the same type.
On Tue, 12 Mar 2024 17:18:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Michael S wrote:
On Tue, 12 Mar 2024 11:14:47 +0100
David Brown <david.brown@hesbynett.no> wrote:
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was
talking in general terms. Perhaps I should have been more
precise. I know this thread talked about "Cray style vectors",
but I thought this branch had diverged - I don't know anywhere
near enough about the details of Cray machines to talk much about
them.
Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily
imagine vector processors of similar style that have 2 or even 3
times higher throughput for SP than for DP.
I personally never encountered such a machine, but I would be
surprised if one was never built and sold by one or another of the
usual suspects (maybe Fujitsu?) back in the days when designers
liked Cray's style.
While theoretically possible, they did not do this because both halves
of a 2×SP would not necessarily arrive from memory simultaneously.
{Consider a gather load: you need a vector of addresses 2× as long
for pairs of SP values going into a single vector register element.}
Doctor, it hurts when I do this!
So, what prevents you from simply not providing gather with resolution
below 64 bits?
Which, of course, leaves the question of what property makes a vector
processor Cray-style. Just having an ALU/FPU several times narrower
than the VRs is, IMHO, not enough to be considered Cray-style.
That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.
I don't like this definition at all.
For starters, what is "memory"? Does L1D cache count, or only L2 and
higher?
Then, what is "absorb"?
Is the whole VR register file part of the absorbent, or should the
latency be covered by one register?
Is OoO machinery part of the absorbent?
Is HW threading part of the absorbent?
And for any of your possible answers I have my "Why?".
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operation has higher
throughput (and hopefully, but not necessarily, lower latency) than
full vector operations of the same type.
Michael S wrote:
On Tue, 12 Mar 2024 17:18:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
While theoretically possible, they did not do this because both
halves of a 2×SP would not arrive from memory necessarily
simultaneously. {Consider a gather load you need a vector of
addresses 2× as long for pairs of SP going into a single vector
register element.}
Doctor, it hurts when I do this!
So, what prevents you from simply not providing gather with resolution
below 64 bits?
Well, then, you have SP values in a container that could hold 2 and
you don't get any SIMD speedup.
Which, of course, leaves the question of what property makes a
vector processor Cray-style. Just having an ALU/FPU several times
narrower than the VRs is, IMHO, not enough to be considered
Cray-style.
That property is that the length of the vector register is chosen
to absorb the latency to memory. SIMD is too short to have this
property.
I don't like this definition at all.
For starters, what is "memory"? Does L1D cache count, or only L2 and
higher?
Those machines had no L1 or L2 (or LLC) caches. Consider the problems
for which they were designed--arrays as big as the memory (sometimes
bigger!) processed over and over again with numerical algorithms.
Caches would simply miss on each memory reference (ignoring TLB
effects). With the caches never supplying data to the calculations,
why have them at all?
Then, what is "absorb" ?
Absorb means that the first data of a vector arrives and calculation
can start before the last address of the memory reference goes out.
This, in turn, means that one can create a continuous stream of
outbound addresses forever, and thus one can create a stream of
continuous calculations forever. {{Where 'forever' means thousands
of cycles, but nowhere near the lifetime of the universe.}}
Now, obviously, this means the memory system has to be able to make
forward progress on all those memory accesses continuously.
Is the whole VR register file part of the absorbent, or should the
latency be covered by one register?
A single register covers a single memory reference latency.
Is OoO machinery part of the absorbent?
The only OoO in the CRAYs was delivery of gather data back to the
vector register*. Scatter stores were sent out in order, as were the
addresses of the gather loads.
(*) Bank conflicts would delay conflicting accesses, but not those
of other banks, creating an OoO effect in the returning data. This
was re-ordered back to in-order before being forwarded into the
calculation.
Is HW threading part of the absorbent?
Absolutely not--none of the CRAYs did this--later XMPs and YMPs did
use lanes (SIMD with vector) but always did calculations in order
and always sent out addresses (and data when appropriate) in order.
And for
any of your possible answers I have my "Why?".
No harm in asking.
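A back-of-the-envelope illustration of the "absorb" property described
above (the numbers are invented for the example): assume a memory
latency of 50 cycles and one element transferred per cycle.

With VL = 64:  element 0 arrives at cycle 50, while the addresses for
               elements 50..63 are still going out; the load stream
               and the calculation stream overlap and the pipe never
               drains.
With VL = 8:   all 8 addresses are out by cycle 8, and nothing arrives
               until cycle 50; the unit sits idle for 42 cycles unless
               other machinery (OoO, threading) finds work to fill the
               gap.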
On Tue, 12 Mar 2024 19:00:46 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Michael S wrote:
On Tue, 12 Mar 2024 17:18:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
While theoretically possible, they did not do this because both
halves of a 2×SP would not necessarily arrive from memory
simultaneously. {Consider a gather load: you need a vector of
addresses 2× as long for pairs of SP values going into a single
vector register element.}
Doctor, it hurts when I do this!
So, what prevents you from simply not providing gather with
resolution below 64 bits?
Well, then, you have SP values in a container that could hold 2 and
you don't get any SIMD speedup.
(1) - a need for full gather is hopefully rare. The majority of the
time things are accessed contiguously or, at worst, with non-unit
strides.
My personal rule of thumb is "if I need generic gather, most likely I
shouldn't have bothered with vectorizing." Of course, like every rule
of thumb, it's imprecise.
(2) - there are several important applications that naturally have a
pair-wise data layout. Complex numbers is just one.
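Complex multiplication over the usual interleaved re/im layout
illustrates point (2): the accesses are contiguous pairs, so a
compiler can vectorize it with unit-stride loads plus in-register
shuffles, and no gather is needed. A minimal sketch (function name
invented for illustration):

#include <complex.h>

/* Pointwise product of two complex arrays in interleaved re/im
   layout. Unit-stride loads suffice; typical compilers vectorize
   this with shuffles rather than gathers. */
void cmul(float complex *restrict r, const float complex *restrict a,
          const float complex *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        r[i] = a[i] * b[i];
}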
[snip]
It seems we are talking about different things.
You are talking about Cray vectors as done in Cray's 1/X-MP/Y-MP
series, i.e. something fixed, known, and of interest mostly to the
computing historians among us.
OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
My intention is to see what was applicable in more recent times and
which ideas are not totally obsolete for the future.
OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
My intention is to see what was applicable in more recent times and
which ideas are not totally obsolete for the future.
Another way to look at the difference between SSE-style and
Cray-style vectors at the ISA level is that SSE-style vector
instructions (which I'd call "short vectors") are designed under the
assumption that the latency of a vector instruction will be more or
less the same as that of a non-vector instruction (i.e. you have
enough ALUs to do all the operations at the same time), whereas
Cray-style vector instructions (which we could call "long vectors")
are designed under the assumption that the latency will be somewhat
proportional to the length of the vector, because the core of the CPU
will only access a chunk of the vector at a time.
So, short vectors have a fairly free hand at shuffling data across their vector (e.g. bitmatrix transpose), and they can be implemented/scheduled/dispatched just like any other instruction, but
the vector length tends to be severely limited and exposed all over
the place.
In contrast long vectors usually depend on specialized implementations
(e.g. chaining) to get good performance, but their length is
easier/cheaper to change.
AFAICT long vectors made sense when machines could be built with
memory bandwidth to spare while ALUs were the expensive resource.
Nowadays we tend to have the opposite.
Also, the massive number of transistors we spend nowadays on OoO means
that a good OoO CPU can dispatch individual non-vector instructions to
ALUs just as well as the Cray did with its vectors with chaining.
Stefan
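The easier/cheaper-to-change length shows up directly in how
long-vector code is strip-mined: the loop asks the machine how many
elements to take per iteration instead of baking a width into the
instructions. A rough C rendering of the idea (setvl() is a stand-in
for an instruction like RISC-V V's vsetvli; the VLMAX value is
invented):

#include <stddef.h>

/* Illustrative stand-in for a set-vector-length instruction. */
static size_t setvl(size_t remaining)
{
    const size_t VLMAX = 64;            /* hardware maximum, invented */
    return remaining < VLMAX ? remaining : VLMAX;
}

/* Strip-mined DAXPY. The same code runs unchanged on machines with
   different vector register lengths; with SSE-style short vectors
   the width would be baked into the instructions themselves. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    while (n > 0) {
        size_t vl = setvl(n);            /* chunk chosen by hardware */
        for (size_t i = 0; i < vl; i++)  /* one vector op's worth */
            y[i] += a * x[i];
        x += vl; y += vl; n -= vl;
    }
}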
On 3/13/24 3:24 PM, MitchAlsup1 wrote:
Stefan Monnier wrote:[snip]
So, short vectors have a fairly free hand at shuffling data across
their vector (e.g. bitmatrix transpose), and they can be
implemented/scheduled/dispatched just like any other instruction,
but the vector length tends to be severely limited and exposed all
over the place.
Consuming OpCode space like nobody's business.
Is that necessarily the case? Excluding the shuffle operations, I
think only loads and stores would need to have length specifiers.
Shuffle operations become much more expensive with larger
'vectors', so providing the same granularity of shuffle for larger
vectors seems questionable. (With scatter/gather, permute/shuffle
may be less useful anyway.)
Am I missing something when I assume that lane-based (SIMD)
operations do not need size information in the instruction? The
extra metadata is not free (especially since it controls execution,
at least for efficiency), but if opcode explosion is so undesirable,
using metadata might be preferred.
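The combinatorics behind that observation (illustrative numbers, not
from the post): if element size and type live in every ALU opcode,
the counts multiply; if they live only in loads/stores or a mode
field, they add.

8 element types x 30 ALU operations              = 240 typed opcodes
8 typed loads + 8 typed stores + 30 untyped ops  =  46 opcodes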
Paul A. Clayton wrote:
On 3/13/24 3:24 PM, MitchAlsup1 wrote:
Stefan Monnier wrote:[snip]
So, short vectors have a fairly free hand at shuffling data across
their vector (e.g. bitmatrix transpose), and they can be
implemented/scheduled/dispatched just like any other instruction, but
the vector length tends to be severely limited and exposed all over
the place.
Consuming OpCode space like nobody's business.
Is that necessarily the case? Excluding the shuffle operations, I
think only loads and stores would need to have length specifiers.
Add signed byte, add unsigned byte, add signed half, add unsigned
half, add signed int, add unsigned int, add long, add 2 floats, add
double: compared with just add integer, add float, and add double.
And then there is the addsub group, too.
It may be possible to avoid OpCode explosion, but neither x86 nor ARM
got anywhere close to avoiding the deluge of OpCodes.
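For a concrete sense of the deluge: baseline x86-64 (SSE/SSE2)
already spends a dozen mnemonics on "add" alone: paddb, paddw, paddd,
paddq (wrapping integer), paddsb, paddsw (signed saturating),
paddusb, paddusw (unsigned saturating), addps, addpd (packed FP), and
addss, addsd (scalar FP), before any AVX widening.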