anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various
places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?
Most of the problems encountered when moving Unix (System V)
from 16-bit to 32-bit were more around missing typedefs for certain
data types (e.g. uids, gids, pids, etc.), so there was
a lot of code that declared these as shorts, but the 32-bit
kernels defined them as 32-bit (unsigned) values.
That's when uid_t, pid_t, gid_t were added.
Then there were the folks who used 'short [int]' instead of 'int'
since they were the same size on the PDP-11.
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads).
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were faster at 16-bit.
I know no implementation of a 64-bit architecture where ALU operations (except maybe division, where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
Guess 1: There was more software that depended on sizeof(int)==4 than software that depended on sizeof(int)==sizeof(char *).
Guess 2: When benchmarketing without adapting the source code (as is
usual), I32LP64 produced better numbers than ILP64 for some
benchmarks, because arrays and other data structures with int elements
are smaller and have better cache hit rates.
My guess is that it was a mixture of 1 and 2, with 2 being the
decisive factor. I have certainly seen a lot of writing about how
64-bit (pointers) hurt performance, and it even led to the x32
nonsense (which never went anywhere, not surprising to me). These
days support for 32-bit applications is eliminated from ARM cores,
another indication that the performance advantages of 32-bit pointers
are minor.
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
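To illustrate the point, a minimal sketch (example code, not from the
original post; the function and names are invented):

#include <stdint.h>

/* Full-width loop counter: the index is already as wide as a GPR, so
   neither AMD64's zero-extension nor RV64G's sign-extension of 32-bit
   results costs anything extra in the loop. */
long sum(const long *a, intptr_t n)
{
    long s = 0;
    for (intptr_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

With a 32-bit counter the compiler must either prove the extension
away or re-extend the index on every iteration.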
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason to use
something else for a loop counter.
For a type of which you store many in an array or other data
structure, you probably prefer int32_t rather than int_fast32_t if 32
bits is enough. So I don't see a reason for int_fast32_t etc.
These adapt suitably for different targets. If you want to force the
issue, then "int64_t" is IMHO clearer than "long long int" and does
not give a strange impression where you are using a type aimed at
pointer arithmetic for general integer arithmetic.
Why do you bring up "long long int"? As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
If you want fast local variables, use C's [u]int_fastN_t types.
That's what they are for.
I don't see a point in those types. What's wrong with (u)intptr_t IYO?
Don't use "-fwrapv" unless you actually need it - in most code, if
your arithmetic overflows, you have a mistake in your code, so letting
the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?
I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
and for good performance, I write the code appropriately (e.g., by
using intptr_t instead of int).
(And
it lets you check for overflow bugs using run-time sanitizers.)
If the compiler assumes that overflow does not happen, how do these "sanitizers" work?
Anyway, I certainly have code that relies on modulo arithmetic.
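As an illustrative sketch of such code (an invented example, not
taken from Gforth): with -fwrapv the signed overflow in the
multiply-accumulate below is defined to wrap, so the function gives
the same answer everywhere.

#include <stdint.h>

/* Classic multiplicative string hash; h * 31 + c overflows int32_t
   for longer strings and relies on wraparound (guaranteed here by
   -fwrapv, or by doing the arithmetic in unsigned). */
int32_t str_hash(const char *s)
{
    int32_t h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}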
- anton
scott@slp53.sl.home (Scott Lurndal) writes:
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
[...]
In any case, I32LP64 caused breakage for my code, and I expect that
there was more code around with the assumption
sizeof(int)==sizeof(int *) than with the assumption sizeof(int)==4.
Of course, we worked around this misfeature of the C compilers on
Digital OSF/1, but those who assumed sizeof(int)==4 would have
adapted their code if the decision had been for ILP64.
- anton
Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
Guess 1: There was more software that depended on sizeof(int)==4 than
software that depended on sizeof(int)==sizeof(char *).
There remains a lot of SW that is dependent on INT being smaller than LONG.
My ONLY argument is that int was SUPPOSED to be the fastest arithmetic.
It was in PDP-11 days and VAX days, but it is no longer.
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
The semantics of unsigned are better when you are twiddling bits than
when signed.
My programming practice is to use unsigned everywhere that
a negative number is unexpected.
Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?
There are more people willing to pay good money for a wrong answer fast
than there are willing to pay good money for a correct answer slow.
Anton Ertl wrote:
[...] These days support for 32-bit applications is eliminated from
ARM cores, another indication that the performance advantages of
32-bit pointers are minor.
Try programming array codes where the arrays are TB in size with P32.
David Brown <david.brown@hesbynett.no> writes:
On 21/02/2024 17:33, Michael S wrote:
If one wants top performance on 64-bit architectures then avoiding
'unsigned int' indices is a very good idea. Hoping that the compiler
will somehow figure out what you meant instead of doing what you
wrote is naive.
Indeed.
I hope you're not agreeing that unsigned array indices should be
avoided - they should instead be preferred.
On 21/02/2024 18:27, Scott Lurndal wrote:
[...]
I hope you're not agreeing that unsigned array indices should be
avoided - they should instead be preferred.
I am agreeing that "unsigned int" indices should be avoided on 64-bit
architectures (at least, when you are using them in loops) because the
compiler has to generate extra instructions to handle wrapping.
David Brown <david.brown@hesbynett.no> writes:
[...]
I am agreeing that "unsigned int" indices should be avoided on 64-bit
architectures (at least, when you are using them in loops) because the
compiler has to generate extra instructions to handle wrapping.
Does it?
A simple test case shows that using an unsigned long or an unsigned
int as a loop index generates exactly the same non-optimized code,
albeit the unsigned long version was slightly larger (4 bytes)
in code footprint. This was with GCC on 64-bit Fedora Core and
without optimization (with optimization (-O2) both generated
identical code).
unsigned long
test(void)
{
    unsigned int x;                   /* Alternate unsigned long x; */
    unsigned long vector[1000];
    unsigned long result;
    for(x = 0; x < 1000u; x++) {
        unsigned long a = vector[x];
        result += a;
    }
    return result;
}
unsigned int loop index:
Unsigned long loop index:
With -O2:
On 21/02/2024 19:40, Scott Lurndal wrote:
[...]
unsigned int loop index:
There's no point in talking about the most efficient generated code
without enabling optimisation! So this example is irrelevant.
Unsigned long loop index:
This one too.
With -O2:
There's no point in talking about generated code with a source that
uses uninitialised data. The compiler could just as well have turned
the code into "return 0;".
However, in code where the compiler can be sure there is no wrapping on
the index, and can turn the code into pointer arithmetic and stop when
the pointer gets to the end, then no extra code is needed for an
"unsigned int" index.
But if there /is/ a possibility of wrapping, such as in the "sext"
function posted by Anton, it is needed.
Or if we adapt your function to have a range:
unsigned long
test_uint(long unsigned int a, long unsigned int b)
{
   unsigned int x;                   /* Alternate unsigned long x; */
   extern unsigned long vector[1000];
   unsigned long result = 0;
   for(x = a; x <= b; x++) {
       unsigned long a = vector[x];
       result += a;
   }
   return result;
}
Using "unsigned int" instead of "int" or "unsigned long" means 5
instructions in the inner loop, instead of 4. Of course there are
other ways to re-arrange the function to get the optimal result. But
the point is, when the compiler does not know for sure that unsigned
int arithmetic will not wrap, it can lead to extra instructions.
<https://godbolt.org/z/a5en1j5xY>
On Wed, 21 Feb 2024 22:48:52 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
David Brown wrote:
[...]
I really hope you are not in the habit of reusing function
parameter names as local variables inside the function? Also the
various inconsistent integer names (long unsigned int, unsigned
long), are these supposed to be different types?
I would expect to see code very close to this for a sane setup:
  xor rax,rax        ; result variable
  .. setup code for beginning and end of the range, with check that a<=b
next:
  add rax,[rsi]
  lea rsi,[rsi+8]
  cmp esi,edi
  jb next
All four instructions could run in the same cycle, but since these
are unsigned variables we can easily parallelize a bit, using
rax,rbx,rcx,rdx as accumulators, and at this point the performance
will be limited by how many data items we can load per cycle. The
obvious workaround is of course to use SIMD, with a 16/32/64-byte
register as the wide accumulator.
[...]
I see that the code is different: Can you actually measure a
performance difference?
Terje
In that particular case the speed difference is likely hard to
measure on all but the smallest cores.
Below is an example where I would expect a significant difference even
on cores that were state of the art just 4-5 years ago, like Intel
Skylake or Arm Cortex-A76:
double foo(const double* src, unsigned len)
{
    double acc42 = 0;
    double acc52 = 0;
    double acc62 = 0;
    double acc72 = 0;
    for (unsigned i = 0; i < len; ++i) {
        acc42 += src[i+42];
        acc52 += src[i+52];
        acc62 += src[i+62];
        acc72 += src[i+72];
    }
    return acc42*acc52*acc62*acc72;
}
Godbolt links:
x86-64: https://godbolt.org/z/jErxbfshM
ARM64: https://godbolt.org/z/xzr9xMTnW
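For comparison, an illustrative rewrite (not posted in the thread)
that sidesteps the problem with a full-width index: with size_t,
i+42 and friends cannot wrap at 32 bits, so the compiler may fold the
offsets into the addressing mode and vectorize more readily.

#include <stddef.h>

double foo_sz(const double* src, size_t len)
{
    double acc42 = 0;
    double acc52 = 0;
    double acc62 = 0;
    double acc72 = 0;
    /* size_t index: no 32-bit wraparound semantics to preserve */
    for (size_t i = 0; i < len; ++i) {
        acc42 += src[i+42];
        acc52 += src[i+52];
        acc62 += src[i+62];
        acc72 += src[i+72];
    }
    return acc42*acc52*acc62*acc72;
}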
David Brown wrote:
[...]
I really hope you are not in the habit of reusing function parameter
names as local variables inside the function?
[...]
I would expect to see code very close to this for a sane setup:
[...]
I see that the code is different: Can you actually measure a
performance difference?
I was showing what compilers /do/, not what they /could/ do. It is
certainly possible for the code to check the relationship between "a"
and "b" (and possibly check for awkward end conditions - I haven't
thought through the details) so that the loop can be shorter in the
case when "a <= b", and longer for the case "a > b".
David Brown wrote:
[...]
I was showing what compilers /do/, not what they /could/ do. It is
certainly possible for the code to check the relationship between "a"
and "b" (and possibly check for awkward end conditions - I haven't
thought through the details) so that the loop can be shorter in the
case when "a <= b", and longer for the case "a > b".
How is that possible?
a, b and x are all unsigned, which means that afaik there is no possible
way for x = a; x <= b; x++ to run any iterations, if a is greater than b?
Or am I misattributing the range-based loop to you?
On 22/02/2024 10:44, Terje Mathisen wrote:
[...]
How is that possible?
a, b and x are all unsigned, which means that afaik there is no
possible way for x = a; x <= b; x++ to run any iterations, if a is
greater than b?
Or am I misattributing the range-based loop to you?
Sorry, I was not thinking straight at all. The complication here is
if b is greater than or equal to UINT_MAX. Then there is no end
point, and the loop is infinite. (The C standard actually gives the
implementation leave to assume this cannot happen - it can assume
there are no infinite loops unless the controlling expression is
constant, or the loop contains observable behaviour. But it seems gcc
does not optimise based on this.)
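To make the trap concrete, a minimal sketch (invented function, not
from the thread):

/* If b >= UINT_MAX, then x <= b holds for every value a 32-bit
   unsigned int can take: x wraps from UINT_MAX back to 0 and the
   loop never terminates. */
unsigned long count_iters(unsigned long a, unsigned long b)
{
    unsigned long n = 0;
    for (unsigned int x = a; x <= b; x++)
        n++;
    return n;
}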
David Brown wrote:
[...]
OK, I think I understand your point.
The take-home learning should be that in a range-based loop, the range
endpoints and the loop variable all need to be of the same type.
I.e. one of the many C traps that Rust simply will not allow you to
write or fall into.
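A sketch of the loop repaired under that rule (an illustrative
rewrite of test_uint, not from the thread):

/* Index type matches the endpoint type: no 32-bit wrap is possible,
   and the loop terminates for every b except b == ULONG_MAX, where
   the same trap reappears at the wider width. */
unsigned long
test_ulong(unsigned long a, unsigned long b)
{
    extern unsigned long vector[1000];
    unsigned long result = 0;
    for (unsigned long x = a; x <= b; x++)
        result += vector[x];
    return result;
}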
On 22/02/2024 13:42, Terje Mathisen wrote:
I.e. one of the many C traps that Rust simply will not allow you to
write or fall into.
C is a very flexible language - you can write good code, and bad
code. I think Rust limits you more, which makes it harder to make a
number of errors. But it is quite possible to write C code without
mixing signedness of integers, or getting buffer overflows and memory
leaks, and many of the other things that Rust is designed to prevent.
You need to write your C code in an appropriate way, and you need to
use your tools well (such as lots of static warnings).
For my use, I have yet to see the point of Rust over C++. Rust might
make it slightly easier to avoid some mistakes than C, but it is
lacking the features in C++ that let you have better control. Again,
it is easier to write /bad/ code in C++ - you can write some truly
horrendous stuff. But used well, C++ can protect you against all
kinds of errors.
I haven't tried Rust for more than a "Hello, world" program, though I
have read a bit more about it. It seems to have some nice features
and a more modern outlook - modern C++ is encumbered by having to
stay mostly compatible with C and old C++. But it seems to me to be
lacking features I like in C++, such as constructors in classes,
user-defined conversions (with control of explicit or implicit
conversions), function overloading, public/private distinctions, and
many other features. It's possible that this is because I am
thinking in C++ terms and not thinking in Rust terms, but I have
trouble seeing what Rust would give me that I don't have in C++,
while I can easily see things from C++ that I would miss.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I know no implementation of a 64-bit architecture where ALU operations
(except maybe division, where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
A guess: people did not want sizeof(int) != sizeof(float). float
is certainly faster than double.
It would also have broken Fortran, where storage association rules
mean that both REAL and INTEGER have to have the same size, and DOUBLE
PRECISION twice that. Breaking that would have invalidated just
about every large scientific program at the time.
Cray got away with 64-bit REAL and 128-bit DOUBLE PRECISION because
they were the fastest anyway, but anybody else making that choice
would have been laughed right out of the market.
So, backward compatibility, your favorite topic.
MitchAlsup1 wrote:[...]
The semantics of unsigned are better when you are twiddling bits than
when signed. My programming practice is to use unsigned everywhere that
a negative number is unexpected.
Ditto.
10-4.
It always surprises me to encounter people who think signed types
should be the default, with almost no exceptions. It's one thing
to be somewhat cavalier about shooting yourself in the foot, it's
quite another to deliberately choose a large caliber weapon ahead
of time.
Tim Rentsch wrote:
[...]
I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it
needs to hold a negative number.
With C-promotions, unsigned is safer when mixing signed and unsigned.
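An illustration of the promotion hazard that consistent unsignedness
avoids (a standard example, not from the thread):

#include <stdio.h>

int main(void)
{
    int i = -1;
    unsigned int u = 1;
    /* The usual arithmetic conversions turn i into UINT_MAX, so the
       comparison is false and this prints "greater". */
    if (i < u)
        printf("less\n");
    else
        printf("greater\n");
    return 0;
}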
Thomas Koenig wrote:
[...]
A guess: people did not want sizeof(int) != sizeof(float). float
is certainly faster than double.
Now, only in cache footprint. All your std operations take the same amount
of cycles DP vs. SP. Excepting for cache footprint latency:: perf(DP) == perf(SP)
Terje Mathisen <terje.mathisen@tmsw.no> writes:
The take-home learning should be that in a range-based loop, the range
endpoints and the loop variable all need to be of the same type.
The sext/zext examples <2024Feb20.130029@mips.complang.tuwien.ac.at>
use the same type for the loop variable and the endpoints, and the
unsigned case is still bigger than the int -fwrapv case, and that is
bigger than the (u)intptr_t case.
The actual takeaway lesson is to always use a full-width type, which
avoids the need for extending a sub-width type to full width.
Of course, if you are a compiler supremacist like David Brown, you use
a type like int (without -fwrapv or the like) that gives few
guarantees, and hope that the compiler manages to convert the lack of guarantees into code that is as efficient as what you get when using
the full-width type; in the sext example that works, but I would not
rely on it.
On 24/02/2024 19:48, MitchAlsup1 wrote:
Tim Rentsch wrote:
[...]
I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it
needs to hold a negative number.
I am not convinced there should be a default type at all.
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Another possible reason is that it is very useful to have integer types
with sizes 1, 2, 4 and 8 bytes. C doesn't have many standard integer
types, so if "int" is 64-bit, you have "short" as either 16-bit and have
no 32-bit type, or "short" is 32-bit and you have no 16-bit type. With
32-bit "int", it's easy to have each size without having to add extended
integer types or add new standard integer types (like "short short int"
for 16-bit and "short int" for 32-bit).
short had been 16-bit on 16-bit machines and on 32-bit machines, so
the right choice for 64-bit machines is to make it 16-bit, too. As
for a 32-bit type, that would then obviously be "long short". The C
compiler people had no qualms about adding "long long" when they
wanted something bigger than 32 bits on 32-bit systems, so what should
keep them from adding "long short"?
I saw benchmarks showing x32 being measurably faster,
Sure, it's measurably faster. That's obviously not sufficient for
incurring the cost of x32.
but it's not
unlikely that the differences got less with more modern x86-64
processors (with bigger caches)
Doubtful. The L1 caches have not become bigger since the days of the
K7 (1999) with its 64KB D and I caches (and the K7 actually did not
support AMD64). There has been some growth in L2+L3 combined in
recent years, but x32 already flopped earlier.
and it's simply not worth the effort
having another set of libraries and compiler targets just to make some
kinds of code marginally faster.
Exactly.
And support for 32-bit has /not/ been "eliminated from ARM cores".
Of course it has. E.g., the Cortex-X1 supports A32/T32, and its
descendants Cortex-X2, X3, X4 don't. The Cortex-A710 supports
A32/T32, its successors A715 and A720 do not. Cortex-A510 supports
A32/T32, A520 doesn't.
It
may have been eliminated from the latest AArch64 cores - I don't keep
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
and embedded systems.
Irrelevant for the question at hand: Are the performance benefits of
32-bit applications sufficient to pay for the cost of maintaining a
32-bit software infrastructure on an otherwise 64-bit system? The
answer is no.
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
You need to pick the appropriate size for your data, as I said.
In general-purpose computing, you usually don't know that size. E.g.,
for a sort routine, do you use int_fast32_t, int_fast8_t,
int_fast16_t, or int_fast64_t for the array size?
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason to use
something else for a loop counter.
I like my types to say what they mean. "uintptr_t" says "this object
holds addresses for converting pointer values back and forth with an
integer type".
Exactly. It's the unnamed type of BCPL and B, and the int of Unix C
before the I32LP64 mistake.
Nonetheless, it is good design to use appropriate type names for
appropriate usage. This makes the code clearer, and increases portability.
On the contrary, in the Gforth project we have had many more
portability problems with C code with its integer type zoo than in the
Forth code which just has single-cell (a machine word), double cell,
and char as integer types. Likewise, Forth code from others tends to
be pretty portable between 32-bit and 64-bit systems, even if the code
has only been tested on one kind of system.
As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
And that is why it makes sense to use the "fast" types. If you need a
16-bit range, use "int_fast16_t". It will be 64-bit on 64-bit systems,
32-bit on 32-bit systems, and 16-bit on 16-bit and 8-bit systems -
always supporting the range you need, as fast as possible.
That makes no sense. E.g., an in-memory sorting routine might well
have to sort 100G elements or more on a suitably large (64-bit)
machine. So according to you one should use int_fast64_t. But that
would be slow and unnecessarily large on a 32-bit system where you
cannot hold and sort that many items anyway.
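A sketch of the usual resolution (an invented example; size_t rather
than any int_fastN_t), where the count type scales with what the
platform can actually hold in memory:

#include <stddef.h>

/* Insertion sort with size_t counts: wide enough for any in-memory
   array on a 64-bit machine, without forcing 64-bit arithmetic on a
   32-bit machine. Note the j > 0 guard, which keeps the unsigned
   downward loop from wrapping. */
void isort(long *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        long key = a[i];
        size_t j = i;
        while (j > 0 && a[j-1] > key) {
            a[j] = a[j-1];
            j--;
        }
        a[j] = key;
    }
}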
Don't use "-fwrapv" unless you actually need it - in most code, if
your arithmetic overflows, you have a mistake in your code, so letting
the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Sorry, that comment went over my head - I don't know what
"demonstration" you are referring to.
I wrote in <2024Feb20.091522@mips.complang.tuwien.ac.at>:
|Those ideas are that integer overflows do not happen and that a
|competent programmer proactively prevents them from happening, by
|sizing the types accordingly, and checking the inputs.
Scott Lurndal replied <SW2BN.153110$taff.74839@fx41.iad>:
|Can't say that I've known a programmer who thought that way.
For comparison to C, look at the Zig language. (It's not a language I
have used or know in detail, but I know a little about it.) Unsigned
integer arithmetic overflow is UB in Zig, just like signed integer
arithmetic overflow. There are standard options and block settings
(roughly equivalent to pragmas) to control whether these give a run-time
error, or are assumed never to happen (for optimisation purposes). And
if you want to add "x" and "y" with wrapping, you use "x +% y" as an
explicit choice of operator.
That seems to me to be the right approach for an efficient high-level
language.
I don't know about Zig, but for a language like C, I would prefer
types that ask for trap-on-overflow arithmetic, or for modulo
arithmetic instead of introducing an additional operator.
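GCC and clang already expose the trap-on-overflow building block as a
builtin; a minimal sketch (add_checked is an invented name):

#include <stdlib.h>

/* __builtin_add_overflow returns true if x + y overflowed int;
   abort() stands in for whatever trap such a type would mandate. */
int add_checked(int x, int y)
{
    int z;
    if (__builtin_add_overflow(x, y, &z))
        abort();
    return z;
}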
The undefined option is just a bad idea: Wang et al <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf> found
that it gave a measurable speedup over -fwrapv in only one loop in SPECint2006, and that speedup is only due to the I32LP64 mistake.
I.e., a competently designed dialect would not have that mistake, and
there would be no measurable speedup from the undefined option, just
the possibility of the compiler doing something unexpected.
If you want "int z = x + y;" with wrapping, write :
int z = (unsigned) x + (unsigned) y;
I'll leave that to you, not just for the following reason:
Once upon a time someone like you suggested using some casting
approach for getting x-1>=x (with signed x) to work as intended. I
tried it, and the result was that gcc-3.something still "optimized" it
to false.
And even then, I always put it in a
pragma so that the code works even if someone uses different compiler flags.
Yes, announcing such things in the source code is a good idea in
principle. In practice newer versions of gcc tend to need more sanity
flags (the default of newer gccs is more insanity), and the older
versions do not understand the new flags and fail if you pass them.
So we check every sanity flag in autoconf and use those that the gcc
version used accepts. Doing that through pragmas filled in by
configure does not seem to be any better than using flags through the Makefile.
On 24/02/2024 21:57, Anton Ertl wrote:
The C
compiler people had no qualms about adding "long long" when they
wanted something bigger than 32 bits on 32-bit systems, so what should
keep them from adding "long short"?
They could have done that, yes.
I saw benchmarks showing x32 being measurably faster,
Sure, it's measurably faster. That's obviously not sufficient for
incurring the cost of x32.
Agreed.
but it's not
unlikely that the differences got less with more modern x86-64
processors (with bigger caches)
Doubtful. The L1 caches have not become bigger since the days of the
K7 (1999) with its 64KB D and I caches (and the K7 actually did not
support AMD64). There has been some growth in L2+L3 combined in
recent years, but x32 already flopped earlier.
But other caches have got bigger, and buses between the caches have got
wider (have they not?).
and it's simply not worth the effort
having another set of libraries and compiler targets just to make some
kinds of code marginally faster.
Exactly.
For ARM, you are more likely to get away with saying "you need to
re-compile your old code to run on this new system". In the x86
world, especially for Windows, that's a non-starter. An x86-64
processor that couldn't run x86-32 binaries would not sell well.
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
You need to pick the appropriate size for your data, as I said.
In general-purpose computing, you usually don't know that size. E.g.,
for a sort routine, do you use int_fast32_t, int_fast8_t,
int_fast16_t, or int_fast64_t for the array size?
Oh, I think you often have a good idea about the size. However, I quite agree that if you don't know, you have to pick the biggest practical size.
(And either way, you will want to do some checking and sanitizing of
your data when it comes in.)
David Brown wrote:
On 24/02/2024 21:57, Anton Ertl wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"
sizes, picked to fit the range you need).
Sure, and then the program might break when an array has more than
2^31 elements; or it might work on one platform and break on another
one.
You need to pick the appropriate size for your data, as I said.
In general-purpose computing, you usually don't know that size. E.g.,
for a sort routine, do you use int_fast32_t, int_fast8_t,
int_fast16_t, or int_fast64_t for the array size?
How about size_t ??
Oh, I think you often have a good idea about the size. However, I
quite agree that if you don't know, you have to pick the biggest
practical size.
We are in the era (now) where the size of individual datums is
irrelevant; only the size of agglomerations of data (arrays, structs)
is relevant. int i; signed long j; are irrelevant in terms of memory
bandwidth after caching.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
[...]
Now, only in cache footprint. All your std operations take the same
amount of cycles DP vs. SP. Excepting for cache footprint latency::
perf(DP) == perf(SP)
Latency, I of course believe you.
Throughput... For 32 vs. 64 bit reals, I would assume that the
area for a unit is about twice as big for the 64-bit version.
You can then calculate twice the number of results for a similar effort.
Hm. AVX-512 can do calculations either on 32 16-bit, 16 32-bit or 8
64-bit reals. Does anybody know how they implemented their floating
point units - are they separate for each precision, or do they use
some clever (or horrible) way to use a single unit for all three?
Thomas Koenig wrote:
[...]
Now, only in cache footprint. All your std operations take the same
amount of cycles DP vs. SP. Excepting for cache footprint latency::
perf(DP) == perf(SP)
FDIV float vs double is still quite often slightly faster, but not close
to 2 X faster even though the mantissa is only 24/53 as large.
Terje
On 23/02/2024 20:55, MitchAlsup1 wrote:
Thomas Koenig wrote:
[...]
Now, only in cache footprint. All your std operations take the same
amount of cycles DP vs. SP. Excepting for cache footprint latency::
perf(DP) == perf(SP)
That's true - except when it is not.
It is not true when you are using vector instructions, and you can do
twice as many SP instructions as DP instructions in the same register
and instruction.
It is not true when you are using accelerators of various kinds, such as graphics card processors.
And it is not true on smaller processors, such as in the embedded world.
On microcontrollers with floating point hardware for single and double precision, SP can be up to twice as fast as DP. And for many of the
more popular microcontrollers, you can have hardware SP but DP is done
in software - the difference there is clearly massive.
But for big processors doing non-vector adds and multiplies, DP and SP
are usually equal in clock cycles (other than memory and cache effects).
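To make the SIMD case concrete, an AVX2 sketch (illustrative, not
from the thread):

#include <immintrin.h>

/* One 256-bit add processes 8 floats but only 4 doubles, so SP
   throughput per instruction is twice DP throughput. */
__m256  add8_sp(__m256 a, __m256 b)   { return _mm256_add_ps(a, b); }
__m256d add4_dp(__m256d a, __m256d b) { return _mm256_add_pd(a, b); }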
David Brown wrote:
On 23/02/2024 20:55, MitchAlsup1 wrote:
[...]
That's true - except when it is not.
It is not true when you are using vector instructions, and you can do
twice as many SP instructions as DP instructions in the same register
and instruction.
You use the word vector where you mean SIMD.
A CRAY-YMP doing single
would not be twice as fast because it was designed for 1 FADD + 1 FMUL +
2 LD + 1 ST per cycle continuously. CRAY-YMP is the epitome of a
Vector machine.
The alternative word would be short-vector instead of SIMD.
On 11/03/2024 20:56, MitchAlsup1 wrote:
[...]
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was talking
in general terms. Perhaps I should have been more precise. I know
this thread talked about "Cray style vectors", but I thought this
branch had diverged - I don't know anywhere near enough about the
details of Cray machines to talk much about them.
On Tue, 12 Mar 2024 11:14:47 +0100
David Brown <david.brown@hesbynett.no> wrote:
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was talking
in general terms. Perhaps I should have been more precise. I know
this thread talked about "Cray style vectors", but I thought this
branch had diverged - I don't know anywhere near enough about the
details of Cray machines to talk much about them.
Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily imagine
vector processors of similar style that have 2 or even 3 times higher
throughput for SP than for DP.
I personally never encountered such a machine, but I would be surprised
if one was never built and sold by one or another of the usual suspects
(maybe Fujitsu?) back in the days when designers liked Cray's style.
Which, of course, leaves the question of what property makes a vector
processor Cray-style. Just having an ALU/FPU several times narrower
than the VRs is, IMHO, not enough to be considered Cray-style.
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operation has higher
throughput (and hopefully, but not necessarily, lower latency) than
full vector operations of the same type.
Michael S wrote:
On Tue, 12 Mar 2024 11:14:47 +0100
David Brown <david.brown@hesbynett.no> wrote:
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was
talking in general terms. Perhaps I should have been more
precise. I know this thread talked about "Cray style vectors",
but I thought this branch had diverged - I don't know anywhere
near enough about the details of Cray machines to talk much about
them.
Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily
imagine vector processors of similar style that have 2 or even 3
times higher throughput for SP than for DP.
I personally never encountered such a machine, but I would be
surprised if one was never built and sold by one or another of the
usual suspects (maybe Fujitsu?) back in the days when designers
liked Cray's style.
While theoretically possible, they did not do this because both halves
of a 2×SP would not necessarily arrive from memory simultaneously.
{Consider a gather load: you need a vector of addresses 2× as long
for pairs of SP values going into a single vector register element.}
Which, of course, leaves the question of what property makes a vector
processor Cray-style. Just having an ALU/FPU several times narrower
than the VRs is, IMHO, not enough to be considered Cray-style.
That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operation has higher
throughput (and hopefully, but not necessarily, lower latency) than
full vector operations of the same type.
On Tue, 12 Mar 2024 17:18:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Michael S wrote:
On Tue, 12 Mar 2024 11:14:47 +0100
David Brown <david.brown@hesbynett.no> wrote:
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was
talking in general terms. Perhaps I should have been more
precise. I know this thread talked about "Cray style vectors",
but I thought this branch had diverged - I don't know anywhere
near enough about the details of Cray machines to talk much about
them.
Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily
imagine vector processors of similar style that have 2 or even 3
times higher throughput for SP than for DP.
I personally never encountered such a machine, but I would be
surprised if one was never built and sold by one or another of the
usual suspects (maybe Fujitsu?) back in the days when designers
liked Cray's style.
While theoretically possible, they did not do this because both halves
of a 2×SP would not necessarily arrive from memory simultaneously.
{Consider a gather load: you need a vector of addresses 2× as long
for pairs of SP values going into a single vector register element.}
Doctor, it hurts when I do this!
So, what prevents you from simply not providing gather with resolution
below 64 bits?
Which, of course, leaves the question of what property makes a vector
processor Cray-style. Just having an ALU/FPU several times narrower
than the VRs is, IMHO, not enough to be considered Cray-style.
That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.
I don't like this definition at all.
For starters, what is "memory"? Does L1D cache count, or only L2 and
higher?
Then, what is "absorb"?
Is the whole VR register file part of the absorbent, or should the
latency be covered by one register?
Is OoO machinery part of the absorbent?
Is HW threading part of the absorbent?
And for any of your possible answers I have my "Why?".
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operation has higher
throughput (and hopefully, but not necessarily, lower latency) than
full vector operations of the same type.
Michael S wrote:
On Tue, 12 Mar 2024 17:18:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
While theoretically possible, they did not do this because both
halves of a 2×SP would not arrive from memory necessarily
simultaneously. {Consider a gather load you need a vector of
addresses 2× as long for pairs of SP going into a single vector
register element.}
Doctor, it hurts when I do this!
So, what prevents you from simply not providing gather with resolution
below 64 bits?
Well, then, you have SP values in a container that could hold 2 and
you don't get any SIMD speedup.
Which, of course, leaves the question of what property makes a
vector processor Cray-style. Just having an ALU/FPU several times
narrower than the VRs is, IMHO, not enough to be considered
Cray-style.
That property is that the length of the vector register is chosen
to absorb the latency to memory. SIMD is too short to have this
property.
I don't like this definition at all.
For starters, what is "memory"? Does L1D cache count, or only L2 and
higher?
Those machines had no L1 or L2 (or LLC) caches. Consider the problems
for which they were designed--arrays as big as the memory (sometimes
bigger!) processed over and over again with numerical algorithms.
Caches would simply miss on each memory reference (ignoring TLB
effects). With the caches never supplying data to the calculations,
why have them at all?
Then, what is "absorb" ?
Absorb means that the first data of a vector arrives and calculation
can start before the last address of the memory reference goes out.
This, in turn, means that one can create a continuous stream of
outbound addresses forever, and thus one can create a stream of
continuous calculations forever. {{Where 'forever' means thousands
of cycles, but nowhere near the lifetime of the universe.}}
Now, obviously, this means the memory system has to be able to make
forward progress on all those memory accesses continuously.
Is the whole VR register file part of the absorbent, or should the
latency be covered by one register?
A single register covers a single memory reference latency.
Is OoO machinery part of the absorbent?
The only OoO in the CRAYs was delivery of gather data back to the
vector register*. Scatter stores were sent out in order, as were the
addresses of the gather loads.
(*) Bank conflicts would delay conflicting accesses, but not those
of other banks, creating an OoO effect in the returning data. This
was re-ordered back to in-order before being forwarded into the
calculation.
Is HW threading part of the absorbent?
Absolutely not--none of the CRAYs did this--later XMPs and YMPs did
use lanes (SIMD with vector) but always did calculations in order
and always sent out addresses (and data when appropriate) in order.
And for
any of your possible answers I have my "Why?".
No harm in asking.
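A back-of-the-envelope illustration of the "absorb" property described
above (the numbers are invented for the example): assume a memory
latency of 50 cycles and one element transferred per cycle.

With VL = 64:  element 0 arrives at cycle 50, while the addresses for
               elements 50..63 are still going out; the load stream
               and the calculation stream overlap and the pipe never
               drains.
With VL = 8:   all 8 addresses are out by cycle 8, and nothing arrives
               until cycle 50; the unit sits idle for 42 cycles unless
               other machinery (OoO, threading) finds work to fill the
               gap.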
On Tue, 12 Mar 2024 19:00:46 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Michael S wrote:
On Tue, 12 Mar 2024 17:18:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
While theoretically possible, they did not do this because both
halves of a 2×SP would not necessarily arrive from memory
simultaneously. {Consider a gather load: you need a vector of
addresses 2× as long for pairs of SP values going into a single
vector register element.}
Doctor, it hurts when I do this!
So, what prevents you from simply not providing gather with
resolution below 64 bits?
Well, then, you have SP values in a container that could hold 2 and
you don't get any SIMD speedup.
(1) - a need for full gather is hopefully rare. The majority of the
time things are accessed contiguously or, at worst, with non-unit
strides.
My personal rule of thumb is "if I need generic gather, most likely I
shouldn't have bothered with vectorizing." Of course, like every rule
of thumb, it's imprecise.
(2) - there are several important applications that naturally have a
pair-wise data layout. Complex numbers is just one.
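Complex multiplication over the usual interleaved re/im layout
illustrates point (2): the accesses are contiguous pairs, so a
compiler can vectorize it with unit-stride loads plus in-register
shuffles, and no gather is needed. A minimal sketch (function name
invented for illustration):

#include <complex.h>

/* Pointwise product of two complex arrays in interleaved re/im
   layout. Unit-stride loads suffice; typical compilers vectorize
   this with shuffles rather than gathers. */
void cmul(float complex *restrict r, const float complex *restrict a,
          const float complex *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        r[i] = a[i] * b[i];
}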
[snip]
It seems we are talking about different things.
You are talking about Cray vectors as done in Cray's 1/X-MP/Y-MP
series, i.e. something fixed, known, and of interest mostly to the
computing historians among us.
OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
My intention is to see what was applicable in more recent times and
which ideas are not totally obsolete for the future.
OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
My intention is to see what was applicable in more recent times and
which ideas are not totally obsolete for the future.
Another way to look at the difference between SSE-style and
Cray-style vectors at the ISA level is that SSE-style vector
instructions (which I'd call "short vectors") are designed under the
assumption that the latency of a vector instruction will be more or
less the same as that of a non-vector instruction (i.e. you have
enough ALUs to do all the operations at the same time), whereas
Cray-style vector instructions (which we could call "long vectors")
are designed under the assumption that the latency will be somewhat
proportional to the length of the vector, because the core of the CPU
will only access a chunk of the vector at a time.
So, short vectors have a fairly free hand at shuffling data across their vector (e.g. bitmatrix transpose), and they can be implemented/scheduled/dispatched just like any other instruction, but
the vector length tends to be severely limited and exposed all over
the place.
In contrast long vectors usually depend on specialized implementations
(e.g. chaining) to get good performance, but their length is
easier/cheaper to change.
AFAICT long vectors made sense when machines could be built with
memory bandwidth to spare while ALUs were the expensive resource.
Nowadays we tend to have the opposite.
Also, the massive number of transistors we spend nowadays on OoO means
that a good OoO CPU can dispatch individual non-vector instructions to
ALUs just as well as the Cray did with its vectors with chaining.
Stefan
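The easier/cheaper-to-change length shows up directly in how
long-vector code is strip-mined: the loop asks the machine how many
elements to take per iteration instead of baking a width into the
instructions. A rough C rendering of the idea (setvl() is a stand-in
for an instruction like RISC-V V's vsetvli; the VLMAX value is
invented):

#include <stddef.h>

/* Illustrative stand-in for a set-vector-length instruction. */
static size_t setvl(size_t remaining)
{
    const size_t VLMAX = 64;            /* hardware maximum, invented */
    return remaining < VLMAX ? remaining : VLMAX;
}

/* Strip-mined DAXPY. The same code runs unchanged on machines with
   different vector register lengths; with SSE-style short vectors
   the width would be baked into the instructions themselves. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    while (n > 0) {
        size_t vl = setvl(n);            /* chunk chosen by hardware */
        for (size_t i = 0; i < vl; i++)  /* one vector op's worth */
            y[i] += a * x[i];
        x += vl; y += vl; n -= vl;
    }
}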
On 3/13/24 3:24 PM, MitchAlsup1 wrote:
Stefan Monnier wrote:[snip]
So, short vectors have a fairly free hand at shuffling data across
their vector (e.g. bitmatrix transpose), and they can be
implemented/scheduled/dispatched just like any other instruction,
but the vector length tends to be severely limited and exposed all
over the place.
Consuming OpCode space like nobody's business.
Is that necessarily the case? Excluding the shuffle operations, I
think only loads and stores would need to have length specifiers.
Shuffle operations become much more expensive with larger
'vectors', so providing the same granularity of shuffle for larger
vectors seems questionable. (With scatter/gather, permute/shuffle
may be less useful anyway.)
Am I missing something when I assume that lane-based (SIMD)
operations do not need size information in the instruction? The
extra metadata is not free (especially since it controls execution,
at least for efficiency), but if opcode explosion is so undesirable,
using metadata might be preferred.
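The combinatorics behind that observation (illustrative numbers, not
from the post): if element size and type live in every ALU opcode,
the counts multiply; if they live only in loads/stores or a mode
field, they add.

8 element types x 30 ALU operations              = 240 typed opcodes
8 typed loads + 8 typed stores + 30 untyped ops  =  46 opcodes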
Paul A. Clayton wrote:
On 3/13/24 3:24 PM, MitchAlsup1 wrote:
Stefan Monnier wrote:[snip]
So, short vectors have a fairly free hand at shuffling data across
their vector (e.g. bitmatrix transpose), and they can be
implemented/scheduled/dispatched just like any other instruction, but
the vector length tends to be severely limited and exposed all over
the place.
Consuming OpCode space like nobody's business.
Is that necessarily the case? Excluding the shuffle operations, I
think only loads and stores would need to have length specifiers.
Add signed byte, add unsigned byte, add signed half, add unsigned
half, add signed int, add unsigned int, add long, add 2 floats, add
double: compared with just add integer, add float, and add double.
And then there is the addsub group, too.
It may be possible to avoid OpCode explosion, but neither x86 nor ARM
got anywhere close to avoiding the deluge of OpCodes.
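For a concrete sense of the deluge: baseline x86-64 (SSE/SSE2)
already spends a dozen mnemonics on "add" alone: paddb, paddw, paddd,
paddq (wrapping integer), paddsb, paddsw (signed saturating),
paddusb, paddusw (unsigned saturating), addps, addpd (packed FP), and
addss, addsd (scalar FP), before any AVX widening.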