• Re: What integer C type to use

    From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Feb 20 21:03:59 2024
    Scott Lurndal wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Or maybe changing int from 32-bit to 64-bit would have caused
    as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
    day.

    In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
    architectures. Given the history of C, that's not surprising: BCPL
    and B have a single type, the machine word, and it eventually became
    C's int. You see this in "int" declarations being optional in various
    places. So code portable between 16-bit and 32-bit systems could not
    assume that int has a specific size (such as 32 bits), but if it
    assumed that sizeof(int) == sizeof(int *), that would port fine
    between 16-bit and 32-bit Unixes. There may have been C code that
    assumed that sizeof(int)==4, but why cater to this kind of code which
    did not even port to 16-bit systems?

    Most of the problems encountered when moving Unix (System V)
    from 16-bit to 32-bit were more around missing typedefs for certain
    data types (e.g. uids, gids, pids, etc.), so there was
    a lot of code that declared these as shorts, but the 32-bit
    kernels defined these as 32-bit (unsigned) values.

    That's when uid_t, pid_t, gid_t were added.

    Then there were the folks who used 'short [int]' instead of 'int'
    since they were the same size on the PDP-11.

    This is a good reason to build a time machine and send back a few
    of the Sopranos to take care of those programmers.

    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).

    At least the sense of forward progress.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Feb 20 20:58:58 2024
    Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 13:00, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    We are in an era where long has higher performance than ints (except for
    cache footprint overheads.)

    C has been in that era since the bad I32LP64 decision of the people
    who did the first 64-bit Unix compilers in the early 1990s.

    I presume the main reason for this was the size and cost of memory at
    the time? Or do you know any other reason? Maybe some of the early
    64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
    faster at 16-bit.

    I know of no implementation of a 64-bit architecture where ALU operations
    (except maybe division, where present) are slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

    The ALU operations were slower because when you IADD x and y to give z,
    if you then want z to have no significance above 32 bits, you require
    a second instruction to smash the value back into the container range.
    So one ALU operation in 64 bits takes 1 instruction while one ALU operation
    confined to 32 bits requires 2. This is the reason RISC-V has 32-bit ALU
    instructions.
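    For concreteness, a minimal sketch of that cost (hypothetical example; the
    exact instruction selection is of course up to the compiler):

    #include <stdint.h>

    /* On RV64, the dedicated 32-bit "addw" both adds and sign-extends, so a
       32-bit add stays one instruction.  On a 64-bit ISA without such
       instructions, the compiler needs a full-width add followed by a second
       instruction to smash the result back into the 32-bit container. */
    int32_t add32(int32_t x, int32_t y)
    {
        return x + y;       /* RV64: addw a0,a0,a1 */
    }

    int64_t add64(int64_t x, int64_t y)
    {
        return x + y;       /* always a single full-width add */
    }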

    Guess 1: There was more software that depended on sizeof(int)==4 than software that depended on sizeof(int)==sizeof(char *).

    There remains a lot of SW that is dependent on INT being smaller than LONG.

    Guess 2: When benchmarketing without adapting the source code (as is
    usual), I32LP64 produced better numbers than ILP64 for some
    benchmarks, because arrays and other data structures with int elements
    are smaller and have better cache hit rates.

    Some truth in here.

    My guess is that it was a mixture of 1 and 2, with 2 being the
    decisive factor. I have certainly seen a lot of writing about how
    64-bit (pointers) hurt performance, and it even led to the x32
    nonsense (which never went anywhere, not surprising to me). These
    days support for 32-bit applications is eliminated from ARM cores,
    another indication that the performance advantages of 32-bit pointers
    are minor.

    Try programming array codes where the arrays are TB in size with P32.

    My ONLY argument is that int was SUPPOSED to be the fastest arithmetic.
    It was in PDP-11 days and VAX days, but it is no longer.

    BTW, some people here have advocated the use of unsigned instead of
    int. Which of the two results in better code depends on the
    architecture. On AMD64 where the so-called 32-bit instructions
    perform a 32->64-bit zero-extension, unsigned is better. On RV64G
    where the so-called 32-bit instructions perform a 32->64-bit sign
    extension, signed int is better. But actually the best way is to use
    a full-width type like intptr_t or uintptr_t, which gives better
    results than either.
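    A small sketch of that contrast (illustrative code only; what the compiler
    actually emits depends on the target):

    #include <stdint.h>

    /* With a 32-bit unsigned index, AMD64's zero-extending 32-bit operations
       are a good fit; RV64G's sign-extending 32-bit operations favour a signed
       int index.  A register-width index avoids the re-extension question on
       both. */
    long sum_uint(const long *a, unsigned n)
    {
        long s = 0;
        for (unsigned i = 0; i < n; i++)     /* may need re-extension per target */
            s += a[i];
        return s;
    }

    long sum_uintptr(const long *a, uintptr_t n)
    {
        long s = 0;
        for (uintptr_t i = 0; i < n; i++)    /* already full width */
            s += a[i];
        return s;
    }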

    I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
    picked to fit the range you need).

    The semantics of unsigned are better when you are twiddling bits than
    when signed. My programming practice is to use unsigned everywhere that
    a negative number is unexpected.

    Sure, and then the program might break when an array has more than 2^31
    elements; or it might work on one platform and break on another one.

    By contrast, with (u)intptr_t, on modern architectures you use the
    type that's as wide as the GPRs. And I don't see a reason to use
    something else for a loop counter.

    This is equivalent to my argument above--int SHOULD be the fastest kind
    of arithmetic; sadly this tenet has been cast aside.

    For a type of which you store many in an array or other data
    structure, you probably prefer int32_t rather than int_fast32_t if 32
    bits is enough. So I don't see a reason for int_fast32_t etc.
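    In other words (a hedged sketch, with made-up names):

    #include <stdint.h>

    struct sample {
        int32_t value;        /* stored many times: the size matters */
    };

    int64_t total(const struct sample *s, intptr_t n)
    {
        int64_t sum = 0;
        for (intptr_t i = 0; i < n; i++)   /* local counter: full-width type */
            sum += s[i].value;
        return sum;
    }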

    These adapt suitably for different
    targets. If you want to force the issue, then "int64_t" is IMHO clearer
    than "long long int" and does not give a strange impression where you
    are using a type aimed at pointer arithmetic for general integer arithmetic.

    Why do you bring up "long long int"? As for int64_t, that tends to be
    slow (if supported at all) on 32-bit platforms, and it is more than
    what is necessary for indexing arrays and for loop counters that are
    used for indexing into arrays.

    If you want fast local variables, use C's [u]int_fastN_t types. That's
    what they are for.

    I don't see a point in those types. What's wrong with (u)intptr_t IYO?

    In reality, this is a dusty deck problem.

    Don't use "-fwrapv" unless you actually need it - in most
    code, if your arithmetic overflows, you have a mistake in your code, so >>letting the compiler assume that will not happen is a good thing.

    Thank you for giving a demonstration for Scott Lurndal. I assume that
    you claim to be a programmer.

    Anyway, if I have made a mistake in my code, why would letting the
    compiler assume that I did not make a mistake be a good thing?

    There are more people willing to pay good money for a wrong answer fast
    than there are willing to pay good money for a correct answer slow.

    I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
    and for good performance, I write the code appropriately (e.g., by
    using intptr_t instead of int).

    Everything I do is inherently 64-bits except for interfaces to OS
    services.

    (And
    it lets you check for overflow bugs using run-time sanitizers.)

    If the compiler assumes that overflow does not happen, how do these "sanitizers" work?

    Anyway, I certainly have code that relies on modulo arithmetic.
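    As an aside on the sanitizer question: gcc and clang implement this as
    -fsanitize=signed-integer-overflow, which instruments each signed operation
    with a run-time check rather than relying on any particular overflow
    behaviour. A minimal sketch:

    #include <limits.h>

    /* Compile with e.g. "gcc -O2 -fsanitize=signed-integer-overflow"; the
       addition below is then reported at run time when x == INT_MAX. */
    int bump(int x)
    {
        return x + 1;
    }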

    - anton

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Feb 20 21:01:45 2024
    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    Or maybe changing int from 32-bit to 64-bit would have caused
    as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
    day.

    In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
    architectures. Given the history of C, that's not surprising: BCPL
    and B have a single type, the machine word, and it eventually became
    C's int. You see this in "int" declarations being optional in various places. So code portable between 16-bit and 32-bit systems could not
    assume that int has a specific size (such as 32 bits), but if it
    assumed that sizeof(int) == sizeof(int *), that would port fine
    between 16-bit and 32-bit Unixes. There may have been C code that
    assumed that sizeof(int)==4, but why cater to this kind of code which
    did not even port to 16-bit systems?

    In any case, I32LP64 caused breakage for my code, and I expect that

    # define int int64_t
    # define long int64_t
    # define unsigned uint64_t
    # define ...

    there was more code around with the assumption sizeof(int)==sizeof(int
    *) than with the assumption sizeof(int)==4. Of course, we worked
    around this misfeature of the C compilers on Digital OSF/1, but those
    who assumed sizeof(int)==4 would have adapted their code if the
    decision had been for ILP64.

    - anton

  • From David Brown@21:1/5 to All on Wed Feb 21 13:24:00 2024
    On 20/02/2024 21:58, MitchAlsup1 wrote:
    Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 13:00, Anton Ertl wrote:


    Guess 1: There was more software that depended on sizeof(int)==4 than
    software that depended on sizeof(int)==sizeof(char *).

    There remains a lot of SW that is dependent on INT being smaller than LONG.


    Really? They are the same size on all 32-bit platforms. And they are
    the same size on one of the most common 64-bit platforms (Win64).


    My ONLY argument is that int was SUPPOSED to be the fastest arithmetic.
    It was in PDP-11 days and VAX days, but it is no longer.

    It still is on 32-bit and smaller. But it is not on 64-bit platforms.

    It used to be that "int" meant the same as "int_fast16_t" and "long"
    meant "int_fast32_t". As you say, that is no longer the case. That is
    why I prefer types with explicit ranges for many purposes, though I will
    also use "int" for simplicity (or laziness, as some might say!).

    I would suggest C "fast" types like int_fast32_t (or other "fast"
    sizes, picked to fit the range you need).

    The semantics of unsigned are better when you are twiddling bits than
    when signed.

    I agree on that - I don't think signed types are appropriate for any
    kind of bit twiddling.

    My programming practice is to use unsigned everywhere that
    a negative number is unexpected.

    That is a popular practice, but it is not necessarily the best choice.

    Anyway, if I have made a mistake in my code, why would let the
    compiler assume that I did not make a mistake be a good thing?

    There are more people willing to pay good money for a wrong answer fast
    than there are willing to pay good money for a correct answer slow.


    I prefer tools to find my mistakes quickly and then give the right
    answers fast. It seems a better choice than either of your options.
    That is particularly the case in the real-time programming I do, where
    getting the right answer too slow is as wrong as getting the wrong
    answer fast. Time requirements can be part of the correctness
    requirements. (But if yours are the only options, then of course
    correctness trumps speed.)

  • From Terje Mathisen@21:1/5 to All on Wed Feb 21 14:15:28 2024
    MitchAlsup1 wrote:
    Anton Ertl wrote:
    My guess is that it was a mixture of 1 and 2, with 2 being the
    decisive factor.  I have certainly seen a lot of writing about how
    64-bit (pointers) hurt performance, and it even led to the x32
    nonsense (which never went anywhere, not surprising to me).  These
    days support for 32-bit applications is eliminated from ARM cores,
    another indication that the performance advantages of 32-bit pointers
    are minor.

    Try programming array codes where the arrays are TB in size with P32.

    That _does_ work (for some definition of "work") if those TB arrays are
    near-square matrices: both x and y will easily stay in the 32-bit range;
    it is up to the compiler to realize that the effective address of
    element[x,y] needs more than those 32 bits, i.e. you need to generate
    something like

    (uint64_t) y * (uint64_t) y_stride + (uint64_t) x * (uint64_t) x_stride
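    Or, as a sketch in C (names invented for illustration):

    #include <stdint.h>

    double get_elem(const double *base, uint32_t x, uint32_t y,
                    uint64_t x_stride, uint64_t y_stride)
    {
        /* Widen before multiplying: a 32-bit product would wrap long before
           a TB-sized array runs out. */
        uint64_t offset = (uint64_t)y * y_stride + (uint64_t)x * x_stride;
        return base[offset];
    }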

    [snip]
    The semantics of unsigned are better when you are twiddling bits than
    when signed. My programming practice is to use unsigned everywhere that
    a negative number is unexpected.

    Ditto.


    Sure, and then the program might break when an array has more than 2^31
    elements; or it might work on one platform and break on another one.

    By contrast, with (u)intptr_t, on modern architectures you use the
    type that's as wide as the GPRs.  And I don't see a reason to use
    something else for a loop counter.

    You are not alone in thinking that, i.e. usize being the only legal
    array index in Rust.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From David Brown@21:1/5 to Scott Lurndal on Wed Feb 21 19:14:11 2024
    On 21/02/2024 18:27, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:


    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that compiler will
    somehow figure out what you meant instead of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indices should be
    avoided - they should instead be preferred.


    I am agreeing that "unsigned int" indices should be avoided on 64-bit architectures (at least, when you are using them in loops) because the
    compiler has to generate extra instructions to handle wrapping. You
    don't have that problem with larger unsigned types - size_t is often an appropriate choice, but you might have reasons to pick something else
    (an uint_fastN_t type, for example).

    I am also quite happy with "int" as an index type, and don't see a
    problem with it (obviously assuming you know that it has enough range
    for the task - and usually you do).

  • From Scott Lurndal@21:1/5 to David Brown on Wed Feb 21 18:40:09 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 18:27, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:


    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that compiler will
    somehow figure out what you meant instead of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indices should be
    avoided - they should instead be preferred.


    I am agreeing that "unsigned int" indices should be avoided on 64-bit
    architectures (at least, when you are using them in loops) because the
    compiler has to generate extra instructions to handle wrapping

    Does it?

    A simple test case shows that using an unsigned long or unsigned
    int as a loop index generates exactly the same non-optimized code,
    albeit the unsigned long version was slightly larger (4 bytes)
    in code footprint. This with GCC on 64-bit Fedora Core and
    without optimization (with optimization (-O2) both generated
    identical code)

    unsigned long
    test(void)
    {
        unsigned int x;                    /* Alternate unsigned long x; */
        unsigned long vector[1000];
        unsigned long result;

        for(x = 0; x < 1000u; x++) {
            unsigned long a = vector[x];
            result += a;
        }
        return result;
    }


    unsigned int loop index:
    00000000000000 <test>:
    0: 55 push %rbp
    1: 48 89 e5 mov %rsp,%rbp
    4: 48 81 ec e8 1e 00 00 sub $0x1ee8,%rsp
    b: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
    12: eb 1b jmp 2f <test+0x2f>
    14: 8b 45 fc mov -0x4(%rbp),%eax
    17: 48 8b 84 c5 a0 e0 ff mov -0x1f60(%rbp,%rax,8),%rax
    1e: ff
    1f: 48 89 45 e8 mov %rax,-0x18(%rbp)
    23: 48 8b 45 e8 mov -0x18(%rbp),%rax
    27: 48 01 45 f0 add %rax,-0x10(%rbp)
    2b: 83 45 fc 01 addl $0x1,-0x4(%rbp)
    2f: 81 7d fc e7 03 00 00 cmpl $0x3e7,-0x4(%rbp)
    36: 76 dc jbe 14 <test+0x14>
    38: 48 8b 45 f0 mov -0x10(%rbp),%rax
    3c: c9 leaveq
    3d: c3 retq

    Unsigned long loop index:
    00000000000000 <test>:
    0: 55 push %rbp
    1: 48 89 e5 mov %rsp,%rbp
    4: 48 81 ec e8 1e 00 00 sub $0x1ee8,%rsp
    b: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
    12: 00
    13: eb 1d jmp 32 <test+0x32>
    15: 48 8b 45 f8 mov -0x8(%rbp),%rax
    19: 48 8b 84 c5 a0 e0 ff mov -0x1f60(%rbp,%rax,8),%rax
    20: ff
    21: 48 89 45 e8 mov %rax,-0x18(%rbp)
    25: 48 8b 45 e8 mov -0x18(%rbp),%rax
    29: 48 01 45 f0 add %rax,-0x10(%rbp)
    2d: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
    32: 48 81 7d f8 e7 03 00 cmpq $0x3e7,-0x8(%rbp)
    39: 00
    3a: 76 d9 jbe 15 <test+0x15>
    3c: 48 8b 45 f0 mov -0x10(%rbp),%rax
    40: c9 leaveq
    41: c3 retq

    With -O2:
    0000000000000000 <test>:
    0: 48 81 ec d0 1e 00 00 sub $0x1ed0,%rsp
    7: 48 8d 54 24 88 lea -0x78(%rsp),%rdx
    c: 48 8d 8c 24 c8 1e 00 lea 0x1ec8(%rsp),%rcx
    13: 00
    14: 0f 1f 40 00 nopl 0x0(%rax)
    18: 48 03 02 add (%rdx),%rax
    1b: 48 83 c2 08 add $0x8,%rdx
    1f: 48 39 ca cmp %rcx,%rdx
    22: 75 f4 jne 18 <test+0x18>
    24: 48 81 c4 d0 1e 00 00 add $0x1ed0,%rsp
    2b: c3 retq

  • From David Brown@21:1/5 to Scott Lurndal on Wed Feb 21 21:33:46 2024
    On 21/02/2024 19:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 18:27, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:


    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that compiler will
    somehow figure out what you meant instead of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indices should be
    avoided - they should instead be preferred.


    I am agreeing that "unsigned int" indices should be avoided on 64-bit
    architectures (at least, when you are using them in loops) because the
    compiler has to generate extra instructions to handle wrapping

    Does it?

    A simple test case shows that using an unsigned long or unsigned
    int as a loop index generates exactly the same non-optimized code,
    albeit the unsigned long version was slightly larger (4 bytes)
    in code footprint. This with GCC on 64-bit Fedora Core and
    without optimization (with optimization (-O2) both generated
    identical code)

    unsigned long
    test(void)
    {
    unsigned int x; /* Alternate unsigned long x; */
    unsigned long vector[1000];
    unsigned long result;

    for(x = 0; x < 1000u; x++) {
    unsigned long a = vector[x];
    result += a;
    }
    return result;
    }


    unsigned int loop index:

    There's no point in talking about the most efficient generated code
    without enabling optimisation! So this example is irrelevant.


    Unsigned long loop index:

    This one too.


    With -O2:

    There's no point in talking about generated code with a source that uses
    uninitialised data. The compiler could just as well have turned the
    code into "return 0;".

    However, in code where the compiler can be sure there is no wrapping on
    the index, and can turn the code into pointer arithmetic and stop when
    the pointer gets to the end, then no extra code is needed for an
    "unsigned int" index.

    But if there /is/ a possibility of wrapping, such as in the "sext"
    function posted by Anton, it is needed.

    Or if we adapt your function to have a range:

    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
        unsigned int x;                    /* Alternate unsigned long x; */
        extern unsigned long vector[1000];
        unsigned long result = 0;

        for(x = a; x <= b; x++) {
            unsigned long a = vector[x];
            result += a;
        }
        return result;
    }

    Using "unsigned int" instead of "int" or "unsigned long" means 5
    instructions in the inner loop, instead of 4. Of course there are other
    ways to re-arrange the function to get the optimal result. But the
    point is, when the compiler does not know for sure that unsigned int
    arithmetic will not wrap, it can lead to extra instructions.

    <https://godbolt.org/z/a5en1j5xY>

  • From Terje Mathisen@21:1/5 to David Brown on Wed Feb 21 22:48:52 2024
    David Brown wrote:
    On 21/02/2024 19:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 18:27, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:


    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that compiler will
    somehow figure out what you meant instead of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indicies should be
    avoided - they should instead be preferred.


    I am agreeing that "unsigned int" indices should be avoided on 64-bit
    architectures (at least, when you are using them in loops) because the
    compiler has to generate extra instructions to handle wrapping

    Does it?

    A simple test case shows that using an unsigned long or unsigned
    int as a loop index generates exactly the same non-optimized code,
    albeit the unsigned long version was slightly larger (4 bytes)
    in code footprint.   This with GCC on 64-bit Fedora Core and
    without optimization (with optimization (-O2) both generated
    identical code)

    unsigned long
    test(void)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         unsigned long vector[1000];
         unsigned long result;

         for(x = 0; x < 1000u; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }


    unsigned int loop index:

    There's no point in talking about the most efficient generated code
    without enabling optimisation!  So this example is irrelevant.


    Unsigned long loop index:

    This one too.


    With -O2:

    There's no point in talking about generated code with a source that uses
    uninitialised data.  The compiler could just as well have turned the
    code into "return 0;".

    However, in code where the compiler can be sure there is no wrapping on
    the index, and can turn the code into pointer arithmetic and stop when
    the pointer gets to the end, then no extra code is needed for an
    "unsigned int" index.

    But if there /is/ a possibility of wrapping, such as in the "sext"
    function posted by Anton, it is needed.

    Or if we adapt your function to have a range:

    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
        unsigned int x;                    /* Alternate unsigned long x; */
        extern unsigned long vector[1000];
        unsigned long result = 0;

        for(x = a; x <= b; x++) {
            unsigned long a = vector[x];
            result += a;
        }
        return result;
    }

    I really hope you are not in the habit of reusing function parameter
    names as local variables inside the function? Also the various
    inconsistent integer names (long unsigned int, unsigned long), are these supposed to be different types?

    I would expect to see code very close to this for a sane setup:

    xor rax,rax ; result variable
    .. setup code for beginning and end of the range, with check that a<=b
    next:
    add rax,[rsi]
    lea rsi,[rsi+8]
    cmp esi,edi
    jb next

    All four instructions could run in the same cycle, but since these are unsigned variables we can easily parallelize a bit, using
    rax,rbx,rcx,rdx as accumulators, and at this point the performance will
    be limited by how many data items we can load per cycle. The obvious workaround is of course to use SIMD, with a 16/32/64-byte register as
    the wide accumulator.


    Using "unsigned int" instead of "int" or "unsigned long" means 5 instructions in the inner loop, instead of 4.  Of course there are other ways to re-arrange the function to get the optimal result.  But the
    point is, when the compiler does not know for sure that unsigned int arithmetic will not wrap, it can lead to extra instructions.

    <https://godbolt.org/z/a5en1j5xY>

    I see that the code is different: Can you actually measure a performance difference?

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Michael S on Thu Feb 22 01:49:19 2024
    On Thu, 22 Feb 2024 01:37:45 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Wed, 21 Feb 2024 22:48:52 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    David Brown wrote:
    On 21/02/2024 19:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 18:27, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:


    If one wants top performance on 64-bit architectures then
    avoiding 'unsigned int' indices is a very good idea. Hoping
    that compiler will somehow figure out what you meant instead
    of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indices should
    be avoided - they should instead be preferred.


    I am agreeing that "unsigned int" indices should be avoided on
    64-bit architectures (at least, when you are using them in
    loops) because the compiler has to generate extra instructions
    to handle wrapping

    Does it?

    A simple test case shows that using an unsigned long or unsigned
    int as a loop index generates exactly the same non-optimized
    code, albeit the unsigned long version was slightly larger (4
    bytes) in code footprint.   This with GCC on 64-bit Fedora Core
    and without optimization (with optimization (-O2) both generated
    identical code)

    unsigned long
    test(void)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         unsigned long vector[1000];
         unsigned long result;

         for(x = 0; x < 1000u; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }


    unsigned int loop index:

    There's no point in talking about the most efficient generated
    code without enabling optimisation!  So this example is
    irrelevant.

    Unsigned long loop index:

    This one too.


    With -O2:

    There's no point in talking about generated code with a source
    that uses uninitialised data.  The compiler could just as well
    have turned the code into "return 0;".

    However, in code where the compiler can be sure there is no
    wrapping on the index, and can turn the code into pointer
    arithmetic and stop when the pointer gets to the end, then no
    extra code is needed for an "unsigned int" index.

    But if there /is/ a possibility of wrapping, such as in the
    "sext" function posted by Anton, it is needed.

    Or if we adapt your function to have a range:

    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
        unsigned int x;                    /* Alternate unsigned long x; */
        extern unsigned long vector[1000];
        unsigned long result = 0;

        for(x = a; x <= b; x++) {
            unsigned long a = vector[x];
            result += a;
        }
        return result;
    }

    I really hope you are not in the habit of reusing function
    parameter names as local variables inside the function? Also the
    various inconsistent integer names (long unsigned int, unsigned
    long), are these supposed to be different types?

    I would expect to see code very close to this for a sane setup:

    xor rax,rax ; result variable
    .. setup code for beginning and end of the range, with check that
    a<=b
    next:
    add rax,[rsi]
    lea rsi,[rsi+8]
    cmp esi,edi
    jb next

    All four instructions could run in the same cycle, but since these
    are unsigned variables we can easily parallelize a bit, using rax,rbx,rcx,rdx as accumulators, and at this point the performance
    will be limited by how many data items we can load per cycle. The
    obvious workaround is of course to use SIMD, with a 16/32/64-byte
    register as the wide accumulator.


    Using "unsigned int" instead of "int" or "unsigned long" means 5 instructions in the inner loop, instead of 4.  Of course there are
    other ways to re-arrange the function to get the optimal result.
    But the point is, when the compiler does not know for sure that
    unsigned int arithmetic will not wrap, it can lead to extra
    instructions.

    <https://godbolt.org/z/a5en1j5xY>

    I see that the code is different: Can you actually measure a
    performance difference?

    Terje



    In those particular cases the speed difference is likely hard to
    measure on all but the smallest cores.

    Below is the example where I would expect significant difference even
    on cores that were state of the art just 4-5 years ago, like Intel
    Skylake or Arm Cortex-A76:

    double foo(const double* src, unsigned len)
    {
      double acc42 = 0;
      double acc52 = 0;
      double acc62 = 0;
      double acc72 = 0;
      for (unsigned i = 0; i < len; ++i) {
        acc42 += src[i+42];
        acc52 += src[i+52];
        acc62 += src[i+62];
        acc72 += src[i+72];
      }
      return acc42*acc52*acc62*acc72;
    }

    Godbolt links:
    x86-64: https://godbolt.org/z/jErxbfshM
    ARM64: https://godbolt.org/z/xzr9xMTnW





    P.S.
    Thinking a bit more about it... No, on Skylake there wouldn't be a
    significant speed difference. All variants will be latency-bound and will
    run very close to 4 cycles per iteration. In order to see the
    difference we will need a larger number of accumulators - at least 6, but
    8 are better.
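    Something along these lines, say (a hypothetical rewrite just to illustrate
    splitting the sum across eight accumulators, not the same computation as
    foo() above, and not measured):

    double foo8(const double *src, unsigned len)
    {
        double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        double a4 = 0, a5 = 0, a6 = 0, a7 = 0;
        unsigned main_part = len & ~7u;       /* largest multiple of 8 <= len */
        unsigned i;
        for (i = 0; i < main_part; i += 8) {  /* 8 independent dependency chains */
            a0 += src[i+0]; a1 += src[i+1];
            a2 += src[i+2]; a3 += src[i+3];
            a4 += src[i+4]; a5 += src[i+5];
            a6 += src[i+6]; a7 += src[i+7];
        }
        for (; i < len; i++)                  /* tail */
            a0 += src[i];
        return ((a0 + a1) + (a2 + a3)) + ((a4 + a5) + (a6 + a7));
    }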

  • From Michael S@21:1/5 to Terje Mathisen on Thu Feb 22 01:37:45 2024
    On Wed, 21 Feb 2024 22:48:52 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    David Brown wrote:
    On 21/02/2024 19:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 18:27, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:


    If one wants top performance on 64-bit architectures then
    avoiding 'unsigned int' indices is a very good idea. Hoping
    that compiler will somehow figure out what you meant instead
    of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indices should be
    avoided - they should instead be preferred.


    I am agreeing that "unsigned int" indices should be avoided on
    64-bit architectures (at least, when you are using them in loops)
    because the compiler has to generate extra instructions to handle
    wrapping

    Does it?

    A simple test case shows that using an unsigned long or unsigned
    int as a loop index generates exactly the same non-optimized code,
    albeit the unsigned long version was slightly larger (4 bytes)
    in code footprint.   This with GCC on 64-bit Fedora Core and
    without optimization (with optimization (-O2) both generated
    identical code)

    unsigned long
    test(void)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         unsigned long vector[1000];
         unsigned long result;

         for(x = 0; x < 1000u; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }


    unsigned int loop index:

    There's no point in talking about the most efficient generated code without enabling optimisation!  So this example is irrelevant.


    Unsigned long loop index:

    This one too.


    With -O2:

    There's no point in talking about generated code with a source that
    uses uninitialised data.  The compiler could just as well have
    turned the code into "return 0;".

    However, in code where the compiler can be sure there is no
    wrapping on the index, and can turn the code into pointer
    arithmetic and stop when the pointer gets to the end, then no extra
    code is needed for an "unsigned int" index.

    But if there /is/ a possibility of wrapping, such as in the "sext" function posted by Anton, it is needed.

    Or if we adapt your function to have a range:

    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
        unsigned int x;                    /* Alternate unsigned long x; */
        extern unsigned long vector[1000];
        unsigned long result = 0;

        for(x = a; x <= b; x++) {
            unsigned long a = vector[x];
            result += a;
        }
        return result;
    }

    I really hope you are not in the habit of reusing function parameter
    names as local variables inside the function? Also the various
    inconsistent integer names (long unsigned int, unsigned long), are
    these supposed to be different types?

    I would expect to see code very close to this for a sane setup:

    xor rax,rax ; result variable
    .. setup code for beginning and end of the range, with check that
    a<=b
    next:
    add rax,[rsi]
    lea rsi,[rsi+8]
    cmp esi,edi
    jb next

    All four instructions could run in the same cycle, but since these
    are unsigned variables we can easily parallelize a bit, using rax,rbx,rcx,rdx as accumulators, and at this point the performance
    will be limited by how many data items we can load per cycle. The
    obvious workaround is of course to use SIMD, with a 16/32/64-byte
    register as the wide accumulator.


    Using "unsigned int" instead of "int" or "unsigned long" means 5 instructions in the inner loop, instead of 4.  Of course there are
    other ways to re-arrange the function to get the optimal result.
    But the point is, when the compiler does not know for sure that
    unsigned int arithmetic will not wrap, it can lead to extra
    instructions.

    <https://godbolt.org/z/a5en1j5xY>

    I see that the code is different: Can you actually measure a
    performance difference?

    Terje



    In those particular cases the speed difference is likely hard to
    measure on all but the smallest cores.

    Below is the example where I would expect significant difference even
    on cores that were state of the art just 4-5 years ago, like Intel
    Skylake or Arm Cortex-A76:

    double foo(const double* src, unsigned len)
    {
      double acc42 = 0;
      double acc52 = 0;
      double acc62 = 0;
      double acc72 = 0;
      for (unsigned i = 0; i < len; ++i) {
        acc42 += src[i+42];
        acc52 += src[i+52];
        acc62 += src[i+62];
        acc72 += src[i+72];
      }
      return acc42*acc52*acc62*acc72;
    }

    Godbolt links:
    x86-64: https://godbolt.org/z/jErxbfshM
    ARM64: https://godbolt.org/z/xzr9xMTnW

  • From David Brown@21:1/5 to Terje Mathisen on Thu Feb 22 09:48:29 2024
    On 21/02/2024 22:48, Terje Mathisen wrote:
    David Brown wrote:
    On 21/02/2024 19:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 18:27, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:


    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that compiler will
    somehow figure out what you meant instead of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indices should be
    avoided - they should instead be preferred.


    I am agreeing that "unsigned int" indices should be avoided on 64-bit
    architectures (at least, when you are using them in loops) because the
    compiler has to generate extra instructions to handle wrapping

    Does it?

    A simple test case shows that using an unsigned long or unsigned
    int as a loop index generates exactly the same non-optimized code,
    albeit the unsigned long version was slightly larger (4 bytes)
    in code footprint.   This with GCC on 64-bit Fedora Core and
    without optimization (with optimization (-O2) both generated
    identical code)

    unsigned long
    test(void)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         unsigned long vector[1000];
         unsigned long result;

         for(x = 0; x < 1000u; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }


    unsigned int loop index:

    There's no point in talking about the most efficient generated code
    without enabling optimisation!  So this example is irrelevant.


    Unsigned long loop index:

    This one too.


    With -O2:

    There's no point in talking about generated code with a source that
    uses uninitialised data.  The compiler could just as well have turned
    the code into "return 0;".

    However, in code where the compiler can be sure there is no wrapping
    on the index, and can turn the code into pointer arithmetic and stop
    when the pointer gets to the end, then no extra code is needed for an
    "unsigned int" index.

    But if there /is/ a possibility of wrapping, such as in the "sext"
    function posted by Anton, it is needed.

    Or if we adapt your function to have a range:

    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         extern unsigned long vector[1000];
         unsigned long result = 0;

         for(x = a; x <= b; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }

    I really hope you are not in the habit of reusing function parameter
    names as local variables inside the function?

    No, of course not. I just didn't notice that while playing around with
    Scott's function on godbolt.org.

    Also the various
    inconsistent integer names (long unsigned int, unsigned long), are these supposed to be different types?

    Again, it's just playing around. I don't use type names like that in my
    code - I use sized types, or plain "int" for a simple local variable of
    small size.


    I would expect to see code very close to this for a sane setup:

      xor rax,rax ; result variable
      .. setup code for beginning and end of the range, with check that a<=b
    next:
      add rax,[rsi]
      lea rsi,[rsi+8]
      cmp esi,edi
       jb next


    I was showing what compilers /do/, not what the /could/ do. It is
    certainly possible for the code to check the relationships between "a"
    and "b" (and possibly check for awkward end conditions - I haven't
    thought through the details) so that the loop can be shorter in the case
    when "a <= b", and longer for the case "a > b".

    All four instructions could run in the same cycle, but since these are unsigned variables we can easily parallelize a bit, using
    rax,rbx,rcx,rdx as accumulators, and at this point the performance will
    be limited by how many data items we can load per cycle. The obvious workaround is of course to use SIMD, with a 16/32/64-byte register as
    the wide accumulator.


    I turned off SSE on godbolt to make it easier to see how the code is
    generated. For real-world use aiming at maximal efficiency, you'd want
    to let the compiler use every trick it knows.


    Using "unsigned int" instead of "int" or "unsigned long" means 5
    instructions in the inner loop, instead of 4.  Of course there are
    other ways to re-arrange the function to get the optimal result.  But
    the point is, when the compiler does not know for sure that unsigned
    int arithmetic will not wrap, it can lead to extra instructions.

    <https://godbolt.org/z/a5en1j5xY>

    I see that the code is different: Can you actually measure a performance difference?


    godbolt is great for many things, but does not do performance
    measurement as far as I know.

    But you make a good point - sometimes these things do not make a
    difference in speed, on devices like x86-64 processors.

    My work is in embedded systems, and on the cores I use, it is the norm
    that extra instructions means extra time - Cortex-M cores and other microcontrollers are much simpler in their scheduling and execution.
    There can be a certain amount of overlap between instructions, but not
    nearly the kind of thing you see on "big" processors. These are 32-bit
    cores, but I have seen the same thing when code has been ported from
    smaller devices or written by people used to smaller devices, and they
    use "uint8_t" or "uint16_t" for loop variables causing unnecessary extra instructions on 32-bit devices.

  • From Terje Mathisen@21:1/5 to David Brown on Thu Feb 22 10:44:33 2024
    David Brown wrote:
    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
    unsigned int x; /* Alternate unsigned long x; */
    extern unsigned long vector[1000];
    unsigned long result = 0;

    for(x = a; x <= b; x++) {
    unsigned long a = vector[x];
    result += a;
    }
    return result;
    }

    I was showing what compilers /do/, not what the /could/ do.  It is certainly possible for the code to check the relationships between "a"
    and "b" (and possibly check for awkward end conditions - I haven't
    thought through the details) so that the loop can be shorter in the case when "a <= b", and longer for the case "a > b".

    How is that possible?

    a, b and x are all unsigned, which means that afaik there is no possible
    way for x = a; x <= b; x++ to run any iterations, if a is greater than b?

    Or am I misattributing the range-based loop to you?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From David Brown@21:1/5 to Terje Mathisen on Thu Feb 22 11:24:41 2024
    On 22/02/2024 10:44, Terje Mathisen wrote:
    David Brown wrote:
    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         extern unsigned long vector[1000];
         unsigned long result = 0;

         for(x = a; x <= b; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }

    I was showing what compilers /do/, not what the /could/ do.  It is
    certainly possible for the code to check the relationships between "a"
    and "b" (and possibly check for awkward end conditions - I haven't
    thought through the details) so that the loop can be shorter in the
    case when "a <= b", and longer for the case "a > b".

    How is that possible?

    a, b and x are all unsigned, which means that afaik there is no possible
    way for x = a; x <= b; x++ to run any iterations, if a is greater than b?

    Or am I misattributing the range-based loop to you?


    Sorry, I was not thinking straight at all. The complication here is if
    b is greater than or equal to UINT_MAX. Then there is no end point, and
    the loop is infinite. (The C standard actually gives the implementation
    leave to assume this cannot happen - it can assume there are no infinite
    loops unless the controlling expression is constant, or the loop
    contains observable behaviour. But it seems gcc does not optimise based
    on this.)
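    To spell that out (a hedged sketch, hypothetical function names):

    #include <limits.h>

    unsigned long sum_incl(const unsigned long *v, unsigned a, unsigned b)
    {
        unsigned long result = 0;
        for (unsigned x = a; x <= b; x++)   /* never false when b == UINT_MAX */
            result += v[x];
        return result;
    }

    unsigned long sum_excl(const unsigned long *v, unsigned a, unsigned b)
    {
        unsigned long result = 0;
        for (unsigned x = a; x < b; x++)    /* always terminates */
            result += v[x];
        return result;
    }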

  • From Terje Mathisen@21:1/5 to David Brown on Thu Feb 22 13:42:45 2024
    David Brown wrote:
    On 22/02/2024 10:44, Terje Mathisen wrote:
    David Brown wrote:
    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         extern unsigned long vector[1000];
         unsigned long result = 0;

         for(x = a; x <= b; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }

    I was showing what compilers /do/, not what the /could/ do.  It is
    certainly possible for the code to check the relationships between
    "a" and "b" (and possibly check for awkward end conditions - I
    haven't thought through the details) so that the loop can be shorter
    in the case when "a <= b", and longer for the case "a > b".

    How is that possible?

    a, b and x are all unsigned, which means that afaik there is no
    possible way for x = a; x <= b; x++ to run any iterations, if a is
    greater than b?

    Or am I misattributing the range-based loop to you?


    Sorry, I was not thinking straight at all.  The complication here is if
    b is greater than or equal to UINT_MAX.  Then there is no end point, and the loop is infinite.  (The C standard actually gives the implementation leave to assume this cannot happen - it can assume there are no infinite loops unless the controlling expression is constant, or the loop
    contains observable behaviour.  But it seems gcc does not optimise based
    on this.)

    OK, I think I understand your point.

    The take-home learning should be that in a range-based loop, the range
    endpoints and the loop variable all need to be of the same type.

    I.e. one of the many C traps that Rust simply will not allow you to
    write or fall into.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From David Brown@21:1/5 to Terje Mathisen on Thu Feb 22 14:54:48 2024
    On 22/02/2024 13:42, Terje Mathisen wrote:
    David Brown wrote:
    On 22/02/2024 10:44, Terje Mathisen wrote:
    David Brown wrote:
    unsigned long
    test_uint(long unsigned int a, long unsigned int b)
    {
         unsigned int x;                    /* Alternate unsigned long x; */
         extern unsigned long vector[1000];
         unsigned long result = 0;

         for(x = a; x <= b; x++) {
             unsigned long a = vector[x];
             result += a;
         }
         return result;
    }

    I was showing what compilers /do/, not what the /could/ do.  It is
    certainly possible for the code to check the relationships between
    "a" and "b" (and possibly check for awkward end conditions - I
    haven't thought through the details) so that the loop can be shorter
    in the case when "a <= b", and longer for the case "a > b".

    How is that possible?

    a, b and x are all unsigned, which means that afaik there is no
    possible way for x = a; x <= b; x++ to run any iterations, if a is
    greater than b?

    Or am I misattributing the range-based loop to you?


    Sorry, I was not thinking straight at all.  The complication here is
    if b is greater than or equal to UINT_MAX.  Then there is no end
    point, and the loop is infinite.  (The C standard actually gives the
    implementation leave to assume this cannot happen - it can assume
    there are no infinite loops unless the controlling expression is
    constant, or the loop contains observable behaviour.  But it seems gcc
    does not optimise based on this.)

    OK, I think I understand your point.

    It was easier after I had made a real point and not just a muddled mistake!


    The take-home learning should be that in a range-based loop, the range
    endpoints and the loop variable all need to be of the same type.

    I certainly think they should be of the same type. But there are still
    cases where an "int" loop variable gives better code than an "unsigned
    int" - though an unsigned integer type of "full size" for the target is
    likely to be just as good. Oh, and it's a good idea to make sure you
    are not accidentally allowing infinite loops (using < rather than <= helps).


    I.e. one of the many C traps that Rust simply will not allow you to
    write or fall into.


    C is a very flexible language - you can write good code, and bad code.
    I think Rust limits you more, which makes it harder to make a number of
    errors. But it is quite possible to write C code without mixing
    signedness of integers, or getting buffer overflows and memory leaks,
    and many of the other things that Rust is designed to prevent. You need
    to write your C code in an appropriate way, and you need to use your
    tools well (such as lots of static warnings).

    For my use, I have yet to see the point of Rust over C++. Rust might
    make it slightly easier to avoid some mistakes than C, but it is lacking
    the features in C++ that let you have better control. Again, it is
    easier to write /bad/ code in C++ - you can write some truly horrendous
    stuff. But used well, C++ can protect you against all kinds of errors.

    I haven't tried Rust for more than a "Hello, world" program, though I
    have read a bit more about it. It seems to have some nice features and
    a more modern outlook - modern C++ is encumbered by having to stay
    mostly compatible with C and old C++. But it seems to me to be lacking features I like in C++, such as constructors in classes, user-defined conversions (with control of explicit or implicit conversions), function overloading, public/private distinctions, and many other features. It's possible that this is because I am thinking in C++ terms and not
    thinking in Rust terms, but I have trouble seeing what Rust would give
    me that I don't have in C++, while I can easily see things from C++ that
    I would miss.

  • From Michael S@21:1/5 to David Brown on Fri Feb 23 16:17:35 2024
    On Thu, 22 Feb 2024 14:54:48 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 22/02/2024 13:42, Terje Mathisen wrote:


    I.e. one of the many C traps that Rust simply will not allow you to
    write or fall into.


    C is a very flexible language - you can write good code, and bad
    code. I think Rust limits you more, which makes it harder to make a
    number of errors. But it is quite possible to write C code without
    mixing signedness of integers, or getting buffer overflows and memory
    leaks, and many of the other things that Rust is designed to prevent.
    You need to write your C code in an appropriate way, and you need to
    use your tools well (such as lots of static warnings).

    For my use, I have yet to see the point of Rust over C++. Rust might
    make it slightly easier to avoid some mistakes than C, but it is
    lacking the features in C++ that let you have better control. Again,
    it is easier to write /bad/ code in C++ - you can write some truly
    horrendous stuff. But used well, C++ can protect you against all
    kinds of errors.

    I haven't tried Rust for more than a "Hello, world" program, though I
    have read a bit more about it. It seems to have some nice features
    and a more modern outlook - modern C++ is encumbered by having to
    stay mostly compatible with C and old C++. But it seems to me to be
    lacking features I like in C++, such as constructors in classes,
    user-defined conversions (with control of explicit or implicit
    conversions), function overloading, public/private distinctions, and
    many other features. It's possible that this is because I am
    thinking in C++ terms and not thinking in Rust terms, but I have
    trouble seeing what Rust would give me that I don't have in C++,
    while I can easily see things from C++ that I would miss.



    IMHO, Rust has exactly one advantage - it facilitates static code
    analysis to a level that is impossible in most (all?) other popular
    imperative languages.
    Other than that, it is the worst, least productive [among relatively
    new, Algol-family] programming language I ever took a look at.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Feb 23 19:55:39 2024
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations
    (except maybe division where present) is slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

    A guess: people did not want sizeof(int) != sizeof(float). float
    is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount
    of cycles DP vs. SP. Excepting for cache footprint latency::
    perf(DP) == perf(SP)

    It would also have broken Fortran, where storage association rules mean
    that both REAL and INTEGER have to have the same size, and DOUBLE
    PRECISION twice that. Breaking that would have invalidated just
    about every large scientific program at the time.

    Yes, my argument that int be the fastest integral data type does not
    extend to languages that mandate a particular size. FORTRAN mandates
    a size, C implies int is faster than short or long--which is no
    longer true.

    Cray got away with 64-bit REAL and 128-bit DOUBLE PRECISION because
    they were the fastest anyway, but anybody else making that choice
    would have been laughed right out of the market.

    And ints were stored in 64-bit containers but only had 24 bits of
    significance (later 32 with the X-MP).

    So, backward compatibility, your favorite topic.

  • From Anton Ertl@21:1/5 to Terje Mathisen on Sat Feb 24 10:32:40 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The take-home learning should be that in a range-based loop, the range
    endpoints and the loop variable all need to be of the same type.

    The sext/zext examples <2024Feb20.130029@mips.complang.tuwien.ac.at>
    use the same type for the loop variable and the endpoints, and the
    unsigned case is still bigger than the int -fwrapv case, and that is
    bigger than the (u)intptr_t case.

    The actual takeaway lesson is to always use a full-width type, which
    avoids the need for extending a sub-width type to full width.

    Of course, if you are a compiler supremacist like David Brown, you use
    a type like int (without -fwrapv or the like) that gives few
    guarantees, and hope that the compiler manages to convert the lack of guarantees into code that is as efficient as what you get when using
    the full-width type; in the sext example that works, but I would not
    rely on it.

    Just to avoid readers having to look up the posting referred to above,
    here's the part that the present posting is about:

    |void sext(int M, int *ic, int *is)
    |{
    |  int k;
    |  for (k = 1; k <= M; k++) {
    |    ic[k] += is[k];
    |  }
    |}
    |
    |which is based on the only loop (from 456.hmmer) in SPECint 2006 where
    |the difference between -fwrapv and the default produces a measurable
    |performance difference (as reported in section 3.3 of
    |<https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>). I
    |created variations of this function, where the types of M and k were
    |changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
    |code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops". The loop body
    |looks as follows on RV64GC:
    |
| int                  unsigned              (u)intptr_t
|.L3:                  .L8:                  .L15:
|slli a5,a4,0x2        slli a5,a4,0x20       lw a5,0(a1)
|add a6,a1,a5          srli a5,a5,0x1e       lw a4,4(a2)
|add a5,a5,a2          add a6,a1,a5          addi a1,a1,4
|lw a3,0(a6)           add a5,a5,a2          addi a2,a2,4
|lw a5,0(a5)           lw a3,0(a6)           addw a5,a5,a4
|addiw a4,a4,1         lw a5,0(a5)           sw a5,-4(a1)
|addw a5,a5,a3         addiw a4,a4,1         bne a2,a3,54 <.L15>
|sw a5,0(a6)           addw a5,a5,a3
|bge a0,a4,6 <.L3>     sw a5,0(a6)
|                      bgeu a0,a4,28 <.L8>
    |
    |There is no difference between the intptr_t loop body and the
|uintptr_t loop. And without -fwrapv, the int loop looks just like the
|(u)intptr_t loop (because the C compiler then assumes that signed
    |integer overflow does not happen).

    And just so you can play around yourself, here you find the source
    code of all four variants, and the AMD64 and RV64GC output with
    gcc-12.2: <https://godbolt.org/z/6a7TTTTM1>
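
(For readers who do not want to follow the link: the uintptr_t variant is
just the same function with M and k retyped, i.e. something like the
following sketch, reconstructed from the description above rather than
copied from the Godbolt page:)

  #include <stdint.h>

  void sext_uintptr(uintptr_t M, int *ic, int *is)
  {
    uintptr_t k;
    for (k = 1; k <= M; k++) {
      ic[k] += is[k];
    }
  }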

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Sat Feb 24 09:32:19 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    MitchAlsup1 wrote:
    [...]
The semantics of unsigned are better when you are twiddling bits than
those of signed. My programming practice is to use unsigned everywhere that
    a negative number is unexpected.

    Ditto.

    10-4.

    It always surprises me to encounter people who think signed types
    should be the default, with almost no exceptions. It's one thing
    to be somewhat cavalier about shooting yourself in the foot, it's
    quite another to deliberately choose a large caliber weapon ahead
    of time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sat Feb 24 18:48:37 2024
    Tim Rentsch wrote:

    10-4.

    It always surprises me to encounter people who think signed types
    should be the default, with almost no exceptions. It's one thing
    to be somewhat cavalier about shooting yourself in the foot, it's
    quite another to deliberately choose a large caliber weapon ahead
    of time.

    I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it
    needs to hold a negative number.

    With C-promotions, unsigned is safer when mixing signed and unsigned.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Sat Feb 24 21:14:52 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    10-4.

    It always surprises me to encounter people who think signed types
    should be the default, with almost no exceptions. It's one thing
    to be somewhat cavalier about shooting yourself in the foot, it's
    quite another to deliberately choose a large caliber weapon ahead
    of time.

    I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it
    needs to hold a negative number.

    I expect you mean that the proper default choice for signedness
    is unsigned -- unsigned int rather than int, unsigned long rather
    than long, etc. I agree with that view. Whether unsigned int is
    a better default than unsigned long, or vice versa, is another
    discussion.

    With C-promotions, unsigned is safer when mixing signed and unsigned.

    Here too I expect you mean when the respective types have the
    same width. Mixing a signed long and an unsigned int, for
    example, is a red flag in my book.
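
A minimal sketch of the same-width trap (illustrative only, names are
mine): with operands of the same width the signed side is converted to
unsigned, so a negative value compares as a huge positive one; with a
wider signed type the comparison behaves as expected.

  #include <stdio.h>

  int main(void)
  {
      int i = -1;
      unsigned int u = 1u;

      /* usual arithmetic conversions: i is converted to unsigned,
         -1 becomes UINT_MAX, and the test is false */
      if (i < u)
          printf("same width: behaves as signed\n");
      else
          printf("same width: -1 compares as UINT_MAX\n");

      /* with a wider signed type, u converts to long instead (on
         platforms where long is wider than int) and the comparison
         is mathematically sane */
      long l = -1;
      if (l < u)
          printf("mixed width: long absorbs the unsigned int\n");
      return 0;
  }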

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Sun Feb 25 13:40:59 2024
    On 23/02/2024 20:55, MitchAlsup1 wrote:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations
    (except maybe division where present) is slower in 64 bits than in 32
    bits.  I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount
    of cycles DP vs. SP. Excepting for cache footprint latency:: perf(DP) == perf(SP)


    That's true - except when it is not.

    It is not true when you are using vector instructions, and you can do
    twice as many SP instructions as DP instructions in the same register
    and instruction.

    It is not true when you are using accelerators of various kinds, such as graphics card processors.

    And it is not true on smaller processors, such as in the embedded world.
    On microcontrollers with floating point hardware for single and double precision, SP can be up to twice as fast as DP. And for many of the
    more popular microcontrollers, you can have hardware SP but DP is done
    in software - the difference there is clearly massive.

    But for big processors doing non-vector adds and multiplies, DP and SP
    are usually equal in clock cycles (other than memory and cache effects).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Sun Feb 25 15:25:56 2024
    On 24/02/2024 11:32, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
The take-home learning should be that in a range-based loop, the range
endpoints and the loop variable all need to be of the same type.

    The sext/zext examples <2024Feb20.130029@mips.complang.tuwien.ac.at>
    use the same type for the loop variable and the endpoints, and the
    unsigned case is still bigger than the int -fwrapv case, and that is
    bigger than the (u)intptr_t case.

    The actual takeaway lesson is to always use a full-width type, which
    avoids the need for extending a sub-width type to full width.

    Of course, if you are a compiler supremacist like David Brown, you use
    a type like int (without -fwrapv or the like) that gives few
    guarantees, and hope that the compiler manages to convert the lack of guarantees into code that is as efficient as what you get when using
    the full-width type; in the sext example that works, but I would not
    rely on it.

    If by "compiler supremacist" you mean "someone who tries their best to understand the language and the tools", then us "compiler supremacists"
    know that you have a great deal of guarantees about "int". We also know
    that you have few absolute guarantees about performance and exact code generation for any types - you only have guarantees about observable
    behaviour.

    So you need to use types (or options) that first and foremost guarantee
    the /behaviour/ of your code will be correct according to your needs.
    And if you need the best efficiency, you have to look at generated code,
    and measure the results, and try out different combinations of source
    code and compiler flags. Rough rules can help - such as making sure the compiler knows you won't have wrapping or overflow (so use "int" or big
    types) - but never guarantee performance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Sun Feb 25 16:48:32 2024
    On 24/02/2024 19:48, MitchAlsup1 wrote:
    Tim Rentsch wrote:

    10-4.

    It always surprises me to encounter people who think signed types
    should be the default, with almost no exceptions.  It's one thing
    to be somewhat cavalier about shooting yourself in the foot, it's
    quite another to deliberately choose a large caliber weapon ahead
    of time.

    I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it needs
    to hold a negative number.


    I am not convinced there should be a default type at all.

    But like it or not, "int" is the nearest C has to a default type. It is
    the type for things like integer constants. That does not mean it's the
    type you should choose for everything, but it will turn up very often.

    With C-promotions, unsigned is safer when mixing signed and unsigned.

    I don't like the way C handles this, and choose compiler warnings to
    tell me if I am doing it accidentally. If you don't know the C rules
    here, or are not careful about your types, then mixing signed and
    unsigned types is going to be unsafe no matter what types you pick -
    because by definition, if you have mixing, then you have a signed and an unsigned type involved. The safe thing is to stick to either all
    signed, or all unsigned - then there is no mixing.
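
A minimal sketch of the kind of accidental mixing those warnings catch
(assuming gcc or clang; the exact set of flags is a matter of taste):

  /* compile with: gcc -Wall -Wextra -Wsign-conversion -c mix.c */
  #include <stddef.h>

  long sum_below(const int *a, size_t n, int limit)
  {
      long sum = 0;
      for (int i = 0; i < n; i++) {   /* -Wsign-compare: int vs size_t */
          if (a[i] >= limit)
              break;
          sum += a[i];
      }
      return sum;
  }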

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Sun Feb 25 18:52:05 2024
    David Brown <david.brown@hesbynett.no> schrieb:
    On 24/02/2024 19:48, MitchAlsup1 wrote:
    Tim Rentsch wrote:

    10-4.

    It always surprises me to encounter people who think signed types
    should be the default, with almost no exceptions.  It's one thing
    to be somewhat cavalier about shooting yourself in the foot, it's
    quite another to deliberately choose a large caliber weapon ahead
    of time.

    I am of the opinion that the proper default type is unsigned--the
only reason to use a signed type is if at some point in its life it needs
    to hold a negative number.


    I am not convinced there should be a default type at all.

Fortran has a default integer and a default real, and a default
    double precision.

These days (since 1991, so it has been around for longer than most programming
languages have existed) it has a KIND mechanism. For example, if you need
    at least 12 significant digits, you can write

    integer, parameter :: wp = selected_real_kind(12)

to declare a kind parameter wp which you can then use in the declaration
    of a real variable, such as

    real(wp) :: a, b, c

and can use as a suffix on constants:

    a = 1.2_wp

    Same mechanism for integers.

Later revisions also included 32-bit and 64-bit types
    from an intrinsic module.

Same as in C with int: INTEGER is what you use if you don't really
care. There is a restriction that Fortran's default integer
has at least five decimal digits, so 17 bits plus sign. PDP-11 ints would
    no longer be conforming, I'm afraid :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Sun Feb 25 20:22:40 2024
    On 24/02/2024 21:57, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 18:47, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 13:00, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Another possible reason is that it is very useful to have integer types
    with sizes 1, 2, 4 and 8 bytes. C doesn't have many standard integer
    types, so if "int" is 64-bit, you have "short" as either 16-bit and have
    no 32-bit type, or "short" is 32-bit and you have no 16-bit type. With
    32-bit "int", it's easy to have each size without having to add extended
    integer types or add new standard integer types (like "short short int"
    for 16-bit and "short int" for 32-bit).

    short had been 16-bit on 16-bit machines and on 32-bit machines, so
the right choice for 64-bit machines is to make it 16-bit, too. As
    for a 32-bit type, that would then obviously be "long short".

    OK. I am not convinced that "char, short, long short, int, long" would
    be that much better than "char, short short, short, int, long",
    especially when "long long" is added to the end, but either would do.

    (There was an April Fool's proposal for the C standards with a method of
    mixing "long" and "short" qualifiers to get any size of integer you
    like. I haven't been able to find it, but I'd be very grateful if
    anyone has a link.)

    The C
compiler people had no qualms about adding "long long" when they wanted
something bigger than 32 bits on 32-bit systems, so what should keep
    them from adding "long short"?

    They could have done that, yes.


    I saw benchmarks showing x32 being measurably faster,

    Sure, it's measurably faster. That's obviously not sufficient for
    incurring the cost of x32.


    Agreed.

    but it's not
    unlikely that the differences got less with more modern x86-64
    processors (with bigger caches)

    Doubtful. The L1 caches have not become bigger since the days of the
    K7 (1999) with its 64KB D and I caches (and the K7 actually did not
    support AMD64). There has been some growth in L2+L3 combined in
    recent years, but x32 already flopped earlier.


    But other caches have got bigger, and buses between the caches have got
    wider (have they not?).

    and it's simply not worth the effort
    having another set of libraries and compiler targets just to make some
    kinds of code marginally faster.

    Exactly.

    And support for 32-bit has /not/ been "eliminated from ARM cores".

    Of course it has. E.g., the Cortex-X1 supports A32/T32, and its
    descendants Cortex-X2, X3, X4 don't. The Cortex-A710 supports
A32/T32, its successors A715 and A720 do not. Cortex-A510 supports
    A32/T32, A520 doesn't.

    It
    may have been eliminated from the latest AArch64 cores - I don't keep
    good track of these. But for every such core sold, there will be
    hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
    and embedded systems.

    Irrelevant for the question at hand: Are the performance benefits of
    32-bit applications sufficient to pay for the cost of maintaining a
    32-bit software infrastructure on an otherwise 64-bit system? The
    answer is no.


    For ARM, you are more likely to get away with saying "you need to
    re-compile your old code to run on this new system". In the x86 world, especially for Windows, that's a non-starter. An x86-64 processor that couldn't run x86-32 binaries would not sell well.

    I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, >>>> picked to fit the range you need).

Sure, and then the program might break when an array has more than 2^31
    elements; or it might work on one platform and break on another one.


    You need to pick the appropriate size for your data, as I said.

    In general-purpose computing, you usually don't know that size. E.g.,
    for a sort routine, do you use int_fast32_t, int_fast8_t,
    int_fast16_t, or int_fast64_t for the array size?


    Oh, I think you often have a good idea about the size. However, I quite
    agree that if you don't know, you have to pick the biggest practical size.

    (And either way, you will want to do some checking and sanitizing of
    your data when it comes in.)

    By contrast, with (u)intptr_t, on modern architectures you use the
    type that's as wide as the GPRs. And I don't see a reason why to use
    something else for a loop counter.

    I like my types to say what they mean. "uintptr_t" says "this object
    holds addresses for converting pointer values back and forth with an
    integer type".

    Exactly. It's the unnamed type of BCPL and B, and the int of Unix C
    before the I32LP64 mistake.


    This is 2024. No one cares about BCPL or B. No one even cares about
    how people wrote C before I32LP64 systems were common. The "int" of C
    is "int" - that's what almost all C programmers use when they don't have
    a good reason for picking something else, or when they simply don't
    think about it. You might not like that (and I would certainly prefer
    that more programmers thought more about their types), but that's the
    reality.

    When I am writing C, if I want a type that represents the size of an
    unknown object, I use "size_t" - because that is the type used by the C standards and by C code. I couldn't care less what types BCPL used. I couldn't care less what types were used in C programming before the
    "sizeof" operator gave a value of type "size_t", or before standard
    library functions like "memcpy" used parameters of type "size_t".

    If I want an unsigned type that can hold up to 64 bits, I use
    "uint64_t". If I want a type that will hold a pointer converted to an
    integer type, I use "uintptr_t".

    I fully appreciate that you prefer to use unsigned types for
    general-purpose variables or loop indexes, and I fully appreciate you
    want them to be 64-bit for the systems you target. But I don't
    understand why you want to call it "uintptr_t".

    Nonetheless, it is good design to use appropriate type names for
    appropriate usage. This makes the code clearer, and increases portability.

    On the contrary, in the Gforth project we have had many more
    portability problems with C code with its integer type zoo than in the
    Forth code which just has single-cell (a machine word), double cell,
    and char as integer types. Likewise, Forth code from others tends to
    be pretty portable between 32-bit and 64-bit systems, even if the code
    has only been tested on one kind of system.


    Forth always uses 16-bit cells, if I remember correctly? (It is a
    /very/ long time since I have tried writing anything in Forth, and that
    was on 8-bit systems.) If you want fixed size types, why not use them
    in C? "int32_t" is the same size in every C implementation.

    As for int64_t, that tends to be
    slow (if supported at all) on 32-bit platforms, and it is more than
    what is necessary for indexing arrays and for loop counters that are
    used for indexing into arrays.


    And that is why it makes sense to use the "fast" types. If you need a
    16-bit range, use "int_fast16_t". It will be 64-bit on 64-bit systems,
    32-bit on 32-bit systems, and 16-bit on 16-bit and 8-bit systems -
    always supporting the range you need, as fast as possible.

    That makes no sense. E.g., an in-memory sorting routine might well
    have to sort 100G elements or more on a suitably large (64-bit)
machine. So according to you one should use int_fast64_t. But that
    would be slow and unnecessarily large on a 32-bit system where you
    cannot hold and sort that many items anyway.


    That would be a case where you don't know the range you need in absolute
    terms, but you know them in C terms - the type to use is "size_t".
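
As a sketch of the distinction (the widths shown are typical rather than
guaranteed): size_t is by definition wide enough for any object size on
the platform, while the "fast" types only promise a minimum range and
leave the actual width to the implementation.

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  /* element counts of in-memory arrays: size_t by construction */
  void sort_doubles(double *a, size_t n);

  int main(void)
  {
      /* int_fast16_t guarantees at least 16 bits of range; its width
         is whatever the implementation considers fastest, e.g. 8 bytes
         with glibc on x86-64 but 4 bytes on a typical 32-bit target */
      printf("int_fast16_t: %zu bytes\n", sizeof(int_fast16_t));
      printf("int_fast32_t: %zu bytes\n", sizeof(int_fast32_t));
      printf("size_t      : %zu bytes\n", sizeof(size_t));
      return 0;
  }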

    Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.

    Thank you for giving a demonstration for Scott Lurndal. I assume that
    you claim to be a programmer.


    Sorry, that comment went over my head - I don't know what
    "demonstration" you are referring to.

    I wrote in <2024Feb20.091522@mips.complang.tuwien.ac.at>:
    |Those ideas are that integer overflows do not happen and that a
    |competent programmer proactively prevents them from happening, by
    |sizing the types accordingly, and checking the inputs.

    Scott Lurndal replied <SW2BN.153110$taff.74839@fx41.iad>:
    |Can't say that I've known a programmer who thought that way.

    OK. Yes, I am a programmer whose signed integer arithmetic does not
    overflow. I use appropriately sized types, or check values so that they
    don't overflow. I may also use unsigned types and check afterwards for wrapping, if that is the most appropriate choice. (And yes, /sometimes/ wrapping is the right thing and appropriate for the code.)

    I don't know of any programmers who think letting their signed integer arithmetic overflow is a good idea. I do know some that think wrapping
    is a good idea because they feel predictable incorrect answers are
    better than unpredictable effects.


    For comparison to C, look at the Zig language. (It's not a language I
    have used or know in detail, but I know a little about it.) Unsigned
    integer arithmetic overflow is UB in Zig, just like signed integer
    arithmetic overflow. There are standard options and block settings
    (roughly equivalent to pragmas) to control whether these give a run-time
    error, or are assumed never to happen (for optimisation purposes). And
    if you want to add "x" and "y" with wrapping, you use "x +% y" as an
    explicit choice of operator.

    That seems to me to be the right approach for an efficient high-level
    language.

    I don't know about Zig, but for a language like C, I would prefer
    types that ask for trap-on-overflow arithmetic, or for modulo
    arithmetic instead of introducing an additional operator.


    How then do you treat "x + 1" ? Should it have trapping or wrapping
    semantics, depending on the type of "x" alone? It is not the /type/
    that wraps or traps, it is the operation. In Forth, you use the
    operator to distinguish the operation - "+", ".+", "2+", etc. (assuming
    I remembered these correctly).

    You could reasonably say that if "x" is of a wrapping type and "y" is of
    a trapping type, then "x + y" is a compile-time error. Your code might
    be a little more wordy to handle explicit conversions, but it would
    certainly be a safe choice.
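
In today's C that per-operation choice can only be spelled out by hand; a
minimal sketch (the checked variant uses the gcc/clang
__builtin_add_overflow extension, so it is not portable ISO C):

  #include <limits.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* wrapping add: do the arithmetic in unsigned; the conversion back
     to int is implementation-defined, but wraps on gcc and clang */
  static int add_wrap(int x, int y)
  {
      return (int)((unsigned)x + (unsigned)y);
  }

  /* trapping add: detect the overflow explicitly */
  static int add_trap(int x, int y)
  {
      int z;
      if (__builtin_add_overflow(x, y, &z)) {
          fprintf(stderr, "signed overflow\n");
          abort();
      }
      return z;
  }

  int main(void)
  {
      printf("%d\n", add_wrap(INT_MAX, 1));   /* wraps to INT_MIN */
      printf("%d\n", add_trap(INT_MAX, 1));   /* aborts */
      return 0;
  }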

    The undefined option is just a bad idea: Wang et al <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf> found
    that it gave a measurable speedup over -fwrapv in only one loop in SPECint2006, and that speedup is only due to the I32LP64 mistake.

    I am not interested in the speed of SPECint2006 functions. I am
    interested in the speed of the code I write on the targets I use, and I
    see "int" leading to faster code than "unsigned int" (this is primarily
    on 32-bit targets these days, but occasionally on older 16-bit or 8-bit systems).

    I.e., a competently designed dialect would not have that mistake, and
    there would be no measurable speedup from the undefined option, just
    the possibility of the compiler doing something unexpected.

    If you want "int z = x + y;" with wrapping, write :

    int z = (unsigned) x + (unsigned) y;

    I'll leave that to you, not just for the following reason:

    Once upon a time someone like you suggested using some casting
    approach for getting x-1>=x (with signed x) to work as intended. I
    tried it, and the result was that gcc-3.something still "optimized" it
    to false.

    Compilers are not perfect, and that is a challenge we have to live with.
    But I don't make major stylistic decisions based on a compiler bug.


    And even then, I always put it in a
    pragma so that the code works even if someone uses different compiler flags.

    Yes, announcing such things in the source code is a good idea in
    principle. In practice newer versions of gcc tend to need more sanity
    flags (the default of newer gccs is more insanity), and the older
    versions do not understand the new flags and fail if you pass them.
    So we check every sanity flag in autoconf and use those that the gcc
version in use accepts. Doing that through pragmas filled in by
    configure does not seem to be any better than using flags through the Makefile.


    You make your choices based on how you prefer to build and distribute
    your software. So I can only say how /I/ prefer to do these things. To
    me, a pragma in the code is clearer and less prone to the risk of
    someone re-using the code elsewhere without understanding the necessary
    flags to get expected correct results.
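
For what it is worth, a sketch of the pragma approach (gcc-specific, and
the accepted option spellings should be checked against the gcc version
in use):

  /* ask gcc for -fwrapv semantics for the functions that follow,
     regardless of the command-line flags used to build the file */
  #pragma GCC optimize ("wrapv")

  int wrapping_sum(int x, int y)
  {
      return x + y;
  }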

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Sun Feb 25 22:32:35 2024
    David Brown wrote:

    On 24/02/2024 21:57, Anton Ertl wrote:

    The C
compiler people had no qualms about adding "long long" when they wanted
something bigger than 32 bits on 32-bit systems, so what should keep
    them from adding "long short"?

    They could have done that, yes.


    I saw benchmarks showing x32 being measurably faster,

    Sure, it's measurably faster. That's obviously not sufficient for
    incurring the cost of x32.


    Agreed.

    but it's not
    unlikely that the differences got less with more modern x86-64
    processors (with bigger caches)

    Doubtful. The L1 caches have not become bigger since the days of the
    K7 (1999) with its 64KB D and I caches (and the K7 actually did not
    support AMD64). There has been some growth in L2+L3 combined in
    recent years, but x32 already flopped earlier.


    But other caches have got bigger, and buses between the caches have got
    wider (have they not?).

    Yes, Opteron busses (HT) started out as 64-bits/cycle and many processors
are now sporting busses of ½ cache line width per cycle.

    and it's simply not worth the effort
    having another set of libraries and compiler targets just to make some
    kinds of code marginally faster.

    Exactly.

    For ARM, you are more likely to get away with saying "you need to
    re-compile your old code to run on this new system". In the x86 world, especially for Windows, that's a non-starter. An x86-64 processor that couldn't run x86-32 binaries would not sell well.

    I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, >>>>> picked to fit the range you need).

Sure, and then the program might break when an array has more than 2^31
elements; or it might work on one platform and break on another one.


    You need to pick the appropriate size for your data, as I said.

    In general-purpose computing, you usually don't know that size. E.g.,
    for a sort routine, do you use int_fast32_t, int_fast8_t,
    int_fast16_t, or int_fast64_t for the array size?

    How about size_t ??

    Oh, I think you often have a good idea about the size. However, I quite agree that if you don't know, you have to pick the biggest practical size.

We are in the era (now) where the size of individual datums is irrelevant;
only the sizes of agglomerations of data (arrays, structs) are relevant.
int i; signed long j; are irrelevant in terms of memory bandwidth after caching.

    (And either way, you will want to do some checking and sanitizing of
    your data when it comes in.)

    ---------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Mon Feb 26 12:50:35 2024
    On 25/02/2024 23:32, MitchAlsup1 wrote:
    David Brown wrote:

    On 24/02/2024 21:57, Anton Ertl wrote:


    I would suggest C "fast" types like int_fast32_t (or other "fast"
    sizes,
    picked to fit the range you need).

Sure, and then the program might break when an array has more than 2^31
elements; or it might work on one platform and break on another one.

    You need to pick the appropriate size for your data, as I said.

    In general-purpose computing, you usually don't know that size.  E.g.,
    for a sort routine, do you use int_fast32_t, int_fast8_t,
    int_fast16_t, or int_fast64_t for the array size?

    How about size_t ??

    Sure. That's often the appropriate choice when an object represents the
    size of something, or the number of elements in an array of completely
    unknown size.


    Oh, I think you often have a good idea about the size.  However, I
    quite agree that if you don't know, you have to pick the biggest
    practical size.

We are in the era (now) where the size of individual datums is irrelevant;
only the sizes of agglomerations of data (arrays, structs) are relevant.
int i; signed long j; are irrelevant in terms of memory bandwidth after caching.


    For individual local variables, you are right (except for
    microcontrollers, where it still matters a little - though not nearly as
    much as it used to). It's only for storage or calculations with
    aggregate data that size matters. And ironically we are now in an era
    where small sizes, such as 16-bit floats, are more popular than ever before.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Feb 26 12:41:54 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations
    (except maybe division where present) is slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount
    of cycles DP vs. SP. Excepting for cache footprint latency::
    perf(DP) == perf(SP)

Latency, I of course believe you.

    Throughput... For 32 vs. 64 bit reals, I would assume that the
    area for a unit is about twice as big for the 64-bit version.

    You can then calculate twice the number of results for a similar effort.

    Hm. AVX-512 can do calculations either on 32 16-bit, 16 32-bit or 8
64-bit reals. Does anybody know how they implemented their floating point
    units - are they separate for each precision, or do they use some clever
    (or horrible) way to use a single unit for all three?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Feb 26 19:53:45 2024
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations >>>> (except maybe division where present) is slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount >> of cycles DP vs. SP. Excepting for cache footprint latency::
    perf(DP) == perf(SP)

Latency, I of course believe you.

    Throughput... For 32 vs. 64 bit reals, I would assume that the
    area for a unit is about twice as big for the 64-bit version.

    You can then calculate twice the number of results for a similar effort.

    An FMAC that can do 1×DP or 2×SP is 15%-20% larger than an FMAC that can
only do 1×DP or 1×SP. {For those interested, it is the Augend element
    and the long adder (168-bits) that are clever and horrible at the same
    time enabling 2×SP.}

    Hm. AVX-512 can do calculations either on 32 16-bit, 16 32-bit or 8
64-bit reals. Does anybody know how they implemented their floating point
    units - are they separate for each precision, or do they use some clever
    (or horrible) way to use a single unit for all three?

    Think:: Clever and horrible at the same time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Tue Feb 27 13:43:14 2024
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations >>>> (except maybe division where present) is slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount >> of cycles DP vs. SP. Excepting for cache footprint latency::
    perf(DP) == perf(SP)

    FDIV float vs double is still quite often slightly faster, but not close
    to 2 X faster even though the mantissa is only 24/53 as large.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Feb 28 01:25:02 2024
    Terje Mathisen wrote:

    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations >>>>> (except maybe division where present) is slower in 64 bits than in 32 >>>>> bits. I would have chosen ILP64 at the time, so I can only guess at >>>>> their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount >>> of cycles DP vs. SP. Excepting for cache footprint latency::
    perf(DP) == perf(SP)

    FDIV float vs double is still quite often slightly faster, but not close
    to 2 X faster even though the mantissa is only 24/53 as large.

    DIV algorithms have the property that they double the number of accurate
    bits each iteration. This holds for both Goldschmidt and Newton-Raphson.

So instead of a 17-cycle DP FDIV one gets a 15-cycle SP FDIV.

    SRT based FDIV is a lot more linear in that doubling the bits requires
    doubling of the iteration cycles.
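
The "doubling of accurate bits" is just the quadratic convergence of the
reciprocal iteration; a small numerical sketch (ignoring the table-lookup
seed and the final rounding fix-up a real FDIV unit would use):

  #include <stdio.h>

  /* Newton-Raphson for r ~ 1/d:  r' = r * (2 - d*r).
     Each step roughly squares the relative error, i.e. doubles the
     number of correct bits, which is why SP needs only about one
     iteration fewer than DP from the same seed accuracy. */
  int main(void)
  {
      double d = 3.0;
      double r = 0.3;                 /* crude seed */
      for (int i = 0; i < 5; i++) {
          r = r * (2.0 - d * r);
          printf("iter %d: error = %.3g\n", i, r - 1.0/3.0);
      }
      return 0;
  }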

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Mon Mar 11 19:56:32 2024
    David Brown wrote:

    On 23/02/2024 20:55, MitchAlsup1 wrote:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations >>>> (except maybe division where present) is slower in 64 bits than in 32
    bits.  I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount >> of cycles DP vs. SP. Excepting for cache footprint latency:: perf(DP) ==
    perf(SP)


    That's true - except when it is not.

    It is not true when you are using vector instructions, and you can do
    twice as many SP instructions as DP instructions in the same register
    and instruction.

    You use the word vector where you mean SIMD. A CRAY-YMP doing single
    would not be twice as fast because it was designed for 1 FADD + 1 FMUL
    + 2 LD + 1 ST per cycle continuously. CRAY-YMP is the epitome of a
    Vector machine.

    The alternative word would be short-vector instead of SIMD.

    It is not true when you are using accelerators of various kinds, such as graphics card processors.

    And it is not true on smaller processors, such as in the embedded world.
    On microcontrollers with floating point hardware for single and double precision, SP can be up to twice as fast as DP. And for many of the
    more popular microcontrollers, you can have hardware SP but DP is done
    in software - the difference there is clearly massive.

    But for big processors doing non-vector adds and multiplies, DP and SP
    are usually equal in clock cycles (other than memory and cache effects).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Mar 11 20:07:10 2024
    Terje Mathisen wrote:

    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations >>>>> (except maybe division where present) is slower in 64 bits than in 32 >>>>> bits. I would have chosen ILP64 at the time, so I can only guess at >>>>> their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same amount >>> of cycles DP vs. SP. Excepting for cache footprint latency::
    perf(DP) == perf(SP)

    FDIV float vs double is still quite often slightly faster, but not close
    to 2 X faster even though the mantissa is only 24/53 as large.

On a machine where DDIV takes 17 cycles, SDIV would take 15, both using
the Goldschmidt algorithm with a Newton-Raphson fix-up at the end.
    On a machine using SRT Division, single is about twice as fast as
    double.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Tue Mar 12 11:14:47 2024
    On 11/03/2024 20:56, MitchAlsup1 wrote:
    David Brown wrote:

    On 23/02/2024 20:55, MitchAlsup1 wrote:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations >>>>> (except maybe division where present) is slower in 64 bits than in 32 >>>>> bits.  I would have chosen ILP64 at the time, so I can only guess at >>>>> their reasons:

A guess: people did not want sizeof(float) != sizeof(int). float
is certainly faster than double.

    Now, only in cache footprint. All your std operations take the same
    amount
    of cycles DP vs. SP. Excepting for cache footprint latency:: perf(DP)
    == perf(SP)


    That's true - except when it is not.

    It is not true when you are using vector instructions, and you can do
    twice as many SP instructions as DP instructions in the same register
    and instruction.

    You use the word vector where you mean SIMD.

    Yes, I was using the word somewhat interchangeably, as I was talking in
    general terms. Perhaps I should have been more precise. I know this
    thread talked about "Cray style vectors", but I thought this branch had diverged - I don't know anywhere near enough about the details of Cray
    machines to talk much about them.

    A CRAY-YMP doing single
    would not be twice as fast because it was designed for 1 FADD + 1 FMUL +
    2 LD + 1 ST per cycle continuously. CRAY-YMP is the epitome of a
    Vector machine.

    The alternative word would be short-vector instead of SIMD.

    It is not true when you are using accelerators of various kinds, such
    as graphics card processors.

    And it is not true on smaller processors, such as in the embedded
    world.   On microcontrollers with floating point hardware for single
    and double precision, SP can be up to twice as fast as DP.  And for
    many of the more popular microcontrollers, you can have hardware SP
    but DP is done in software - the difference there is clearly massive.

    But for big processors doing non-vector adds and multiplies, DP and SP
    are usually equal in clock cycles (other than memory and cache effects).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Tue Mar 12 14:44:28 2024
    On Tue, 12 Mar 2024 11:14:47 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 11/03/2024 20:56, MitchAlsup1 wrote:
    David Brown wrote:

    On 23/02/2024 20:55, MitchAlsup1 wrote:
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU
    operations (except maybe division where present) is slower in
    64 bits than in 32 bits.  I would have chosen ILP64 at the
    time, so I can only guess at their reasons:

A guess: people did not want sizeof(float) != sizeof(int).
float is certainly faster than double.

    Now, only in cache footprint. All your std operations take the
    same amount
    of cycles DP vs. SP. Excepting for cache footprint latency::
    perf(DP) == perf(SP)


    That's true - except when it is not.

    It is not true when you are using vector instructions, and you can
    do twice as many SP instructions as DP instructions in the same
    register and instruction.

    You use the word vector where you mean SIMD.

    Yes, I was using the word somewhat interchangeably, as I was talking
    in general terms. Perhaps I should have been more precise. I know
    this thread talked about "Cray style vectors", but I thought this
    branch had diverged - I don't know anywhere near enough about the
    details of Cray machines to talk much about them.


Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily imagine
vector processors of similar style that have 2 or even 3 times higher throughput for SP vs DP.
I personally never encountered such machines, but would be surprised if
they were never built and sold by one or another usual suspect (maybe
Fujitsu?) back in the days when designers liked Cray's style.

Which, of course, leaves the question of what property makes a vector
processor Cray-style. Just having an ALU/FPU several times narrower than
the VR is, IMHO, not enough to be considered Cray-style.
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operations has higher
    throughput (and hopefully, but not necessarily lower latency) than full
    vector operations of the same type.

    A CRAY-YMP doing single
    would not be twice as fast because it was designed for 1 FADD + 1
    FMUL + 2 LD + 1 ST per cycle continuously. CRAY-YMP is the epitome
    of a Vector machine.

    The alternative word would be short-vector instead of SIMD.

    It is not true when you are using accelerators of various kinds,
    such as graphics card processors.

    And it is not true on smaller processors, such as in the embedded
    world.   On microcontrollers with floating point hardware for
    single and double precision, SP can be up to twice as fast as DP.
    And for many of the more popular microcontrollers, you can have
    hardware SP but DP is done in software - the difference there is
    clearly massive.

    But for big processors doing non-vector adds and multiplies, DP
    and SP are usually equal in clock cycles (other than memory and
    cache effects).


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Mar 12 17:18:36 2024
    Michael S wrote:

    On Tue, 12 Mar 2024 11:14:47 +0100
    David Brown <david.brown@hesbynett.no> wrote:


    You use the word vector where you mean SIMD.

    Yes, I was using the word somewhat interchangeably, as I was talking
    in general terms. Perhaps I should have been more precise. I know
    this thread talked about "Cray style vectors", but I thought this
    branch had diverged - I don't know anywhere near enough about the
    details of Cray machines to talk much about them.


    Even for Cray/NEC-style vectors, the same throughput for different
    precision is not an universal property. Cray's and NEC's vector
    processors happen to be designed like that, but one can easily imagine
    vector processors of similar style that have 2 or even 3 times higher throughput for SP vs DP.
    I personally never encountered such machines, but would be surprised if
    it were never built and sold back by one or another usual suspect (may
    be, Fujitsu?) in days when designers liked Cray's style.

    While theoretically possible, they did not do this because both halves
    of a 2×SP would not arrive from memory necessarily simultaneously.
{Consider a gather load: you need a vector of addresses 2× as long
    for pairs of SP going into a single vector register element.}

    Which, of course, leaves the question of what property makes vector
    processor Cray-style. Just having ALU/FPU several times narrower than
    VR is, IMHO, not enough to be considered Cray-style.

    That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.

    In my book, the critical distinction is that at least one size of
    partial (chopped) none-load-store vector operations has higher
    throughput (and hopefully, but not necessarily lower latency) than full vector operations of the same type.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Mar 12 19:49:18 2024
    On Tue, 12 Mar 2024 17:18:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Michael S wrote:

    On Tue, 12 Mar 2024 11:14:47 +0100
    David Brown <david.brown@hesbynett.no> wrote:


    You use the word vector where you mean SIMD.

    Yes, I was using the word somewhat interchangeably, as I was
    talking in general terms. Perhaps I should have been more
    precise. I know this thread talked about "Cray style vectors",
    but I thought this branch had diverged - I don't know anywhere
    near enough about the details of Cray machines to talk much about
    them.

    Even for Cray/NEC-style vectors, the same throughput for different precision is not an universal property. Cray's and NEC's vector
    processors happen to be designed like that, but one can easily
    imagine vector processors of similar style that have 2 or even 3
    times higher throughput for SP vs DP.
    I personally never encountered such machines, but would be
    surprised if it were never built and sold back by one or another
    usual suspect (may be, Fujitsu?) in days when designers liked
    Cray's style.

    While theoretically possible, they did not do this because both halves
    of a 2×SP would not arrive from memory necessarily simultaneously.
    {Consider a gather load you need a vector of addresses 2× as long
    for pairs of SP going into a single vector register element.}


    Doctor, it hurts when I do this!
    So, what prevents you from providing no gather with resolution
    below 64 bits?

    Which, of course, leaves the question of what property makes vector processor Cray-style. Just having ALU/FPU several times narrower
    than VR is, IMHO, not enough to be considered Cray-style.

    That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.


    I don't like this definition at all.
For starters, what is "memory"? Does L1D cache count, or only L2 and
    higher?
    Then, what is "absorb" ? Is the whole VR register file part of
    absorbent or latency should be covered by one register? Is OoO machinery
    part of absorbent? Is HW threading part of absorbent? And for any of
    your possible answers I have my "Why?".

    In my book, the critical distinction is that at least one size of
    partial (chopped) none-load-store vector operations has higher
    throughput (and hopefully, but not necessarily lower latency) than
    full vector operations of the same type.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Mar 12 19:00:46 2024
    Michael S wrote:

    On Tue, 12 Mar 2024 17:18:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Michael S wrote:

    On Tue, 12 Mar 2024 11:14:47 +0100
    David Brown <david.brown@hesbynett.no> wrote:


    You use the word vector where you mean SIMD.

    Yes, I was using the word somewhat interchangeably, as I was
    talking in general terms. Perhaps I should have been more
    precise. I know this thread talked about "Cray style vectors",
    but I thought this branch had diverged - I don't know anywhere
    near enough about the details of Cray machines to talk much about
    them.

    Even for Cray/NEC-style vectors, the same throughput for different
    precision is not an universal property. Cray's and NEC's vector
    processors happen to be designed like that, but one can easily
    imagine vector processors of similar style that have 2 or even 3
    times higher throughput for SP vs DP.
    I personally never encountered such machines, but would be
    surprised if it were never built and sold back by one or another
    usual suspect (may be, Fujitsu?) in days when designers liked
    Cray's style.

    While theoretically possible, they did not do this because both halves
    of a 2×SP would not arrive from memory necessarily simultaneously.
    {Consider a gather load you need a vector of addresses 2× as long
    for pairs of SP going into a single vector register element.}


    Doctor, it hurts when I do this!
    So, what prevents you from providing no gather with resolution
    below 64 bits?

Well, then, you have SP values in a container that could hold 2 and you
    don't get any SIMD speedup.

    Which, of course, leaves the question of what property makes vector
    processor Cray-style. Just having ALU/FPU several times narrower
    than VR is, IMHO, not enough to be considered Cray-style.

    That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.


    I don't like this definition at all.
    For starter, what is "memory"? Does L1D cache count, or only L2 and
    higher?

    Those machines had no L1 or L2 (or LLC) caches. Consider the problems
    for which they were designed--arrays as big as the memory (sometimes
    bigger !!) and processed over and over again with numerical algorithms.
Caches would simply miss on each memory reference (ignoring TLB effects).
With the caches never supplying data to the calculations, why have them
    at all ??

    Then, what is "absorb" ?

    Absorb means that the first data of a vector arrives and can start
    calculation before the last address of the memory reference goes out.
    This, in turn, means that one can create a continuous stream of
outbound addresses forever and thus one can create a stream of
    continuous calculations forever. {{Where 'forever' means thousands
    of cycles but no where near the lifetime of the universe.}}

    Now, obviously, this means the memory system has to be able to make
    forward progress on all those memory accesses continuously.

    Is the whole VR register file part of
    absorbent or latency should be covered by one register?

    A single register covers a single memory reference latency.

    Is OoO machinery
    part of absorbent?

    The only OoO in the CRAYs was delivery of gather data back to the
    vector register*. Scatter stores were sent out in order, as were the
    addresses of the gather loads.

    (*) bank conflicts would delay conflicting accesses but not those
    of other banks, creating an OoO effect of returning data. This was
    re-ordered back to IO prior to forwarding data into calculation.

    Is HW threading part of absorbent?

    Absolutely not--none of the CRAYs did this--later XMPs and YMPs did
    use lanes (SIMD with vector) but always did calculations in order
    and always sent out addresses (and data when appropriate) in order.

    And for any of
    your possible answers I have my "Why?".

    No harm in asking.

    In my book, the critical distinction is that at least one size of
    partial (chopped) none-load-store vector operations has higher
    throughput (and hopefully, but not necessarily lower latency) than
    full vector operations of the same type.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Wed Mar 13 06:17:01 2024
    Michael S <already5chosen@yahoo.com> schrieb:

    Even for Cray/NEC-style vectors, the same throughput for different
    precision is not an universal property. Cray's and NEC's vector
    processors happen to be designed like that, but one can easily imagine
    vector processors of similar style that have 2 or even 3 times higher throughput for SP vs DP.
    I personally never encountered such machines, but would be surprised if
    it were never built and sold back by one or another usual suspect (may
    be, Fujitsu?) in days when designers liked Cray's style.

    I worked on such a machine, and (IIRC) single precision was faster
    on that machine. That may have been due to the comparatively
    low memory throughput of the single load/store pipeline that it had.

    But it's been a few decades, and my memory may be off (and I don't
    have any handbooks from the time).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Wed Mar 13 11:08:59 2024
    On Tue, 12 Mar 2024 19:00:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Michael S wrote:

    On Tue, 12 Mar 2024 17:18:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    While theoretically possible, they did not do this because both
    halves of a 2×SP would not arrive from memory necessarily
    simultaneously. {Consider a gather load you need a vector of
    addresses 2× as long for pairs of SP going into a single vector
    register element.}

    Doctor, it hurts when I do this!
    So, what prevents you from providing no gather with resolution
    below 64 bits?

Well, then, you have SP values in a container that could hold 2 and
    you don't get any SIMD speedup.


(1) - a need for full gather is hopefully rare. The majority of the time
things are accessed contiguously or, at worst, with non-unit strides.
    My personal rule of thumb is "if I need generic gather, most likely I
    shouldn't have been bothered with vectorizing." Of course, as every
    rule of thumb, it's imprecise.
    (2) - there are several important applications that naturally have
    pair-wise data layout. Complex numbers is just one.


    Which, of course, leaves the question of what property makes
    vector processor Cray-style. Just having ALU/FPU several times
    narrower than VR is, IMHO, not enough to be considered
    Cray-style.

    That property is that the length of the vector register is chosen
to absorb the latency to memory. SIMD is too short to have this
    property.

    I don't like this definition at all.
    For starter, what is "memory"? Does L1D cache count, or only L2 and
    higher?

    Those machines had no L1 or L2 (or LLC) caches. Consider the problems
    for which they were designed--arrays as big as the memory (sometimes
    bigger !!) and processed over and over again with numerical
    algorithms. Caches would simply miss on each memory reference
    (ignoring TLB effects) With the caches never supplying data to the calculations why have them at all ??

    Then, what is "absorb" ?

    Absorb means that the first data of a vector arrives and can start calculation before the last address of the memory reference goes out.
    This, in turn, means that one can create a continuous stream of
outbound addresses forever and thus one can create a stream of
    continuous calculations forever. {{Where 'forever' means thousands
    of cycles but no where near the lifetime of the universe.}}

    Now, obviously, this means the memory system has to be able to make
    forward progress on all those memory accesses continuously.

    Is the whole VR register file part of
    absorbent or latency should be covered by one register?

    A single register covers a single memory reference latency.

    Is OoO
    machinery part of absorbent?

    The only OoO in the CRAYs was delivery of gather data back to the
    vector register*. Scatter stores were sent out in order, as were the addresses of the gather loads.

    (*) bank conflicts would delay conflicting accesses but not those
    of other banks, creating an OoO effect of returning data. This was
    re-ordered back to IO prior to forwarding data into calculation.

    Is HW threading part of absorbent?

    Absolutely not--none of the CRAYs did this--later XMPs and YMPs did
    use lanes (SIMD with vector) but always did calculations in order
    and always sent out addresses (and data when appropriate) in order.

    And for
    any of your possible answers I have my "Why?".

    No harm in asking.


    It seems we are talking about different things.
    You are talking about Cray vectors, as done in Cray's 1/X-MP/Y-MP
    series. I.e. something fixed, known, and of interest mostly to the
    computing historians among us.
    OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
    My intention is to see what was applicable in more recent times and
    which ideas are not totally obsolete for the future.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Wed Mar 13 15:45:00 2024
    Michael S wrote:

    On Tue, 12 Mar 2024 19:00:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Michael S wrote:

    On Tue, 12 Mar 2024 17:18:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    While theoretically possible, they did not do this because both
    halves of a 2×SP would not necessarily arrive from memory
    simultaneously. {Consider a gather load: you need a vector of
    addresses 2× as long for pairs of SP going into a single vector
    register element.}

    Doctor, it hurts when I do this!
    So, what prevents you from simply not providing gather with
    resolution below 64 bits?

    Well, then, you have SP values in a container that could hold 2 and
    you don't get any SIMD speedup.


    (1) - a need for full gather is hopefully rare. The majority of the
    time things are accessed contiguously or, at worst, with non-unit
    strides.
    My personal rule of thumb is "if I need generic gather, most likely I
    shouldn't have been bothered with vectorizing." Of course, like every
    rule of thumb, it's imprecise.

    The CRAY X-MP gained considerable speedup (about 20%) on the benchmarks
    of the day after adding Scatter/Gather, and 5 more Livermore Loops
    would vectorize.

    (2) - there are several important applications that naturally have a
    pair-wise data layout. Complex numbers are just one.

    What about Quaternions--which relieve the programmer of having to
    remember which multiplications are subtracted instead of added?

    Which, of course, leaves the question of what property makes a
    vector processor Cray-style. Just having an ALU/FPU several times
    narrower than the VR is, IMHO, not enough to be considered
    Cray-style.

    That property is that the length of the vector register is chosen
    to absorb the latency to memory. SIMD is too short to have this
    property.

    I don't like this definition at all.
    For starters, what is "memory"? Does the L1D cache count, or only L2 and
    higher?

    Those machines had no L1 or L2 (or LLC) caches. Consider the problems
    for which they were designed--arrays as big as the memory (sometimes
    bigger !!) and processed over and over again with numerical
    algorithms. Caches would simply miss on each memory reference
    (ignoring TLB effects). With the caches never supplying data to the
    calculations, why have them at all??

    Then, what is "absorb" ?

    Absorb means that the first data of a vector arrives and can start
    calculation before the last address of the memory reference goes out.
    This, in turn, means that one can create a continuous stream of
    outbound addresses forever and thus one can create a stream of
    continuous calculations forever. {{Where 'forever' means thousands
    of cycles but nowhere near the lifetime of the universe.}}

    Now, obviously, this means the memory system has to be able to make
    forward progress on all those memory accesses continuously.

    Is the whole VR register file part of
    the absorbent, or should the latency be covered by one register?

    A single register covers a single memory reference latency.

    Is OoO
    machinery part of the absorbent?

    The only OoO in the CRAYs was delivery of gather data back to the
    vector register*. Scatter stores were sent out in order, as were the
    addresses of the gather loads.

    (*) bank conflicts would delay conflicting accesses but not those
    of other banks, creating an OoO effect in the returning data. This was
    re-ordered back into order prior to forwarding data into calculation.

    Is HW threading part of the absorbent?

    Absolutely not--none of the CRAYs did this--later XMPs and YMPs did
    use lanes (SIMD with vector) but always did calculations in order
    and always sent out addresses (and data when appropriate) in order.

    And for
    any of your possible answers I have my "Why?".

    No harm in asking.


    It seems we are talking about different things.
    You are talking about Cray vectors, as done in Cray's 1/X-MP/Y-MP
    series. I.e. something fixed, known, and of interest mostly to the
    computing historians among us.

    Fair enough, but it remains my model for how to discuss vector calculations.

    OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
    My intention is to see what was applicable in more recent times and
    which ideas are not totally obsolete for the future.

    Only after you figure out a way to feed 2 LDs and consume 1 ST per cycle
    continuously (cache miss or cache hit; TLB miss or TLB hit) are you in a
    position to utilize a CRAY-like vector-register architecture effectively.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Mar 13 15:02:12 2024
    OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
    My intention is to see what was applicable in more recent times and
    which ideas are not totally obsolete for the future.

    Another way to look at the difference between SSE-style vectors (which
    I'd call "short vectors") and Cray-style vectors at the ISA level is
    the fact that SSE-style
    vector instructions are designed under the assumption that the latency
    of a vector instruction will be more or less the same as that of
    a non-vector instruction (i.e. you have enough ALUs to do all the
    operations at the same time), whereas Cray-style vector instructions
    (which we could call "long vectors") are designed under the assumption
    that the latency will be somewhat proportional to the length of the
    vector because the core of the CPU will only access a chunk of the
    vector at a time.

    So, short vectors have a fairly free hand at shuffling data across their
    vector (e.g. bitmatrix transpose), and they can be implemented/scheduled/dispatched just like any other instruction, but
    the vector length tends to be severely limited and exposed all over
    the place.

    In contrast long vectors usually depend on specialized implementations
    (e.g. chaining) to get good performance, but their length is
    easier/cheaper to change.
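    A small C sketch of that contrast (plain scalar loops stand in for
    the vector operations; vl_max is a hypothetical hardware parameter,
    not any particular ISA's): with short vectors the width 4 leaks into
    the stride and the tail code, while the strip-mined long-vector form
    only ever asks how many elements it may take on this trip.

        #include <stddef.h>

        /* "Short vector" style: the width (4 floats) is baked into the
           stride and the scalar tail; changing the HW width means
           rewriting the loop. */
        void saxpy_fixed4(float *y, const float *x, float a, size_t n)
        {
            size_t i = 0;
            for (; i + 4 <= n; i += 4)          /* width 4 hard-coded  */
                for (size_t j = 0; j < 4; j++)  /* one 4-wide SIMD op  */
                    y[i + j] += a * x[i + j];
            for (; i < n; i++)                  /* scalar tail         */
                y[i] += a * x[i];
        }

        /* "Long vector" / strip-mined style: only the maximum vector
           length appears, and it can change without touching the code. */
        void saxpy_stripmined(float *y, const float *x, float a,
                              size_t n, size_t vl_max)
        {
            for (size_t i = 0; i < n; ) {
                size_t vl = (n - i < vl_max) ? n - i : vl_max;
                for (size_t j = 0; j < vl; j++) /* one op of length vl */
                    y[i + j] += a * x[i + j];
                i += vl;
            }
        }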

    AFAICT long vectors made sense when we could build machines where
    memory bandwidth was relatively high and ALUs were more expensive.
    Nowadays we tend to have the opposite.

    Also, the massive number of transistors we spend nowadays on OoO means
    that a good OoO CPU can dispatch individual non-vector instructions to
    ALUs just as well as the Cray did with its vectors with chaining.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Mar 13 19:24:15 2024
    Stefan Monnier wrote:

    OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
    My intention is to see what was applicable in more recent times and
    which ideas are not totally obsolete for the future.

    Another way to look at the difference between SSE-style vectors (which
    I'd call "short vectors") and Cray-style vectors at the ISA level is
    the fact that SSE-style
    vector instructions are designed under the assumption that the latency
    of a vector instruction will be more or less the same as that of
    a non-vector instruction (i.e. you have enough ALUs to do all the
    operations at the same time),

    In the early-to-mid 1980s there was a class of processor assist engines
    using the term Array Processor that performed a Cray-like vector in
    the same latency as a scalar operation. CRAY streamed single operands
    through FUs; Array Processors took entire <but shorter> vectors through
    lanes of calculation.

    I would call these "medium vector" to distinguish from (short vector)
    SIMD and (long vector) CRAY {or just vector without qualifier}. So we
    have::
    SIMD <short> vector
    ARRAY <medium> vector
    CRAY <long> vector
    CDC <memory> vector

    whereas Cray-style vector instructions
    (which we could call "long vectors") are designed under the assumption
    that the latency will be somewhat proportional to the length of the
    vector because the core of the CPU will only access a chunk of the
    vector at a time.

    So, short vectors have a fairly free hand at shuffling data across their vector (e.g. bitmatrix transpose), and they can be implemented/scheduled/dispatched just like any other instruction, but
    the vector length tends to be severely limited and exposed all over
    the place.

    Consuming OpCode space like nobody's business.

    In contrast long vectors usually depend on specialized implementations
    (e.g. chaining) to get good performance, but their length is
    easier/cheaper to change.

    The only limitation is when one masks out beats of the vector from
    calculation or memory referencing. This is what kept CRAY at 64-element
    vectors. It was also the Achilles heel of CRAY--once memory gets more
    than 64 beats away, the length of the vector can no longer absorb the
    latency to memory. NEC did not have this problem.

    AFAICT long vectors made sense when we could build machines where
    memory bandwidth was relatively high and ALUs were more expensive.
    Nowadays we tend to have the opposite.

    While BW is important (very), it is latency that is crucial. Latency
    to memory must be smaller than the vector length.
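    A back-of-envelope model of that inequality (the cycle counts below
    are made up for illustration, not measured on any CRAY): if addresses
    issue one per clock and element i of a load returns after L cycles,
    calculation overlaps the tail of the address stream only while L
    stays below the vector length.

        #include <stdio.h>

        int main(void)
        {
            const int vector_length = 64;            /* CRAY-style VL  */
            const int latencies[]   = {25, 64, 120}; /* hypothetical   */

            /* Toy model: addresses go out one per cycle; element i of
               the load returns at cycle i + L.  The vector "absorbs"
               the latency only if L < vector_length, i.e. the first
               result is back before the last address has gone out. */
            for (int k = 0; k < 3; k++) {
                int L = latencies[k];
                printf("latency %3d cycles: %s\n", L,
                       L < vector_length ? "absorbed by a 64-long vector"
                                         : "not absorbed (pipe drains)");
            }
            return 0;
        }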

    Also, the massive number of transistors we spend nowadays on OoO means
    that a good OoO CPU can dispatch individual non-vector instructions to
    ALUs just as well as the Cray did with its vectors with chaining.

    Not "just as well" but "within spitting distance of"

    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stefan Monnier on Thu Mar 14 12:58:33 2024
    On Wed, 13 Mar 2024 15:02:12 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    OTOH, I am trying to discuss a vague notion of "Cray-style vectors".
    My intention is to see what was applicable in more recent times
    and which ideas are not totally obsolete for the future.

    Another way to look at the difference between SSE-style vectors (which
    I'd call "short vectors") and Cray-style vectors at the ISA level is
    the fact that SSE-style
    vector instructions are designed under the assumption that the latency
    of a vector instruction will be more or less the same as that of
    a non-vector instruction (i.e. you have enough ALUs to do all the
    operations at the same time), whereas Cray-style vector instructions
    (which we could call "long vectors") are designed under the assumption
    that the latency will be somewhat proportional to the length of the
    vector because the core of the CPU will only access a chunk of the
    vector at a time.


    I sympathize with your direction, but want to point out that
    historically, in the x86 world, implementations of one or another SIMD
    instruction set that had twice the throughput for scalar FP ops vs
    full-width FP ops were and are very common. As to latency, many of
    these implementations had (and may still have?) one cycle longer
    latency for scalar vs full-width.

    Similar throughput differences can be seen on a few aarch64 processors.
    E.g. on the Cortex-A75, where the Q-form of common FP arithmetic
    instructions has half the throughput of the scalar and D-forms, but
    latency is not affected.
    I'd guess that the same applies to LITTLE Arm cores, but can't be sure,
    because a Cortex-A53 Software Optimization Guide does not exist and the
    Cortex-A55 Software Optimization Guide is not very comprehensible.


    So, short vectors have a fairly free hand at shuffling data across
    their vector (e.g. bitmatrix transpose), and they can be implemented/scheduled/dispatched just like any other instruction, but
    the vector length tends to be severely limited and exposed all over
    the place.

    In contrast long vectors usually depend on specialized implementations
    (e.g. chaining) to get good performance, but their length is
    easier/cheaper to change.

    AFAICT long vectors made sense when we could build machines where
    memory bandwidth was relatively high and ALUs were more expensive.
    Nowadays we tend to have the opposite.


    16x DP FMA (dual-issue 512-bit), as in server-class Intel cores, is
    still expensive--less so in area, more so in its effect on max. power
    consumption.
    As to memory, I find this argument false, at least in its original form
    often stated here by Mitch.
    There are many important algorithms, both in dense linear algebra and
    in digital signal processing, where it is easily possible to reduce the
    memory-read-to-DP-FMA ratio to 5-6 bytes/DP-FMA. With 32 software-visible
    vector registers and more programmer effort the ratio
    could be pushed further down, sometimes below 4 bytes.
    I know little about other computationally intensive fields (apart
    from dense linear algebra and DSP, that is), but would expect that if
    not 5-6 then at least 8-9 bytes/DP-FMA (or the equivalents for lower
    precisions) are not out of reach in other fields as well.
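    A rough register-blocking count behind that claim (illustrative only,
    no particular machine assumed): if an MR x NR block of accumulators
    lives in vector registers, each k-step of a dense matrix multiply
    loads MR + NR doubles and performs MR * NR FMAs, so the traffic is
    8*(MR+NR)/(MR*NR) bytes per DP-FMA.

        #include <stdio.h>

        /* Operand traffic per DP-FMA for an MR x NR register-blocked
           GEMM inner loop: each k-step loads MR + NR doubles (8 bytes
           each) and performs MR * NR fused multiply-adds into
           register-resident accumulators. */
        static double bytes_per_fma(int mr, int nr)
        {
            return 8.0 * (mr + nr) / (mr * nr);
        }

        int main(void)
        {
            const int blocks[][2] = { {2, 2}, {4, 4}, {8, 6} };
            for (int i = 0; i < 3; i++)
                printf("%dx%d block: %.2f bytes/DP-FMA\n",
                       blocks[i][0], blocks[i][1],
                       bytes_per_fma(blocks[i][0], blocks[i][1]));
            return 0;   /* 2x2 -> 8.00, 4x4 -> 4.00, 8x6 -> 2.33 */
        }

    Already a 4x4 block lands around 4 bytes/DP-FMA, which is consistent
    with the 5-6 figure above once less-ideal blocking and the traffic
    for the result matrix are included.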

    Also, the massive number of transistors we spend nowadays on OoO means
    that a good OoO CPU can dispatch individual non-vector instructions to
    ALUs just as well as the Cray did with its vectors with chaining.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Fri Mar 15 10:16:37 2024
    MitchAlsup1 wrote:
    Michael S wrote:
    Doctor, it hurts when I do this!
    So, what prevents you from simply not providing gather with
    resolution below 64 bits?

    Well, then, you have SP values in a container that could hold 2 and you
    don't get any SIMD speedup.

    I've looked at this issue since Larrabee (which was the first CPU I had
    access to that supported gather):

    For a 32-bit gather in a 64-bit environment, the only good solution I've
    been able to come up with is to use a 64-bit base register and then
    require all the sources to be within 4 GB of that base, so that you can
    use the 32-bit wide gather addresses as indices/offsets from that base.

    <dest_vector_reg> = gather(base_reg, src_vector_reg)

    It would be up to the implementer to decide if the src_vector_reg is
    used as signed or unsigned offsets from the base. It is also possible to
    extend addressability by having the offsets be scaled by the element
    size, so effectively

    dest[i] = [base+src[i]*4]

    for single precision values.
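    A scalar C model of that addressing form (function and parameter
    names are invented here; a real ISA would do this one vector lane at
    a time): a 64-bit base, a vector of 32-bit offsets treated as
    unsigned, and an optional scale by the element size so the reach
    grows from 4 GB to 16 GB.

        #include <stdint.h>

        /* dest[i] = *(float *)(base + (uint64_t)src[i] * scale)
           With scale == 1 the offsets are byte offsets (4 GB reach);
           with scale == 4 they are element indices (16 GB reach).
           A signed-offset variant would sign-extend src[i] instead. */
        static void gather32_from_base(float *dest, const void *base,
                                       const uint32_t *src, int vl,
                                       int scale)
        {
            const char *b = (const char *)base;
            for (int i = 0; i < vl; i++)
                dest[i] = *(const float *)(b + (uint64_t)src[i] * scale);
        }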

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Paul A. Clayton on Fri Mar 15 14:01:09 2024
    On Thu, 14 Mar 2024 23:56:04 -0400
    "Paul A. Clayton" <paaronclayton@gmail.com> wrote:

    On 3/13/24 3:24 PM, MitchAlsup1 wrote:
    Stefan Monnier wrote:
    [snip]
    So, short vectors have a fairly free hand at shuffling data across
    their vector (e.g. bitmatrix transpose), and they can be
    implemented/scheduled/dispatched just like any other instruction,
    but the vector length tends to be severely limited and exposed all
    over the place.

    Consuming OpCode space like nobody's business.

    Is that necessarily the case? Excluding the shuffle operations, I
    think only loads and stores would need to have length specifiers.
    Shuffle operations become much more expensive with larger
    'vectors', so providing the same granularity of shuffle for larger
    vectors seems questionable. (With scatter/gather, permute/shuffle
    may be less useful anyway.)

    <snip>

    Am I missing something when I assume that lane-based (SIMD)
    operations do not need size information in the instruction? The
    extra metadata is not free (perhaps especially as that controls
    execution at least for efficiency), but if opcode explosion is so
    undesirable using metadata might be preferred.

    I would guess that Mitch operates under the assumption that we are still
    talking about an ISA that is very similar to the CRAY Y-MP's.
    I.e. non-load-store, but more like AVX512, where one of the source
    operands could be in memory.
    That is, I have never seen CRAY Y-MP ISA docs, but would imagine that it
    was like that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Fri Mar 15 20:03:33 2024
    Paul A. Clayton wrote:

    On 3/13/24 3:24 PM, MitchAlsup1 wrote:
    Stefan Monnier wrote:
    [snip]
    So, short vectors have a fairly free hand at shuffling data across their vector (e.g. bitmatrix transpose), and they can be
    implemented/scheduled/dispatched just like any other instruction, but
    the vector length tends to be severely limited and exposed all over
    the place.

    Consuming OpCode space like nobody's business.

    Is that necessarily the case? Excluding the shuffle operations, I
    think only loads and stores would need to have length specifiers.

    Add signed byte, add unsigned byte, add signed half, add unsigned half,
    add signed int, add unsigned int, add long, add 2 floats, add double

    compared to ADD integer, add float and add double.

    And then there is the addsub group, too.

    It may be possible to avoid OpCode explosion, but neither x86 nor ARM
    got anywhere close to avoiding the deluge of OpCodes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Fri Mar 15 20:11:36 2024
    MitchAlsup1 wrote:

    Paul A. Clayton wrote:

    On 3/13/24 3:24 PM, MitchAlsup1 wrote:
    Stefan Monnier wrote:
    [snip]
    So, short vectors have a fairly free hand at shuffling data across their vector (e.g. bitmatrix transpose), and they can be
    implemented/scheduled/dispatched just like any other instruction, but
    the vector length tends to be severely limited and exposed all over
    the place.

    Consuming OpCode space like nobody's business.

    Is that necessarily the case? Excluding the shuffle operations, I
    think only loads and stores would need to have length specifiers.

    Add signed byte, add unsigned byte, add signed half, add unsigned half,
    add signed int, add unsigned int, add long, add 2 floats, add double

    All of the above come in 64-bit, 128-bit, 256-bit, and 512-bit variants.

    compared to ADD integer, add float and add double.

    And then there is the addsub group, too.

    It may be possible to avoid OpCode explosion, but neither x86 nor ARM
    got anywhere close to avoiding the deluge of OpCodes.
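    To put a rough number on the deluge, counting only the adds listed
    above (nothing assumed beyond this post): nine element forms times
    four register widths.

        #include <stdio.h>

        int main(void)
        {
            const int element_forms = 9; /* the add variants listed above */
            const int widths        = 4; /* 64-, 128-, 256-, 512-bit      */
            printf("SIMD adds alone: %d opcodes, vs 3 scalar adds\n",
                   element_forms * widths);
            return 0;
        }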

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)