• Little program to test concurrency of .fetch_add and .compare_exchange_

    From Bonita Montero@21:1/5 to All on Sun Dec 5 18:13:07 2021
    I've written a little program that tests the throughput of fetch_add
    on an increasing number of processors in your system and, if you choose,
    it prints the throughput of compare_exchange_weak instead. On my Ryzen
    Threadripper 3990X (Zen2) / Windows 10 (SMT disabled) the fetch_add
    timings are linear with an increasing number of threads, and the
    compare_exchange_weak timings are linear in the beginning but become
    exponential at the end.

    I'd like to see your results:

    #include <iostream>
    #include <cstring>
    #include <atomic>
    #include <charconv>
    #include <thread>
    #include <vector>
    #include <semaphore>
    #include <chrono>
    #include <algorithm>
    #include <functional>

    using namespace std;
    using namespace chrono;

    int main( int argc, char **argv )
    {
        if( argc < 2 )
            return EXIT_FAILURE;
        bool xchg = strcmp( argv[1], "xchg" ) == 0;
        if( argc - xchg < 2 )
            return EXIT_FAILURE;
        auto parseValue = []( char const *str ) -> unsigned
        {
            unsigned value;
            from_chars_result fcr = from_chars( str, str + strlen( str ), value );
            if( fcr.ec != errc() || *fcr.ptr )
                return -1;
            return value;
        };
        unsigned fromThreads, toThreads;
        if( argc - xchg == 2 )
            if( (fromThreads = toThreads = parseValue( argv[1 + xchg] )) == -1 )
                return EXIT_FAILURE;
            else;
        else
            if( (fromThreads = parseValue( argv[1 + xchg] )) == -1
                || (toThreads = parseValue( argv[2 + xchg] )) == -1 )
                return EXIT_FAILURE;
        unsigned hc = thread::hardware_concurrency();
        hc = hc ? hc : toThreads;
        toThreads = toThreads <= hc ? toThreads : hc;
        fromThreads = fromThreads <= hc ? fromThreads : hc;
        if( fromThreads > toThreads )
            swap( fromThreads, toThreads );
        for( unsigned nThreads = fromThreads; nThreads <= toThreads; ++nThreads )
        {
            atomic_uint readyCountDown( nThreads );
            binary_semaphore semReady( 0 );
            counting_semaphore semRun( 0 );
            atomic_uint synch( nThreads );
            atomic_uint64_t aui64;
            atomic_uint64_t nsSum( 0 );
            auto theThread = [&]( function<void()> &addFn, size_t n )
            {
                if( readyCountDown.fetch_sub( 1, memory_order_relaxed ) == 1 )
                    semReady.release();
                semRun.acquire();
                if( synch.fetch_sub( 1, memory_order_relaxed ) != 1 )
                    while( synch.load( memory_order_relaxed ) );
                auto start = high_resolution_clock::now();
                for( ; n; --n )
                    addFn();
                nsSum.fetch_add( (int64_t)duration_cast<nanoseconds>(
                    high_resolution_clock::now() - start ).count(), memory_order_relaxed );
            };
            vector<jthread> threads;
            threads.reserve( nThreads );
            static size_t const TURNS = 10'000'000;
            auto fetchAddFn = [&]() { aui64.fetch_add( 1, memory_order_relaxed ); };
            auto cmpXchgFn = [&]()
            {
                uint64_t ref = aui64.load( memory_order_relaxed );
                while( !aui64.compare_exchange_weak( ref, ref + 1, memory_order_relaxed ) );
            };
            function<void()> xchgFn;
            if( !xchg )
                xchgFn = bind( fetchAddFn );
            else
                xchgFn = bind( cmpXchgFn );
            for( unsigned t = 0; t != nThreads; ++t )
                threads.emplace_back( theThread, xchgFn ), TURNS );
            semReady.acquire();
            semRun.release( nThreads );
            for( jthread &thr : threads )
                thr.join();
            double ns = (double)(int64_t)nsSum.load( memory_order_relaxed );
            ns = ns / ((double)TURNS * (int)nThreads);
            cout << ns << endl;
        }
    }

    The timings are important for every kind of synchronization on your PC.
    The program can be called like this:
    ./a.out <n-threads>                      - tests fetch_add with n-threads
    ./a.out <from-threads> <to-threads>      - tests fetch_add ranging from from-threads to to-threads
    ./a.out xchg <n-threads>                 - tests compare_exchange_weak with n-threads
    ./a.out xchg <from-threads> <to-threads> - tests compare_exchange_weak ranging from from-threads to to-threads

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Bonita Montero on Sun Dec 5 19:56:04 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:


    for( unsigned t = 0; t != nThreads; ++t )
    threads.emplace_back( theThread, xchgFn ), TURNS );

    ?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Ben Bacarisse on Sun Dec 5 15:22:36 2021
    On 12/5/2021 11:56 AM, Ben Bacarisse wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:


    for( unsigned t = 0; t != nThreads; ++t )
    threads.emplace_back( theThread, xchgFn ), TURNS );

    ?


    It seems way too complicated. To test fetch_add vs compare_exchange just:

    spawn T_N threads.

    each thread performs N fetch_add operations on a global counter.

    join the threads.

    Give a time.

    vs.


    spawn T_N threads.

    each thread performs N compare_exchange operations that increments a
    global counter. Basically using CAS to build a fetch_add.

    join the threads.

    Give a time.



    In my experience fetch_add always beats a CAS-loop on x86.
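
    For concreteness, a bare-bones sketch of that simpler test might look like
    the following (thread count, iteration count and all names here are just
    illustrative example values, not taken from the program above):

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Spawn T_N threads, each performing N increments of one shared counter,
    // join them and report the wall-clock time for the whole run.
    int main()
    {
        constexpr unsigned T_N = 8;
        constexpr std::uint64_t N = 10'000'000;
        std::atomic<std::uint64_t> counter{ 0 };

        auto fetchAddInc = [&] { counter.fetch_add( 1, std::memory_order_relaxed ); };
        auto casInc = [&]
        {
            // build fetch_add out of compare_exchange
            std::uint64_t ref = counter.load( std::memory_order_relaxed );
            while( !counter.compare_exchange_weak( ref, ref + 1, std::memory_order_relaxed ) );
        };

        auto run = [&]( auto inc, char const *name )
        {
            counter = 0;
            auto start = std::chrono::steady_clock::now();
            std::vector<std::thread> threads;
            for( unsigned t = 0; t != T_N; ++t )
                threads.emplace_back( [&, inc] { for( std::uint64_t i = 0; i != N; ++i ) inc(); } );
            for( std::thread &thr : threads )
                thr.join();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start ).count();
            std::cout << name << ": " << ms << " ms for " << T_N << " x " << N << " increments\n";
        };

        run( fetchAddInc, "fetch_add" );
        run( casInc, "compare_exchange_weak" );
    }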

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Chris M. Thomasson on Sun Dec 5 23:43:47 2021
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

    On 12/5/2021 11:56 AM, Ben Bacarisse wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    for( unsigned t = 0; t != nThreads; ++t )
    threads.emplace_back( theThread, xchgFn ), TURNS );
    ?

    It seems way too complicated.

    In case it was not clear, my comment was about the syntax error.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Ben Bacarisse on Sun Dec 5 19:31:12 2021
    On 12/5/2021 3:43 PM, Ben Bacarisse wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

    On 12/5/2021 11:56 AM, Ben Bacarisse wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    for( unsigned t = 0; t != nThreads; ++t )
    threads.emplace_back( theThread, xchgFn ), TURNS );
    ?

    It seems way too complicated.

    In case it was not clear, my comment was about the syntax error.


    Oh ouch! I did not even notice it. You must be referencing this:

    threads.emplace_back( theThread, xchgFn ), TURNS );

    The parentheses don't balance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Dec 6 06:27:53 2021
    So, here's the corrected version (C++20):

    #include <iostream>
    #include <cstring>
    #include <atomic>
    #include <charconv>
    #include <thread>
    #include <vector>
    #include <semaphore>
    #include <chrono>
    #include <algorithm>
    #include <functional>

    using namespace std;
    using namespace chrono;

    int main( int argc, char **argv )
    {
        if( argc < 2 )
            return EXIT_FAILURE;
        bool xchg = strcmp( argv[1], "xchg" ) == 0;
        if( argc - xchg < 2 )
            return EXIT_FAILURE;
        auto parseValue = []( char const *str ) -> unsigned
        {
            unsigned value;
            from_chars_result fcr = from_chars( str, str + strlen( str ), value );
            if( fcr.ec != errc() || *fcr.ptr )
                return -1;
            return value;
        };
        unsigned fromThreads, toThreads;
        if( argc - xchg == 2 )
            if( (fromThreads = toThreads = parseValue( argv[1 + xchg] )) == -1 )
                return EXIT_FAILURE;
            else;
        else
            if( (fromThreads = parseValue( argv[1 + xchg] )) == -1
                || (toThreads = parseValue( argv[2 + xchg] )) == -1 )
                return EXIT_FAILURE;
        unsigned hc = thread::hardware_concurrency();
        hc = hc ? hc : toThreads;
        toThreads = toThreads <= hc ? toThreads : hc;
        fromThreads = fromThreads <= hc ? fromThreads : hc;
        if( fromThreads > toThreads )
            swap( fromThreads, toThreads );
        for( unsigned nThreads = fromThreads; nThreads <= toThreads; ++nThreads )
        {
            atomic_uint readyCountDown( nThreads );
            binary_semaphore semReady( 0 );
            counting_semaphore semRun( 0 );
            atomic_uint synch( nThreads );
            atomic_uint64_t aui64;
            atomic_uint64_t nsSum( 0 );
            auto theThread = [&]( function<void()> &addFn, size_t n )
            {
                if( readyCountDown.fetch_sub( 1, memory_order_relaxed ) == 1 )
                    semReady.release();
                semRun.acquire();
                if( synch.fetch_sub( 1, memory_order_relaxed ) != 1 )
                    while( synch.load( memory_order_relaxed ) );
                auto start = high_resolution_clock::now();
                for( ; n; addFn(), --n );
                nsSum.fetch_add( (uint64_t)duration_cast<nanoseconds>(
                    high_resolution_clock::now() - start ).count(), memory_order_relaxed );
            };
            vector<jthread> threads;
            threads.reserve( nThreads );
            static size_t const TURNS = 10'000'000;
            auto fetchAddFn = [&]() { aui64.fetch_add( 1, memory_order_relaxed ); };
            auto cmpXchgFn = [&]()
            {
                uint64_t ref = aui64.load( memory_order_relaxed );
                while( !aui64.compare_exchange_weak( ref, ref + 1, memory_order_relaxed ) );
            };
            function<void()> xchgFn;
            if( !xchg )
                xchgFn = bind( fetchAddFn );
            else
                xchgFn = bind( cmpXchgFn );
            for( unsigned t = 0; t != nThreads; ++t )
                threads.emplace_back( theThread, ref( xchgFn ), TURNS );
            semReady.acquire();
            semRun.release( nThreads );
            for( jthread &thr : threads )
                thr.join();
            double ns = (double)(int64_t)nsSum.load( memory_order_relaxed );
            ns = ns / ((double)TURNS * (int)nThreads);
            cout << nThreads << "\t" << ns << endl;
        }
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Dec 6 06:24:59 2021
    On 05.12.2021 at 20:56, Ben Bacarisse wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:


    for( unsigned t = 0; t != nThreads; ++t )
    threads.emplace_back( theThread, xchgFn ), TURNS );

    ?

    Remove one ).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Dec 6 06:26:49 2021
    On 06.12.2021 at 00:22, Chris M. Thomasson wrote:

    It seems way too complicated. To test fetch_add vs compare_exchange just:

    There's nothing complicated about my test.

    In my experience fetch_add always beats a CAS-loop on x86.

    Of course, because it never fails.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Bonita Montero on Mon Dec 6 10:45:00 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 05.12.2021 at 20:56, Ben Bacarisse wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    for( unsigned t = 0; t != nThreads; ++t )
    threads.emplace_back( theThread, xchgFn ), TURNS );
    ?

    Remove one ).

    The code does not compile with any of the three )s removed.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Dec 6 13:09:29 2021
    On 06.12.2021 at 11:45, Ben Bacarisse wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    On 05.12.2021 at 20:56, Ben Bacarisse wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    for( unsigned t = 0; t != nThreads; ++t )
    threads.emplace_back( theThread, xchgFn ), TURNS );
    ?

    Remove one ).

    The code does not compile with any of the three )s removed.

    Take the latest code I've posted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Mon Dec 6 18:02:57 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    I've written a little program that tests the throughput of fetch_add
    on an increasing number of processors in your system and, if you choose,

    This is _highly_ microarchitecture-dependent.

    Some processors will acquire the cacheline into the nearest
    cache to ensure exclusive access for the add, while
    others will pass the entire operation to the last-level cache
    where it is executed atomically.

    In the former case, scaling will be very bad on large core counts.

    In the latter case, scaling will be quite good with large core counts,
    at least on a single-socket system. On a multi-socket system, this
    breaks down somewhat as the LLC on each socket will compete for the
    cache line.

    In any case, a couple dozen line assembler program would be a
    far better test than your overly complicated C++.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Tue Dec 7 06:27:08 2021
    On 06.12.2021 at 19:02, Scott Lurndal wrote:

    Some processors will acquire the cacheline into the nearest
    cache to ensure exclusive access for the add, while
    others will pass the entire operation to the last-level cache
    where it is executed atomically.

    There's for sure no architecture that does atomic operations in
    the last level cache because this would be silly.

    In any case, a couple dozen line assembler program would be a
    far better test than your overly complicated C++.

    No, it wouldn't give better results, and the code would be
    orders of magnitude longer if it did the same.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Dec 7 18:06:59 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 06.12.2021 at 19:02, Scott Lurndal wrote:

    Some processors will acquire the cacheline into the nearest
    cache to ensure exclusive access for the add, while
    others will pass the entire operation to the last-level cache
    where it is executed atomically.

    There's for sure no architecture that does atomic operations in
    the last level cache because this would be silly.

    Well, are you sure? Why do you think it would be silly?

    https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf

    Given that at least three high-end processor chips have taped out just
    this year with the capability of executing "far" atomic operations in
    the LLC (or to a PCI Express Root complex host bridge), I think you really don't have
    a clue what you are talking about.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Tue Dec 7 19:14:53 2021
    On 07.12.2021 at 19:06, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 06.12.2021 at 19:02, Scott Lurndal wrote:

    Some processors will acquire the cacheline into the nearest
    cache to ensure exclusive access for the add, while
    others will pass the entire operation to the last-level cache
    where it is executed atomically.

    There's for sure no architecture that does atomic operations in
    the last level cache because this would be silly.

    Well, are you sure? Why do you think it would be silly?

    https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf

    Given that at least three high-end processor chips have taped out just
    this year with the capability of executing "far" atomic operations in
    the LLC (or to a PCI Express Root complex host bridge), I think you
    really don't have a clue what you are talking about.

    And which CPUs currently support this Gen-Z interconnect?
    And which CPUs currently use these far atomics for thread
    synchronization? None.
    Did you really read the paper and note what Gen-Z is?
    No.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Dec 7 18:49:11 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:06, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 06.12.2021 at 19:02, Scott Lurndal wrote:

    Some processors will acquire the cacheline into the nearest
    cache to ensure exclusive access for the add, while
    others will pass the entire operation to the last-level cache
    where it is executed atomically.

    There's for sure no architecture that does atomic operations in
    the last level cache because this would be silly.

    Well, are you sure? Why do you think it would be silly?

    https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf
    Given that at least three high-end processor chips have taped out just
    this year with the capability of executing "far" atomic operations in
    the LLC (or to a PCI Express Root complex host bridge), I think you
    really don't have a clue what you are talking about.

    And which CPUs currently support this Gen-Z interconnect?

    I'd tell you, but various NDAs forbid.

    And which CPUs currently use these far atomics for thread
    synchronization? None.

    How do you know? I'm aware of three. Two sampling to
    customers, with core counts from 8 to 64. A handful of others
    are in development by several processor vendors as I
    write this.

    Did you really read the paper and note what Gen-Z is?

    I know exactly what it is, and I know what CXL is as well,
    both being part of my day job. And if you don't think Intel
    is designing all of their server CPUs to be CXL [*] compatible,
    you're not thinking.

    [*] "In November 2021 the CXL Consortium and the GenZ Consortium
    signed a letter of intent for Gen-Z to transfer its specifications
    and assets to CXL, leaving CXL as the sole industry standard moving
    forward"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Tue Dec 7 19:54:59 2021
    On 07.12.2021 at 19:49, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:06, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 06.12.2021 at 19:02, Scott Lurndal wrote:

    Some processors will acquire the cacheline into the nearest
    cache to ensure exclusive access for the add, while
    others will pass the entire operation to the last-level cache
    where it is executed atomically.

    There's for sure no architecture that does atomic operations in
    the last level cache because this would be silly.

    Well, are you sure? Why do you think it would be silly?

    https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf

    Given that at least three high-end processor chips have taped out just
    this year with the capability of executing "far" atomic operations in
    the LLC (or to a PCI Express Root complex host bridge), I think you
    really don't have a clue what you are talking about.

    And which CPUs currently support this Gen-Z interconnect?

    I'd tell you, but various NDA's forbid.

    LOOOOOOOL.

    And which CPUs currently use these far atomics for thread
    synchronization? None.

    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    I'm aware of three. Two sampling to customers, with core
    counts from 8 to 64.

    And you can't tell because of NDAs. Hrhr.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Dec 7 19:14:40 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:49, Scott Lurndal wrote:


    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    Hello, it's a cache-coherent multiprocessor. You need to
    fetch it exclusively into the L1 first, so instead of sending the fetch
    (or invalidate if converting a shared line to owned),
    you send the atomic op and it gets handled atomically at
    the far end (e.g. LLC, PCI express device, SoC coprocessor)
    saving the interconnect (mesh, ring, whatever) bandwidth and
    the round-trip time between L1 and LLC and reducing contention
    for the line.

    If it's already in the L1 cache, then the processor will
    automatically treat it as a near-atomic, this is expected
    to be a rare case with correctly designed atomic usage.
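
    For what it's worth, here is a minimal C++ sketch of the two operations under
    discussion. The comments about instruction selection assume GCC or Clang
    targeting AArch64 with LSE atomics (-march=armv8.1-a); whether the single
    atomic instruction is then executed "near" (in L1) or "far" (at the LLC or
    interconnect) is up to the microarchitecture:

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> counter{ 0 };

    // With LSE atomics this typically compiles to a single LDADD, which a
    // far-atomic-capable core can forward to the LLC/interconnect instead of
    // pulling the cache line into L1 in exclusive state.
    void add_one()
    {
        counter.fetch_add( 1, std::memory_order_relaxed );
    }

    // The equivalent CAS loop cannot be forwarded as one operation: the value
    // has to be read first, and the CAS is retried whenever another core got
    // in between, so it degrades under contention.
    void add_one_cas()
    {
        std::uint64_t expected = counter.load( std::memory_order_relaxed );
        while( !counter.compare_exchange_weak( expected, expected + 1,
                                               std::memory_order_relaxed ) )
            ; // expected is refreshed with the current value on failure
    }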

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Scott Lurndal on Tue Dec 7 19:25:31 2021
    scott@slp53.sl.home (Scott Lurndal) writes:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:49, Scott Lurndal wrote:


    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    Hello, it's a cache-coherent multiprocessor. You need to
    fetch it exclusively into the L1 first, so instead of sending the fetch
    (or invalidate if converting a shared line to owned),
    you send the atomic op and it gets handled atomically at
    the far end (e.g. LLC, PCI express device, SoC coprocessor)
    saving the interconnect (mesh, ring, whatever) bandwidth and
    the round-trip time between L1 and LLC and reducing contention
    for the line.

    If it's already in the L1 cache, then the processor will
    automatically treat it as a near-atomic, this is expected
    to be a rare case with correctly designed atomic usage.

    In case you need a public reference for a shipping processor:

    https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Wed Dec 8 09:08:01 2021
    On 07/12/2021 20:14, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:49, Scott Lurndal wrote:


    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    Hello, it's a cache-coherent multiprocessor. You need to
    fetch it exclusively into the L1 first, so instead of sending the fetch
    (or invalidate if converting a shared line to owned),
    you send the atomic op and it gets handled atomically at
    the far end (e.g. LLC, PCI express device, SoC coprocessor)
    saving the interconnect (mesh, ring, whatever) bandwidth and
    the round-trip time between L1 and LLC and reducing contention
    for the line.

    If it's already in the L1 cache, then the processor will
    automatically treat it as a near-atomic, this is expected
    to be a rare case with correctly designed atomic usage.


    This is such an obvious improvement that I am constantly amazed how long
    it has taken to be implemented. Using ordinary memory for atomic
    operations, locks, etc., is massively inefficient compared to a
    dedicated hardware solution.

    I've used a multi-core embedded microcontroller with a semaphore block, consisting of a number (16, IIRC) of individual semaphores. Each of
    these was made of two 16-bit parts - the lock tag and the value. You
    can only change the value if you have the lock, and you get the lock by
    writing a non-zero tag when the tag is currently 0 (unlocked). You
    release it by writing your tag with the high bit set. It is all very
    simple, and extremely fast - no need to go through caches, snooping, or
    any of that nonsense because it is dedicated and connected close to the
    cpu's core buses.
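
    A rough sketch of how software might drive that kind of semaphore block
    (the register layout, names and base address are purely illustrative, not
    taken from any real datasheet):

    #include <cstdint>

    // Hypothetical memory-mapped layout: 16 semaphores, each a 16-bit lock tag
    // plus a 16-bit value that only the current tag holder may change.
    struct HwSemaphore
    {
        volatile std::uint16_t tag;    // 0 = unlocked, otherwise the owner's tag
        volatile std::uint16_t value;
    };

    HwSemaphore *const SEM = reinterpret_cast<HwSemaphore *>( 0x40001000 ); // made-up base address

    // The hardware only accepts the tag write if the semaphore is currently
    // unlocked (tag == 0), so writing and reading back tells us whether we won.
    bool tryLock( unsigned i, std::uint16_t myTag )
    {
        SEM[i].tag = myTag;
        return SEM[i].tag == myTag;
    }

    // Release by writing our own tag with the high bit set.
    void unlock( unsigned i, std::uint16_t myTag )
    {
        SEM[i].tag = static_cast<std::uint16_t>( myTag | 0x8000u );
    }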

    Obviously in a "big" system you need to handle more than two cores (and
    with the Gen-Z and CXL systems, other bus masters), support larger
    numbers of locks, and security is a rather different matter! But the
    principle of having dedicated hardware, memory mapped but not passing
    through caches and slow external memory, is the same.

    Atomic operations carried out by the core on memory in the L1 caches
    will be fast as long as there are no conflicts, but you wouldn't bother
    with atomics unless there /were/ a risk of conflict. And then they get
    slow. With a "far atomics" solution, you should be able to get much
    more consistent timings and efficient results.

    (At least, that is my understanding of it, without having actually used
    them!)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Wed Dec 8 09:50:30 2021
    On 07.12.2021 at 20:25, Scott Lurndal wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:49, Scott Lurndal wrote:


    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    Hello, it's a cache-coherent multiprocessor. You need to
    fetch it exclusively into the L1 first, so instead of sending the fetch
    (or invalidate if converting a shared line to owned),
    you send the atomic op and it gets handled atomically at
    the far end (e.g. LLC, PCI express device, SoC coprocessor)
    saving the interconnect (mesh, ring, whatever) bandwidth and
    the round-trip time between L1 and LLC and reducing contention
    for the line.

    If it's already in the L1 cache, then the processor will
    automatically treat it as a near-atomic, this is expected
    to be a rare case with correctly designed atomic usage.

    In case you need a public reference for a shipping processor:

    https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system

    That's not a processor implementing this Gen-Z interconnect and
    its atomic facilities. This is just an optimization for a special
    kind of processor architecture.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Scott Lurndal on Wed Dec 8 00:59:34 2021
    On 12/7/2021 11:25 AM, Scott Lurndal wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:49, Scott Lurndal wrote:


    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    Hello, it's a cache-coherent multiprocessor. You need to
    fetch it exclusively into the L1 first, so instead of sending the fetch
    (or invalidate if converting a shared line to owned),
    you send the atomic op and it gets handled atomically at
    the far end (e.g. LLC, PCI express device, SoC coprocessor)
    saving the interconnect (mesh, ring, whatever) bandwidth and
    the round-trip time between L1 and LLC and reducing contention
    for the line.

    If it's already in the L1 cache, then the processor will
    automatically treat it as a near-atomic, this is expected
    to be a rare case with correctly designed atomic usage.

    In case you need a public reference for a shipping processor:

    https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system



    You have encountered the rabbit hole of Bonita! I have proved her/it
    wrong several times. No good, goes nowhere.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Wed Dec 8 15:41:50 2021
    On 08.12.2021 at 09:59, Chris M. Thomasson wrote:
    On 12/7/2021 11:25 AM, Scott Lurndal wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 07.12.2021 at 19:49, Scott Lurndal wrote:


    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    Hello, it's a cache-coherent multiprocessor.   You need to
    fetch it exclusively into the L1 first, so instead of sending the fetch
    (or invalidate if converting a shared line to owned),
    you send the atomic op and it gets handled atomically at
    the far end (e.g. LLC, PCI express device, SoC coprocessor)
    saving the interconnect (mesh, ring, whatever) bandwidth and
    the round-trip time between L1 and LLC and reducing contention
    for the line.

    If it's already in the L1 cache, then the processor will
    automatically treat it as a near-atomic, this is expected
    to be a rare case with correctly designed atomic usage.

    In case you need a public reference for a shipping processor:

    https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system




    You have encountered the rabbit hole of Bonita! I have proved her/it
    wrong several times. No good, goes nowhere.

    What he links isn't proof of what he says.
    The above CPU doesn't implement the mentioned interconnect. It's
    just a minor improvement for this special kind of CPU architecture
    to speed up lock-flipping with concurrent cores.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Wed Dec 8 15:48:55 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 08.12.2021 at 09:59, Chris M. Thomasson wrote:

    You have encountered the rabbit hole of Bonita! I have proved her/it
    wrong several times. No good, goes nowhere.

    What he links isn't a proof for what he says.

    As you note Chris, Christof/Bonita cannot admit he
    was wrong.

    The above CPU doesn't implement the mentioned interconnect.

    Of course not, ARM doesn't make CPUs. They provide the IP
    used to make real CPUs; for example the Amazon AWS Graviton 2 and 3.

    Yet, ARM does provide interconnect IP which fully supports
    near and far atomics.

    Some of the current Neoverse N2 licensees are listed here:

    https://www.design-reuse.com/news/49872/arm-neoverse-n2-v1-platform.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Wed Dec 8 17:35:58 2021
    On 08.12.2021 at 16:48, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 08.12.2021 at 09:59, Chris M. Thomasson wrote:

    You have encountered the rabbit hole of Bonita! I have proved her/it
    wrong several times. No good, goes nowhere.

    What he links isn't a proof for what he says.

    As you note Chris, Christof/Bonita cannot admit he
    was wrong.

    The above CPU doesn't implement the mentioned interconnect.

    Of course not, ARM doesn't make CPUs. They provide the IP
    used to make real CPUs; for example the Amazon AWS Graviton 2 and 3.

    Yet, ARM does provide interconnect IP which fully supports
    near and far atomics.

    They're not far in the sense of the mentioned interconnect.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Öö Tiib@21:1/5 to Chris M. Thomasson on Wed Dec 8 10:57:57 2021
    On Wednesday, 8 December 2021 at 10:59:51 UTC+2, Chris M. Thomasson wrote:
    On 12/7/2021 11:25 AM, Scott Lurndal wrote:
    sc...@slp53.sl.home (Scott Lurndal) writes:
    Bonita Montero <Bonita....@gmail.com> writes:
    On 07.12.2021 at 19:49, Scott Lurndal wrote:


    How do you know?

    Because this would be slower since the lock-modifications
    wouldn't be done in the L1-caches but in far memory. That's
    just a silly idea.

    Hello, it's a cache-coherent multiprocessor. You need to
    fetch it exclusively into the L1 first, so instead of sending the fetch
    (or invalidate if converting a shared line to owned),
    you send the atomic op and it gets handled atomically at
    the far end (e.g. LLC, PCI express device, SoC coprocessor)
    saving the interconnect (mesh, ring, whatever) bandwidth and
    the round-trip time between L1 and LLC and reducing contention
    for the line.

    If it's already in the L1 cache, then the processor will
    automatically treat it as a near-atomic, this is expected
    to be a rare case with correctly designed atomic usage.

    In case you need a public reference for a shipping processor:

    https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system

    You have encountered the rabbit hole of Bonita! I have proved her/it
    wrong several times. No good, goes nowhere.

    But that is what comp.lang.c++ is. Whenever you come here, BM is
    present, wrong (or not even wrong), desperately trying to obscure
    it by snipping out of context, removing attributions, misrepresenting
    what others wrote, moving goalposts, etc. Keeping hearth and home
    warm. ;-D

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Manfred@21:1/5 to All on Thu Dec 9 00:27:58 2021
    On 12/8/2021 7:57 PM, Öö Tiib wrote:
    removing attributions

    At least this part appears to have improved as of late. For all the
    rest, there's still a long way to go, but never lose hope ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Juha Nieminen@21:1/5 to ootiib@hot.ee on Thu Dec 9 06:30:15 2021
    Öö Tiib <ootiib@hot.ee> wrote:
    You have encountered the rabbit hole of Bonita! I have proved her/it
    wrong several times. No good, goes nowhere.

    But that is what comp.lang.c++ is. Whenever you come here, BM is
    present, wrong (or not even wrong), desperately trying to obscure
    it by snipping out of context, removing attributions, misrepresenting
    what others wrote, moving goalposts, etc. Keeping hearth and home
    warm. ;-D

    The same pattern repeats again and again and again. I can't decide if
    it's amusing or tiresome.

    At least she doesn't use as many insults or as derogatory a tone anymore
    towards people who are just trying to help, so I suppose that's an
    improvement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)