• DCAS-supplement ?

    From Bonita Montero@21:1/5 to All on Sun Sep 19 18:13:46 2021
    I just wanted to test if my compiler recognizes ...
    atomic<pair<uintptr_t, uintptr_t>
    ... and uses DCASes on it or if it supplements the pair with a usual
    lock. So I first wrote a minimal program with a single noinline-function
    which does a compare and swap on a atomic pair. The compiler generated
    a lot of code for this which I didn't understand, and I didn't saw any LOCK-prefixed instructions in that. I was curisous about whether the implementation places a lock within the atomic and printed a sizeof
    of the above atomic pair: 24 on a 64-bit-platform, so obviously without
    a lock.
    At last I wrote a program which increments both portions of a single
    atomic pair by all the threads my system has (Ryzen Threadripper 64
    core, Win10, SMT off) a predefined number of times. Then I calculated
    the time for each increment in nanoseconds. The time is rather high,
    about 20.000ns for each successful increment, so it first looked to
    me if there was a lock I overlooked; so this couldn't be true with
    a sizeof of this atomic of 24 bytes. And when I saw at the Processs
    Viewer I saw that all 64 cores were nearly at 100% _user_ CPU time
    all the time - so there couldn't be any kernel-interventions.
    So is there anyone here smarter than me and can identify what this DCAS-substitute does from the assembly-dump ?

    Here's my test-code:

    #include <iostream>
    #include <atomic>
    #include <utility>
    #include <thread>
    #include <mutex>
    #include <condition_variable>
    #include <chrono>

    using namespace std;
    using namespace chrono;

    using uintptr_pair = pair<uintptr_t, uintptr_t>;
    using atomic_pair = atomic<uintptr_pair>;

    int main()
    {
    cout << "atomic<pair<uintptr_t, uintptr_t>>: " << sizeof(atomic_pair) << endl;
    atomic_pair ap( uintptr_pair( 0, 0 ) );
    mutex mtx;
    unsigned nThreads = thread::hardware_concurrency();
    unsigned ready = nThreads;
    condition_variable cvReady;
    bool run = false;
    condition_variable cvRun;
    atomic_int64_t sumDur = 0;
    auto theThread = [&]( size_t n )
    {
    unique_lock<mutex> lock( mtx );
    if( !--ready )
    cvReady.notify_one();
    cvRun.wait( lock, [&]() -> bool { return run; } );
    lock.unlock();
    auto start = high_resolution_clock::now();
    uintptr_pair cmp = ap.load( memory_order_relaxed );
    for( ; n--; )
    while( !ap.compare_exchange_weak( cmp, uintptr_pair( cmp.first + 1,
    cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
    sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
    lock.lock();
    };
    vector<jthread> threads;
    threads.reserve( nThreads );
    static size_t const ROUNDS = 100'000;
    for( unsigned t = nThreads; t--; )
    threads.emplace_back( theThread, ROUNDS );
    unique_lock<mutex> lock( mtx );
    cvReady.wait( lock, [&]() -> bool { return !ready; } );
    run = true;
    cvRun.notify_all();
    lock.unlock();
    for( jthread &thr : threads )
    thr.join();
    cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
    uintptr_pair p = ap.load( memory_order_relaxed );
    cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Bonita Montero on Sun Sep 19 15:29:40 2021
    On 9/19/2021 9:13 AM, Bonita Montero wrote:
    I just wanted to test if my compiler recognizes ...
        atomic<pair<uintptr_t, uintptr_t>
    ... and uses DCASes on it or if it supplements the pair with a usual
    lock.

    [...]

    Can you get this to compile on your MSVC?
    _________________________
    #include <iostream>
    #include <atomic>
    #include <xmmintrin.h>

    int main()
    {
    std::atomic<__int64> foo_64;
    std::atomic<__m128> foo_128;

    std::cout << sizeof(__int64) << "\n";
    std::cout << sizeof(__m128) << "\n";

    std::cout << foo_64.is_lock_free() << "\n";
    std::cout << foo_128.is_lock_free() << "\n";

    return 0;
    }
    _________________________


    In 64-bit mode I am getting:

    8
    16
    1
    0

    This means that this compiler is not supporting lock-free operations on
    128-bit numbers in 64-bit mode, humm. What do you get? Also, you might
    try to see if your system supports the CMPXCHG16B instruction. If so,
    code DWCAS manually in assembly language. Now, in 32 bit mode, I also get:

    8
    16
    1
    0

    This means that DWCAS _is_ supported in a lock-free manner in 32 bit
    mode through CMPXCHG8B. So, try to use CMPXCHG16B directly if you want
    DWCAS in 64-bit mode.

    I have not tried this experiment in GCC. It might work with __int128,
    and end up using:

    https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

    __atomic_compare_exchange_16, you might have to use -mcx16.

    Hope that helps a bit!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Sep 20 07:03:45 2021
    Am 20.09.2021 um 00:29 schrieb Chris M. Thomasson:
    On 9/19/2021 9:13 AM, Bonita Montero wrote:
    I just wanted to test if my compiler recognizes ...
         atomic<pair<uintptr_t, uintptr_t>
    ... and uses DCASes on it or if it supplements the pair with a usual
    lock.

    [...]

    Can you get this to compile on your MSVC?
    _________________________
    #include <iostream>
    #include <atomic>
    #include <xmmintrin.h>

    int main()
    {
        std::atomic<__int64> foo_64;
        std::atomic<__m128> foo_128;

        std::cout << sizeof(__int64) << "\n";
        std::cout << sizeof(__m128) << "\n";

        std::cout << foo_64.is_lock_free() << "\n";
        std::cout << foo_128.is_lock_free() << "\n";

        return 0;
    }
    _________________________


    That's not what I wanted to test.
    Read this thread:

    https://stackoverflow.com/questions/69245183/dcas-alternative-with-no-help-of-the-kernel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Sep 20 07:26:05 2021
    Am 20.09.2021 um 07:19 schrieb Chris M. Thomasson:
    On 9/19/2021 10:03 PM, Bonita Montero wrote:
    Am 20.09.2021 um 00:29 schrieb Chris M. Thomasson:
    On 9/19/2021 9:13 AM, Bonita Montero wrote:
    I just wanted to test if my compiler recognizes ...
         atomic<pair<uintptr_t, uintptr_t>
    ... and uses DCASes on it or if it supplements the pair with a usual
    lock.

    [...]

    Can you get this to compile on your MSVC?
    _________________________
    #include <iostream>
    #include <atomic>
    #include <xmmintrin.h>

    int main()
    {
         std::atomic<__int64> foo_64;
         std::atomic<__m128> foo_128;

         std::cout << sizeof(__int64) << "\n";
         std::cout << sizeof(__m128) << "\n";

         std::cout << foo_64.is_lock_free() << "\n";
         std::cout << foo_128.is_lock_free() << "\n";

         return 0;
    }
    _________________________


    That's not what I wanted to test.
    Read this thread:

    https://stackoverflow.com/questions/69245183/dcas-alternative-with-no-help-of-the-kernel


    A quote:

    "I just wanted to test if my compiler recognizes ...and uses DCASes on
    it or if it supplements the pair with a usual lock"

    Well, is_lock_free() should help?

    I put the compare_exchangE_weak-code into a noinline function:

    struct uip_pair
    {
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
    first( first ),
    second( second )
    {
    }
    uintptr_t first, second;
    };

    using atomic_pair = atomic<uip_pair>;

    #if defined(_MSC_VER)
    #define NOINLINE __declspec(noinline)
    #elif defined(__GNUC__) || defined(__clang__)
    #define NOINLINE __attribute((noinline))
    #endif

    NOINLINE
    bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
    {
    return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
    }

    ?cmpXchg@@YA_NAEAU?$atomic@Uuip_pair@@@std@@AEAUuip_pair@@U3@@Z PROC
    cmpXchg, COMDAT
    mov eax, 1
    mov r10, rcx
    mov r9d, eax
    xchg DWORD PTR [rcx], eax
    test eax, eax
    je SHORT $LN14@cmpXchg
    $LL13@cmpXchg:
    mov eax, DWORD PTR [rcx]
    test eax, eax
    je SHORT $LN16@cmpXchg
    $LL15@cmpXchg:
    mov eax, r9d
    test r9d, r9d
    je SHORT $LN38@cmpXchg
    npad 1
    $LL19@cmpXchg:
    pause
    sub eax, 1
    jne SHORT $LL19@cmpXchg
    cmp r9d, 64
    jl SHORT $LN38@cmpXchg
    lea r9d, QWORD PTR [rax+64]
    jmp SHORT $LN22@cmpXchg
    $LN38@cmpXchg:
    add r9d, r9d
    $LN22@cmpXchg:
    mov eax, DWORD PTR [rcx]
    test eax, eax
    jne SHORT $LL15@cmpXchg
    $LN16@cmpXchg:
    mov eax, 1
    xchg DWORD PTR [rcx], eax
    test eax, eax
    jne SHORT $LL13@cmpXchg
    $LN14@cmpXchg:
    mov rax, QWORD PTR [rcx+8]
    sub rax, QWORD PTR [rdx]
    jne SHORT $LN39@cmpXchg
    mov rax, QWORD PTR [rcx+16]
    sub rax, QWORD PTR [rdx+8]
    $LN39@cmpXchg:
    test rax, rax
    sete al
    test al, al
    je SHORT $LN6@cmpXchg
    movups xmm0, XMMWORD PTR [r8]
    movups XMMWORD PTR [rcx+8], xmm0
    xor ecx, ecx
    xchg DWORD PTR [r10], ecx
    ret 0
    $LN6@cmpXchg:
    movups xmm0, XMMWORD PTR [rcx+8]
    xor ecx, ecx
    movups XMMWORD PTR [rdx], xmm0
    xchg DWORD PTR [r10], ecx
    ret 0
    ?cmpXchg@@YA_NAEAU?$atomic@Uuip_pair@@@std@@AEAUuip_pair@@U3@@Z ENDP

    Maybe someone here understands what the above assembly-code does.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Bonita Montero on Sun Sep 19 22:40:56 2021
    On 9/19/2021 10:26 PM, Bonita Montero wrote:
    Am 20.09.2021 um 07:19 schrieb Chris M. Thomasson:
    On 9/19/2021 10:03 PM, Bonita Montero wrote:
    Am 20.09.2021 um 00:29 schrieb Chris M. Thomasson:
    On 9/19/2021 9:13 AM, Bonita Montero wrote:
    I just wanted to test if my compiler recognizes ...
         atomic<pair<uintptr_t, uintptr_t>
    ... and uses DCASes on it or if it supplements the pair with a usual >>>>> lock.

    [...]

    Can you get this to compile on your MSVC?
    _________________________
    #include <iostream>
    #include <atomic>
    #include <xmmintrin.h>

    int main()
    {
         std::atomic<__int64> foo_64;
         std::atomic<__m128> foo_128;

         std::cout << sizeof(__int64) << "\n";
         std::cout << sizeof(__m128) << "\n";

         std::cout << foo_64.is_lock_free() << "\n";
         std::cout << foo_128.is_lock_free() << "\n";

         return 0;
    }
    _________________________


    That's not what I wanted to test.
    Read this thread:

    https://stackoverflow.com/questions/69245183/dcas-alternative-with-no-help-of-the-kernel


    A quote:

    "I just wanted to test if my compiler recognizes ...and uses DCASes on
    it or if it supplements the pair with a usual lock"

    Well, is_lock_free() should help?

    I put the compare_exchangE_weak-code into a noinline function:

    struct uip_pair
    {
        uip_pair() = default;
        uip_pair( uintptr_t first, uintptr_t second ) :
            first( first ),
            second( second )
        {
        }
        uintptr_t first, second;
    };

    using atomic_pair = atomic<uip_pair>;

    #if defined(_MSC_VER)
        #define NOINLINE __declspec(noinline)
    #elif defined(__GNUC__) || defined(__clang__)
        #define NOINLINE __attribute((noinline))
    #endif

    NOINLINE
    bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
    {
        return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
    }

    ?cmpXchg@@YA_NAEAU?$atomic@Uuip_pair@@@std@@AEAUuip_pair@@U3@@Z PROC
    cmpXchg, COMDAT
        mov    eax, 1
        mov    r10, rcx
        mov    r9d, eax
        xchg   DWORD PTR [rcx], eax
        test   eax, eax
        je     SHORT $LN14@cmpXchg
    $LL13@cmpXchg:
        mov    eax, DWORD PTR [rcx]
        test   eax, eax
        je     SHORT $LN16@cmpXchg
    $LL15@cmpXchg:
        mov    eax, r9d
        test   r9d, r9d
        je     SHORT $LN38@cmpXchg
        npad   1
    $LL19@cmpXchg:
        pause
        sub    eax, 1
        jne    SHORT $LL19@cmpXchg
        cmp    r9d, 64
        jl     SHORT $LN38@cmpXchg
        lea    r9d, QWORD PTR [rax+64]
        jmp    SHORT $LN22@cmpXchg
    $LN38@cmpXchg:
        add    r9d, r9d
    $LN22@cmpXchg:
        mov    eax, DWORD PTR [rcx]
        test   eax, eax
        jne    SHORT $LL15@cmpXchg
    $LN16@cmpXchg:
        mov    eax, 1
        xchg   DWORD PTR [rcx], eax
        test   eax, eax
        jne    SHORT $LL13@cmpXchg
    $LN14@cmpXchg:
        mov    rax, QWORD PTR [rcx+8]
        sub    rax, QWORD PTR [rdx]
        jne    SHORT $LN39@cmpXchg
        mov    rax, QWORD PTR [rcx+16]
        sub    rax, QWORD PTR [rdx+8]
    $LN39@cmpXchg:
        test   rax, rax
        sete   al
        test   al, al
        je     SHORT $LN6@cmpXchg
        movups xmm0, XMMWORD PTR [r8]
        movups XMMWORD PTR [rcx+8], xmm0
        xor    ecx, ecx
        xchg   DWORD PTR [r10], ecx
        ret    0
    $LN6@cmpXchg:
        movups xmm0, XMMWORD PTR [rcx+8]
        xor    ecx, ecx
        movups XMMWORD PTR [rdx], xmm0
        xchg   DWORD PTR [r10], ecx
        ret    0 ?cmpXchg@@YA_NAEAU?$atomic@Uuip_pair@@@std@@AEAUuip_pair@@U3@@Z ENDP

    Maybe someone here understands what the above assembly-code does.


    Need to examine it, however there is a way to make a lock without using
    LOCK'ed anything in x86, x64. BTW, did you know that xchg has an implied
    LOCK prefix? So, it looks locked to me. What does is_lock_free say?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Bonita Montero on Sun Sep 19 22:19:55 2021
    On 9/19/2021 10:03 PM, Bonita Montero wrote:
    Am 20.09.2021 um 00:29 schrieb Chris M. Thomasson:
    On 9/19/2021 9:13 AM, Bonita Montero wrote:
    I just wanted to test if my compiler recognizes ...
         atomic<pair<uintptr_t, uintptr_t>
    ... and uses DCASes on it or if it supplements the pair with a usual
    lock.

    [...]

    Can you get this to compile on your MSVC?
    _________________________
    #include <iostream>
    #include <atomic>
    #include <xmmintrin.h>

    int main()
    {
         std::atomic<__int64> foo_64;
         std::atomic<__m128> foo_128;

         std::cout << sizeof(__int64) << "\n";
         std::cout << sizeof(__m128) << "\n";

         std::cout << foo_64.is_lock_free() << "\n";
         std::cout << foo_128.is_lock_free() << "\n";

         return 0;
    }
    _________________________


    That's not what I wanted to test.
    Read this thread:

    https://stackoverflow.com/questions/69245183/dcas-alternative-with-no-help-of-the-kernel


    A quote:

    "I just wanted to test if my compiler recognizes ...and uses DCASes on
    it or if it supplements the pair with a usual lock"

    Well, is_lock_free() should help?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Sep 20 07:53:43 2021
    I put the compare_exchangE_weak-code into a noinline function:

    struct uip_pair
    {
        uip_pair() = default;
        uip_pair( uintptr_t first, uintptr_t second ) :
            first( first ),
            second( second )
        {
        }
        uintptr_t first, second;
    If I add ...
    uintptr_t i, j, k;
    ... here the structure is still only 8 bytes larger than it
    should be. Maybe MSVC does something like STM on this structure.
    };

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Bonita Montero on Mon Sep 20 16:02:40 2021
    On 9/19/2021 10:53 PM, Bonita Montero wrote:
    I put the compare_exchangE_weak-code into a noinline function:

    struct uip_pair
    {
         uip_pair() = default;
         uip_pair( uintptr_t first, uintptr_t second ) :
             first( first ),
             second( second )
         {
         }
         uintptr_t first, second;
    If I add ...
           uintptr_t i, j, k;
    ... here the structure is still only 8 bytes larger than it
    should be. Maybe MSVC does something like STM on this structure.
    };

    I don't know why they just don't use CMPXCHG16B directly. Look at the disassembly of:

    https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange128

    on your system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Tue Sep 21 09:31:03 2021
    Am 21.09.2021 um 01:02 schrieb Chris M. Thomasson:
    On 9/19/2021 10:53 PM, Bonita Montero wrote:
    I put the compare_exchangE_weak-code into a noinline function:

    struct uip_pair
    {
         uip_pair() = default;
         uip_pair( uintptr_t first, uintptr_t second ) :
             first( first ),
             second( second )
         {
         }
         uintptr_t first, second;
    If I add ...
            uintptr_t i, j, k;
    ... here the structure is still only 8 bytes larger than it
    should be. Maybe MSVC does something like STM on this structure.
    };

    I don't know why they just don't use CMPXCHG16B directly. Look at the disassembly of:

    https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange128

    Can you please respond according to what I wrote ??? It's
    about the assembly of the code I posted and not this intrinsic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Bonita Montero on Tue Sep 21 16:09:38 2021
    On 9/21/2021 12:31 AM, Bonita Montero wrote:
    Am 21.09.2021 um 01:02 schrieb Chris M. Thomasson:
    On 9/19/2021 10:53 PM, Bonita Montero wrote:
    I put the compare_exchangE_weak-code into a noinline function:

    struct uip_pair
    {
         uip_pair() = default;
         uip_pair( uintptr_t first, uintptr_t second ) :
             first( first ),
             second( second )
         {
         }
         uintptr_t first, second;
    If I add ...
            uintptr_t i, j, k;
    ... here the structure is still only 8 bytes larger than it
    should be. Maybe MSVC does something like STM on this structure.
    };

    I don't know why they just don't use CMPXCHG16B directly. Look at the
    disassembly of:

    https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange128


    Can you please respond according to what I wrote ??? It's
    about the assembly of the code I posted and not this intrinsic.

    You do it. Its a pain to converse with you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)