• Some help needed

    From Bonita Montero@21:1/5 to All on Sat Sep 25 16:41:12 2021
    I've developed a monitor object like that of Java for C++, with some
    improvements. The major improvement is that there's not only a spin-
    loop for locking and unlocking but also for waiting on an event. In
    this case you don't have to lock the mutex but supply a predicate to
    a wait_poll function; the code repeatedly tries to lock the mutex by
    polling, and if it can lock the mutex it calls the predicate, which
    returns (or moves) a pair of a bool and the result type.
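    The wait_poll mechanism described here might look roughly like the
    following sketch (the name wait_poll and the pair-returning predicate
    follow the description above; std::this_thread::yield() stands in for
    the PAUSE-based spinning discussed later in the thread):

    ```cpp
    #include <mutex>
    #include <thread>
    #include <utility>

    // Sketch of the described interface: poll the mutex with try_lock();
    // when it is acquired, call the predicate, which returns a pair of
    // (done, result). Keep spinning until the predicate reports done.
    template<typename Result, typename Pred>
    Result wait_poll( std::mutex &mtx, Pred pred )
    {
        for( ;; )
        {
            if( mtx.try_lock() )
            {
                std::pair<bool, Result> r = pred();
                mtx.unlock();
                if( r.first )
                    return std::move( r.second );
            }
            std::this_thread::yield(); // placeholder for the PAUSE spin
        }
    }
    ```

    A waiter would pass a predicate that inspects the shared state under
    the briefly held lock and hands back its result once the condition
    holds.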
    Waiting for a semaphore or an event object (Win32) in the kernel can
    easily take from 1,000 to 10,000 clock cycles, even when the call
    returns immediately because the semaphore or event has been set
    before. So there has to be a spin count with a reasonable relationship
    to this waiting interval, e.g. spinning one tenth of the minimum
    interval spent in the kernel.
    With my monitor object I've taken the spin-count recalculation
    algorithm from glibc. And I'm also using the PAUSE instruction. But I
    think that glibc induces too heavy cacheline-flipping by re-loading
    the mutex flags immediately after a single PAUSE instruction. So I
    decided to loop PAUSE several times and to take fewer spinning
    iterations in exchange.
    To get a reasonable number of PAUSE spins I need the time PAUSE takes
    on different processors. On my CPU, PAUSE halts the pipeline for only
    about 0.78 nanoseconds, which is about 3.25 clock cycles on average.
    I've written a short program that repeatedly PAUSEs and takes the
    average time. I want to encourage you to compile the code on your
    machine and post the PAUSE timing it outputs here.

    This is the code:

    #include <iostream>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>

    using namespace std;
    using namespace chrono;

    int main( int argc, char **argv )
    {
        static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
        auto start = high_resolution_clock::now();
        for( uint64_t i = PAUSE_ROUNDS; i; --i )
            _mm_pause();
        double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
        cout << ns << endl;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paavo Helde@21:1/5 to All on Sat Sep 25 19:51:50 2021
    On 25.09.2021 17:41, Bonita Montero wrote:
    I've developed a monitor-object like that of Java for C++ with some
    improvements. [...]

    On my computer this outputs:

    34.9194

    (Intel Xeon E-2286M CPU @ 2.40 GHz)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bo Persson@21:1/5 to Bonita Montero on Sat Sep 25 19:05:54 2021
    On 2021-09-25 at 16:41, Bonita Montero wrote:
    I've developed a monitor-object like that of Java for C++ with some
    improvements. [...]

    I get

    30.7635

    (Core i9 9900K 5GHz)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Sat Sep 25 19:01:43 2021
    On 25.09.2021 at 18:51, Paavo Helde wrote:
    On 25.09.2021 17:41, Bonita Montero wrote:
    I've developed a monitor-object like that of Java for C++ with some
    improvements. [...]
    On my computer this outputs:

    34.9194

    That seems much more reasonable to me than the < 4 clock cycles on my
    PC (Ryzen Threadripper 3990X).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Sat Sep 25 19:14:26 2021
    On 25.09.2021 at 19:05, Bo Persson wrote:

    [...]

    I get

    30.7635

    (Core i9 9900K  5GHz)

    Why did AMD decide on such an idiotic timing? It is neither suitable
    for spin-loops with only a single PAUSE instruction, nor does it save
    power while spinning.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Branimir Maksimovic@21:1/5 to Bonita Montero on Sat Sep 25 17:15:08 2021
    On 2021-09-25, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    I've developed a monitor-object like that of Java for C++ with some improvements. The major improvement is that there's not only a spin
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
    _mm_pause();
    What's in _mm_pause ?

    --

    7-77-777
    Evil Sinner!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Sat Sep 25 19:16:21 2021
    On 25.09.2021 at 19:15, Branimir Maksimovic wrote:
    On 2021-09-25, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    I've developed a monitor-object like that of Java for C++ with some
    improvements. The major improvement is that there's not only a spin
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
    _mm_pause();
    What's in _mm_pause ?

    Read again what I wrote or google for _mm_pause.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Branimir Maksimovic@21:1/5 to Bonita Montero on Sat Sep 25 17:29:59 2021
    On 2021-09-25, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 25.09.2021 at 19:15, Branimir Maksimovic wrote:
    On 2021-09-25, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    I've developed a monitor-object like that of Java for C++ with some
    improvements. The major improvement is that there's not only a spin
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
    _mm_pause();
    What's in _mm_pause ?

    Read again what I wrote or google for _mm_pause.
    On my system it is:
    #include <iostream>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>

    using namespace std;
    using namespace chrono;

    int main( int argc, char **argv )
    {
        static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
        auto start = high_resolution_clock::now();
        for( uint64_t i = PAUSE_ROUNDS; i; --i )
            __asm__ __volatile__("isb\n");
            // _mm_pause();
        double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
        cout << ns << endl;
    }



    --

    7-77-777
    Evil Sinner!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Branimir Maksimovic@21:1/5 to Paavo Helde on Sat Sep 25 17:20:45 2021
    On 2021-09-25, Paavo Helde <myfirstname@osa.pri.ee> wrote:
    25.09.2021 17:41 Bonita Montero kirjutas:
        for( uint64_t i = PAUSE_ROUNDS; i; --i )
            _mm_pause();

    On my computer this outputs:

    34.9194

    (Intel Xeon E-2286M CPU @ 2.40 GHz)

    On mine it does not compile because it is x86-specific...


    --

    7-77-777
    Evil Sinner!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Branimir Maksimovic@21:1/5 to Bo Persson on Sat Sep 25 17:33:38 2021
    On 2021-09-25, Bo Persson <bo@bo-persson.se> wrote:

    I get

    30.7635

    (Core i9 9900K 5GHz)

    bmaxa@Branimirs-Air News % ./a.out
    9.14544

    M1 processor. (with code modification to compile)



    --

    7-77-777
    Evil Sinner!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Sun Sep 26 07:26:37 2021
    As the timings of the PAUSE instruction are so different, I decided
    to write a singleton containing the fastest measured timing of the
    PAUSE instruction on the machine. This is the code:

    cpu_pause.h:

    #pragma once
    #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
        #include <intrin.h>
    #elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        #include <immintrin.h>
    #else
        #error "need platform-specific pause-instruction"
    #endif

    inline
    void cpu_pause()
    {
    #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86)) || defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        _mm_pause();
    #endif
    }

    inline
    void cpu_pause_n( unsigned iterations )
    {
        for( ; iterations--; cpu_pause() );
    }

    struct pause_singleton
    {
        static double getNsPerPause();
    private:
        static struct singleton_t
        {
            singleton_t();
            double m_nsPerPause;
        } singleton;
    };

    inline
    double pause_singleton::getNsPerPause()
    {
        return singleton.m_nsPerPause;
    }

    cpu_pause.cpp:

    #include <chrono>
    #include <limits>
    #include <cstdint>
    #include <cstddef>
    #include "cpu_pause.h"

    using namespace std;
    using namespace chrono;

    #if defined(_MSC_VER)
        #pragma warning(disable: 26495) // member not initialized
        #pragma warning(disable: 26498) // consider constexpr
    #endif

    pause_singleton::singleton_t::singleton_t()
    {
        int64_t leastTicks = numeric_limits<int64_t>::max();
        for( size_t i = 1000; i; --i )
        {
            auto start = high_resolution_clock::now();
            for( size_t j = 1'000; j; --j )
                cpu_pause();
            int64_t ticks = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
            if( ticks < leastTicks )
                leastTicks = ticks;
        }
        m_nsPerPause = leastTicks / 1'000.0;
    }

    pause_singleton::singleton_t pause_singleton::singleton;

    So I can adjust the spinning-loop according
    to pause_singleton::getNsPerPause().
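    As a hedged illustration of that adjustment (the helper name
    pausesForNs is hypothetical, not from the posted code), the measured
    value could be turned into a spin count, e.g. to spin one tenth of
    the minimum kernel-wait interval as proposed earlier in the thread:

    ```cpp
    // Hypothetical helper: how many PAUSEs are needed to spin for roughly
    // targetNs nanoseconds, given the measured cost of one PAUSE.
    inline unsigned pausesForNs( double targetNs, double nsPerPause )
    {
        if( nsPerPause <= 0.0 )
            return 1;                      // defensive: bad measurement
        double n = targetNs / nsPerPause;
        return n < 1.0 ? 1u : (unsigned)n; // at least one PAUSE
    }
    ```

    With the ~30 ns per PAUSE reported for the Intel CPUs in this thread
    and a 1,000 ns minimum kernel wait, spinning one tenth of that gives
    pausesForNs( 100.0, 30.0 ) == 3 PAUSE iterations.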

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Branimir Maksimovic@21:1/5 to Bonita Montero on Sun Sep 26 06:56:34 2021
    On 2021-09-26, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    So as the timings of the PAUSE-instructions are so different I
    decided to write a singleton containing the fastest timing of
    the PAUSE-instruction on the machine. This is the code:

    cpu_pause.h:
    for( size_t j = 1'000; j; --j )
    cpu_pause();
    Please add the aarch64 variant I gave you.
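    A hedged sketch of what that addition could look like, folding the
    isb hint from the earlier aarch64 measurement into cpu_pause() (the
    fallback branch for other targets is an assumption, not part of the
    posted code):

    ```cpp
    #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
        #include <intrin.h>
    #elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        #include <immintrin.h>
    #endif

    // PAUSE on x86, the "isb" hint (as used in the aarch64 measurement
    // above) on ARM64, and a plain no-op elsewhere.
    inline void cpu_pause()
    {
    #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
        _mm_pause();
    #elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        _mm_pause();
    #elif defined(__aarch64__)
        __asm__ __volatile__( "isb" );
    #else
        /* no spin hint available on this target */
    #endif
    }
    ```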


    --

    7-77-777
    Evil Sinner!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Mon Sep 27 17:35:49 2021
    On 26.09.2021 at 07:26, Bonita Montero wrote:

    So I can adjust the spinning-loop according
    to pause_singleton::getNsPerPause().

    I dropped it! I simply made a spinning-loop according to the TSC, if
    the CPU has a TSC and it is invariant (these are also invariant
    across sockets!). Reading the TSC can be done roughly every 10
    nanoseconds on my PC (TR3990X, Zen3, Win10, SMT off). It's not
    accurate since it might overlap with instructions before or after,
    but accuracy isn't relevant when you spin hundreds of clock cycles.
    And I changed to a single PAUSE per spin-loop iteration instead of a
    row of PAUSEs which sum up to 30 ns (which is roughly the most common
    value on newer Intel CPUs). This more eager spinning may gain the
    lock earlier, although it may generate more interconnect traffic.
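    The TSC-bounded spinning described above might be sketched like this
    (spin_until is a hypothetical name; x86-only, matching the RDTSC
    discussion; on timeout the caller would fall back to blocking in the
    kernel):

    ```cpp
    #include <cstdint>
    #if defined(_MSC_VER)
        #include <intrin.h>
    #elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        #include <x86intrin.h> // __rdtsc(), _mm_pause()
    #endif

    // Spin until pred() succeeds or roughly maxTicks TSC ticks have
    // elapsed; exactly one PAUSE per iteration, as described above.
    template<typename Pred>
    bool spin_until( Pred pred, uint64_t maxTicks )
    {
        uint64_t deadline = __rdtsc() + maxTicks;
        while( !pred() )
        {
            if( __rdtsc() >= deadline )
                return false; // spun too long; block in the kernel instead
            _mm_pause();
        }
        return true;
    }
    ```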

    But as I'm using RDTSC, I'm asking myself how fast RDTSC is on
    different CPUs. So I modified my test program to measure a loop of
    10 RDTSCs per iteration. Here it is:

    #include <iostream>
    #include <chrono>
    #include <limits>
    #include <functional>
    #include <cstdint>
    #include <cstddef>
    #if defined(_MSC_VER)
        #include <intrin.h>
    #endif

    using namespace std;
    using namespace chrono;

    int main( int argc, char **argv )
    {
        using bench_fn = function<void(size_t)>;
        auto bench = []( bench_fn const &fn, size_t nTests, size_t nIterations ) -> double
        {
            int64_t nsShortest = numeric_limits<int64_t>::max();
            for( size_t p = nTests; p; --p )
            {
                auto start = high_resolution_clock::now();
                fn( nIterations );
                int64_t ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
                nsShortest = ns < nsShortest ? ns : nsShortest;
            }
            return (double)nsShortest / (ptrdiff_t)nIterations;
        };
        auto rdtscLoop = []( size_t nIterations )
        {
            uint64_t TSCs[10] = { 0 }; // zero-initialize; += on indeterminate values would be UB
            for( ; nIterations; --nIterations )
                // unfortunately there's no #directive for REP'ing
    #if defined(_MSC_VER)
                TSCs[0] += __rdtsc(),
                TSCs[1] += __rdtsc(),
                TSCs[2] += __rdtsc(),
                TSCs[3] += __rdtsc(),
                TSCs[4] += __rdtsc(),
                TSCs[5] += __rdtsc(),
                TSCs[6] += __rdtsc(),
                TSCs[7] += __rdtsc(),
                TSCs[8] += __rdtsc(),
                TSCs[9] += __rdtsc();
    #elif defined(__GNUC__)
                TSCs[0] += __builtin_ia32_rdtsc(),
                TSCs[1] += __builtin_ia32_rdtsc(),
                TSCs[2] += __builtin_ia32_rdtsc(),
                TSCs[3] += __builtin_ia32_rdtsc(),
                TSCs[4] += __builtin_ia32_rdtsc(),
                TSCs[5] += __builtin_ia32_rdtsc(),
                TSCs[6] += __builtin_ia32_rdtsc(),
                TSCs[7] += __builtin_ia32_rdtsc(),
                TSCs[8] += __builtin_ia32_rdtsc(),
                TSCs[9] += __builtin_ia32_rdtsc();
    #endif
            uint64_t sum = 0; // prevent optimization
            for( uint64_t TSC : TSCs )
                sum += TSC;
            uint64_t volatile vsum = sum;
            (void)vsum;
        };
        static size_t const
            N_TESTS = 100,      // number of tests to get the shortest timing
            N_ITERATIONS = 500, // iterations of the test-loop
            N_REPEATS = 10;     // RDTSCs inside the test-loop
        double nsPerREP = bench( bench_fn( bind( rdtscLoop, placeholders::_1 ) ), N_TESTS, N_ITERATIONS ) / N_REPEATS;
        cout << "ns per RDTSC: " << nsPerREP << endl;
    }

    It would be nice if you could compile this on your machine and post
    the RDTSC timing here. That would give me a hint whether what I'm
    trying is feasible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bo Persson@21:1/5 to Bonita Montero on Mon Sep 27 19:29:11 2021
    On 2021-09-27 at 17:35, Bonita Montero wrote:
    [...]

    I get

    ns per RDTSC: 5.42

    on my Core i9-9900K 5GHz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)