I've developed a monitor-object like that of Java for C++ with some
improvements. The major improvement is that there's a spin-loop not
only for locking and unlocking but also for waiting on an event. In
this case you don't have to lock the mutex but supply a predicate to
a wait_poll-function; the code repeatedly tries to lock the mutex by
polling, and if it can lock the mutex it calls the predicate, which
returns (or moves) a pair of a bool and the result-type.
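A minimal sketch of what such a wait_poll-function could look like (the
names wait_poll, lockFlag and the predicate shape are my assumptions for
illustration, not the actual monitor code):

```cpp
#include <atomic>
#include <utility>

// hypothetical sketch of the described wait_poll idea: repeatedly
// try-lock a mutex-flag by polling and, while holding it, ask the
// predicate whether the event has occurred and what the result is
template<typename Result, typename Pred>
Result wait_poll( std::atomic<bool> &lockFlag, Pred pred )
{
    for( ; ; )
    {
        bool expected = false;
        // polling try-lock; on failure just retry (a real
        // implementation would PAUSE here)
        if( !lockFlag.compare_exchange_weak( expected, true ) )
            continue;
        // predicate returns (or moves) a pair of a bool and the result-type
        std::pair<bool, Result> r = pred();
        lockFlag.store( false );           // unlock
        if( r.first )
            return std::move( r.second );  // event occurred
    }
}
```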
Waiting for a semaphore or an event-object (Win32) in the kernel
can easily take from 1,000 to 10,000 clock-cycles, even when the call
returns immediately because the semaphore or event has been set before.
So there has to be a spin count with a reasonable relationship to this
waiting-interval, e.g. spinning one tenth of the minimum interval being
spent in the kernel.
For my monitor-object I've taken the spincount recalculation-algorithm
from the glibc. And I'm also using the PAUSE-instruction. But I think
that the glibc induces too heavy cacheline-flipping by re-loading the
mutex-flags immediately after a single PAUSE-instruction. So I decided
to loop PAUSE several times and to use fewer spinning iterations in
return.
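The difference can be sketched like this (a simplified flag probe, not
the actual monitor code; the function names and pausesPerProbe are made
up for illustration):

```cpp
#include <atomic>
#include <cstddef>
#include <immintrin.h>

// glibc-style: re-load the flag after every single PAUSE, which can
// bounce the cacheline between cores on every iteration
inline bool spin_single( std::atomic<bool> &locked, std::size_t spins )
{
    for( ; spins; --spins )
    {
        if( !locked.load( std::memory_order_relaxed ) )
            return true;
        _mm_pause();
    }
    return false;
}

// described alternative: a row of PAUSEs per probe and fewer probes,
// so the cacheline is re-read (and potentially bounced) less often
inline bool spin_batched( std::atomic<bool> &locked, std::size_t probes,
                          std::size_t pausesPerProbe )
{
    for( ; probes; --probes )
    {
        if( !locked.load( std::memory_order_relaxed ) )
            return true;
        for( std::size_t p = pausesPerProbe; p; --p )
            _mm_pause();
    }
    return false;
}
```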
To get a reasonable number of PAUSE-spinnings I need the time PAUSE
takes on different processors. On my CPU, PAUSE halts the pipeline
for only about 0.78 nanoseconds, which is about 3.25 clock-cycles on
average. I've written a short program that repeatedly PAUSEs and takes
the average time. I want to encourage you to compile the code on your
machine and report the PAUSE-timing it outputs here.
This is the code:
#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
    auto start = high_resolution_clock::now();
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
        _mm_pause();
    double ns = (int64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
    cout << ns << endl;
}
On 25.09.2021 17:41, Bonita Montero wrote:
On my computer this outputs:
34.9194
(Intel Xeon E-2286M CPU @ 2.40 GHz)
I get
30.7635
(Core i9 9900K 5GHz)
On 2021-09-25, Bonita Montero <Bonita.Montero@gmail.com> wrote:
I've developed a monitor-object like that of Java for C++ with some
improvements. The major improvement is that there's not only a spin
for( uint64_t i = PAUSE_ROUNDS; i; --i )
    _mm_pause();
What's in _mm_pause?
On 25.09.2021 19:15, Branimir Maksimovic wrote:
On 2021-09-25, Bonita Montero <Bonita.Montero@gmail.com> wrote:
I've developed a monitor-object like that of Java for C++ with some
improvements. The major improvement is that there's not only a spin
for( uint64_t i = PAUSE_ROUNDS; i; --i )
    _mm_pause();
What's in _mm_pause?
Read again what I wrote or google for _mm_pause.
So as the timings of the PAUSE-instruction are so different, I
decided to write a singleton containing the fastest timing of
the PAUSE-instruction on the machine. This is the code:
cpu_pause.h:
for( size_t j = 1'000; j; --j )
    cpu_pause();
So I can adjust the spinning-loop according
to pause_singleton::getNsPerPause().
Please add for aarch64 what I gave to you.
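The singleton described above could look roughly like this; the class
body is my reconstruction from the quoted fragments, since only
getNsPerPause() and the 1'000-PAUSE calibration loop appear in the
posting:

```cpp
#include <chrono>
#include <cstddef>
#include <limits>
#include <immintrin.h>

// reconstruction of the described pause_singleton: lazily measure the
// fastest observed per-PAUSE time once and cache it
struct pause_singleton
{
    static double getNsPerPause()
    {
        static double nsPerPause = measure();  // done once, thread-safe
        return nsPerPause;
    }
private:
    static double measure()
    {
        using namespace std::chrono;
        double best = std::numeric_limits<double>::max();
        for( std::size_t t = 10; t; --t )      // keep the fastest run
        {
            auto start = high_resolution_clock::now();
            for( std::size_t j = 1'000; j; --j )
                _mm_pause();
            double ns = (double)duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count() / 1'000.0;
            best = ns < best ? ns : best;
        }
        return best;
    }
};
```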
On 26.09.2021 07:26, Bonita Montero wrote:
So I can adjust the spinning-loop according
to pause_singleton::getNsPerPause().
I dropped it! I simply made a spinning-loop according to the TSC
if the CPU has a TSC and it is invariant (these are also invariant
across sockets!). Reading the TSC can be done roughly every 10
nanoseconds on my PC (TR3990X, Zen3, Win10, SMT off). It's not
accurate since it might overlap with instructions before or after,
but accuracy isn't relevant when you spin hundreds of clock-cycles.
And I changed to a single PAUSE per spin-loop instead of a row of
PAUSEs which sum up to 30ns (which is roughly the most common
value on newer Intel CPUs). This more eager spinning may gain
the lock earlier, although it may generate more interconnect-traffic.
But as I'm using RDTSC, I'm asking myself how fast RDTSC is on
different CPUs. So I modified my test-program to measure different
routines, testing a loop of 10 RDTSCs per iteration. Here it is:
#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <functional>
#if defined(_MSC_VER)
    #include <intrin.h>
#endif

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    using bench_fn = function<void(size_t)>;
    auto bench = []( bench_fn const &fn, size_t nTests, size_t nIterations ) -> double
    {
        int64_t nsShortest = numeric_limits<int64_t>::max();
        for( size_t p = nTests; p; --p )
        {
            auto start = high_resolution_clock::now();
            fn( nIterations );
            int64_t ns = (int64_t)duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count();
            nsShortest = ns < nsShortest ? ns : nsShortest;
        }
        return (double)nsShortest / (ptrdiff_t)nIterations;
    };
    auto rdtscLoop = []( size_t nIterations )
    {
        uint64_t TSCs[10] = { 0 };
        for( ; nIterations; --nIterations )
            // unfortunately there's no #directive for REP'ing
#if defined(_MSC_VER)
            TSCs[0] += __rdtsc(),
            TSCs[1] += __rdtsc(),
            TSCs[2] += __rdtsc(),
            TSCs[3] += __rdtsc(),
            TSCs[4] += __rdtsc(),
            TSCs[5] += __rdtsc(),
            TSCs[6] += __rdtsc(),
            TSCs[7] += __rdtsc(),
            TSCs[8] += __rdtsc(),
            TSCs[9] += __rdtsc();
#elif defined(__GNUC__)
            TSCs[0] += __builtin_ia32_rdtsc(),
            TSCs[1] += __builtin_ia32_rdtsc(),
            TSCs[2] += __builtin_ia32_rdtsc(),
            TSCs[3] += __builtin_ia32_rdtsc(),
            TSCs[4] += __builtin_ia32_rdtsc(),
            TSCs[5] += __builtin_ia32_rdtsc(),
            TSCs[6] += __builtin_ia32_rdtsc(),
            TSCs[7] += __builtin_ia32_rdtsc(),
            TSCs[8] += __builtin_ia32_rdtsc(),
            TSCs[9] += __builtin_ia32_rdtsc();
#endif
        uint64_t sum = 0; // prevent optimization
        for( uint64_t TSC : TSCs )
            sum += TSC;
        uint64_t volatile vsum = sum;
        (void)vsum;
    };
    static size_t const
        N_TESTS      = 100, // number of tests to get the shortest timing
        N_ITERATIONS = 500, // iterations of the test-loop
        N_REPEATS    = 10;  // REPetitions inside the test-loop
    double nsPerREP = bench( bench_fn( bind( rdtscLoop,
        placeholders::_1 ) ), N_TESTS, N_ITERATIONS ) / N_REPEATS;
    cout << "ns per RDTSC: " << nsPerREP << endl;
}
It would be nice if you could compile this on your machine and
give me the RDTSC-timing here. This would give me
a hint whether what I try is feasible.
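For what it's worth, the TSC-based spinning described above could be
sketched like this; spin_until_tsc and the cycle budget are illustrative
names, not the posted code, and an invariant TSC is assumed:

```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>
#if defined(_MSC_VER)
    #include <intrin.h>
#else
    #include <x86intrin.h>
#endif

// sketch of spinning against a TSC deadline with a single PAUSE per
// iteration; if the budget runs out, the caller falls back to the
// kernel wait (semaphore/event)
inline bool spin_until_tsc( std::atomic<bool> &ready, std::uint64_t cycleBudget )
{
    std::uint64_t deadline = __rdtsc() + cycleBudget;
    while( __rdtsc() < deadline )
    {
        if( ready.load( std::memory_order_relaxed ) )
            return true;   // got the event while spinning
        _mm_pause();
    }
    return false;          // budget exhausted: take the kernel path
}
```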