for( unsigned t = 0; t != nThreads; ++t )
threads.emplace_back( theThread, xchgFn ), TURNS );
Bonita Montero <Bonita.Montero@gmail.com> writes:

>     for( unsigned t = 0; t != nThreads; ++t )
>         threads.emplace_back( theThread, xchgFn ), TURNS );

?
On 12/5/2021 11:56 AM, Ben Bacarisse wrote:

> Bonita Montero <Bonita.Montero@gmail.com> writes:
>>     for( unsigned t = 0; t != nThreads; ++t )
>>         threads.emplace_back( theThread, xchgFn ), TURNS );
> ?

It seems way too complicated.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 12/5/2021 11:56 AM, Ben Bacarisse wrote:
Bonita Montero <Bonita.Montero@gmail.com> writes:
for( unsigned t = 0; t != nThreads; ++t )?
threads.emplace_back( theThread, xchgFn ), TURNS );
It seems way to complicated.
In case it was not clear, my comment was about the syntax error.
Bonita Montero <Bonita.Montero@gmail.com> writes:

>     for( unsigned t = 0; t != nThreads; ++t )
>         threads.emplace_back( theThread, xchgFn ), TURNS );
> ?

It seems way too complicated. To test fetch_add vs compare_exchange just:

In my experience fetch_add always beats a CAS-loop on x86.
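
A minimal sketch of such a comparison (a shared counter incremented under
contention; the names, thread count, and relaxed memory order here are
illustrative choices, not anyone's posted code):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <utility>
    #include <vector>

    std::atomic<long> counter{0};   // shared counter hammered by all threads

    // One atomic RMW per increment (LOCK XADD on x86).
    void addLoop(long turns) {
        for (long i = 0; i < turns; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    }

    // CAS loop: re-reads and retries whenever another thread won the race.
    void casLoop(long turns) {
        for (long i = 0; i < turns; ++i) {
            long expected = counter.load(std::memory_order_relaxed);
            while (!counter.compare_exchange_weak(expected, expected + 1,
                                                  std::memory_order_relaxed))
                ;   // on failure, expected is refreshed with the current value
        }
    }

    int main() {
        constexpr long TURNS = 1'000'000;
        for (auto [name, fn] : {std::pair{"fetch_add", addLoop},
                                std::pair{"cas-loop ", casLoop}}) {
            counter = 0;
            unsigned nThreads = std::thread::hardware_concurrency();
            if (nThreads == 0)
                nThreads = 4;   // fallback if the count is unknown
            auto start = std::chrono::steady_clock::now();
            std::vector<std::thread> threads;
            for (unsigned t = 0; t != nThreads; ++t)
                threads.emplace_back(fn, TURNS);
            for (auto &th : threads)
                th.join();
            auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::printf("%s: %.1f ns/op\n", name,
                        double(ns) / (double(TURNS) * nThreads));
        }
    }

Under contention the CAS loop does at least one extra load per increment and
can retry arbitrarily often, which is why fetch_add tends to win on x86.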
On 05.12.2021 at 20:56, Ben Bacarisse wrote:

> Bonita Montero <Bonita.Montero@gmail.com> writes:
>>     for( unsigned t = 0; t != nThreads; ++t )
>>         threads.emplace_back( theThread, xchgFn ), TURNS );
> ?

Remove one ).
Bonita Montero <Bonita.Montero@gmail.com> writes:

> On 05.12.2021 at 20:56, Ben Bacarisse wrote:
>> Bonita Montero <Bonita.Montero@gmail.com> writes:
>>>     for( unsigned t = 0; t != nThreads; ++t )
>>>         threads.emplace_back( theThread, xchgFn ), TURNS );
>> ?
>
> Remove one ).

The code does not compile with any of the three )s removed.
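
For what it's worth, a compilable reading of the fragment (a sketch, assuming
theThread is meant to take the exchange callable plus a turn count) moves
TURNS inside the argument list:

    for( unsigned t = 0; t != nThreads; ++t )
        threads.emplace_back( theThread, xchgFn, TURNS );

Here TURNS becomes the third argument forwarded to the thread function
instead of dangling outside the call.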
> I've written a little program that tests the throughput of fetch_add
> on an increasing number of processors in your systems and if you chose

Some processors will acquire the cacheline into the nearest
cache to ensure exclusive access for the add, while
others will pass the entire operation to the last-level cache,
where it is executed atomically.

In any case, a couple-dozen-line assembler program would be a
far better test than your overly complicated C++.
On 06.12.2021 at 19:02, Scott Lurndal wrote:

> Some processors will acquire the cacheline into the nearest
> cache to ensure exclusive access for the add, while
> others will pass the entire operation to the last-level cache,
> where it is executed atomically.

There's for sure no architecture that does atomic operations in
the last-level cache, because this would be silly.
Bonita Montero <Bonita.Montero@gmail.com> writes:

> On 06.12.2021 at 19:02, Scott Lurndal wrote:
>> Some processors will acquire the cacheline into the nearest
>> cache to ensure exclusive access for the add, while
>> others will pass the entire operation to the last-level cache,
>> where it is executed atomically.
>
> There's for sure no architecture that does atomic operations in
> the last-level cache, because this would be silly.

Well, are you sure? Why do you think it would be silly?

https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf

Given that at least three high-end processor chips have taped out just
this year with the capability of executing "far" atomic operations in
the LLC (or at a PCI Express Root Complex host bridge), I think you
really don't have a clue what you are talking about.
On 07.12.2021 at 19:06, Scott Lurndal wrote:

> Bonita Montero <Bonita.Montero@gmail.com> writes:
>> On 06.12.2021 at 19:02, Scott Lurndal wrote:
>> [...]
>> There's for sure no architecture that does atomic operations in
>> the last-level cache, because this would be silly.
>
> Well, are you sure? Why do you think it would be silly?
>
> https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf
>
> Given that at least three high-end processor chips have taped out just
> this year with the capability of executing "far" atomic operations in
> the LLC (or at a PCI Express Root Complex host bridge), I think you
> really don't have a clue what you are talking about.

And which CPUs currently support this Gen-Z interconnect?
And which CPUs currently use these far atomics for thread
synchronization? None.
Did you really read the paper and notice what Gen-Z is?
Bonita Montero <Bonita.Montero@gmail.com> writes:

> On 07.12.2021 at 19:06, Scott Lurndal wrote:
> [...]
> And which CPUs currently support this Gen-Z interconnect?

I'd tell you, but various NDAs forbid it.

> And which CPUs currently use these far atomics for thread
> synchronization? None.

How do you know?

I'm aware of three. Two are sampling to customers, with core
counts from 8 to 64.
On 07.12.2021 at 19:49, Scott Lurndal wrote:

> How do you know?

Because this would be slower, since the lock modifications
wouldn't be done in the L1 caches but in far memory. That's
just a silly idea.
Bonita Montero <Bonita.Montero@gmail.com> writes:

> On 07.12.2021 at 19:49, Scott Lurndal wrote:
>> How do you know?
>
> Because this would be slower, since the lock modifications
> wouldn't be done in the L1 caches but in far memory. That's
> just a silly idea.

Hello, it's a cache-coherent multiprocessor. You would need to
fetch the line exclusively into the L1 first; so instead of sending
the fetch (or the invalidate, if converting a shared line to owned),
you send the atomic op itself, and it gets handled atomically at
the far end (e.g. the LLC, a PCI Express device, an SoC coprocessor),
saving interconnect (mesh, ring, whatever) bandwidth and
the round-trip time between L1 and LLC, and reducing contention
for the line.

If the line is already in the L1 cache, the processor will
automatically treat the operation as a near atomic; this is expected
to be a rare case with correctly designed atomic usage.
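
This split is visible from ordinary C++. A minimal sketch, assuming an
ARMv8.1-A target with LSE atomics (compile with e.g. -march=armv8.1-a; the
names counter and hit are illustrative):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> counter{0};

    // When the result of a relaxed fetch_add is discarded, GCC and Clang
    // typically emit the STADD form rather than LDADD. STADD has no
    // destination register, so no value needs to flow back to the core,
    // and the interconnect is free to execute the add "far", at the
    // point of coherence, instead of pulling the line into L1 exclusively.
    void hit() {
        counter.fetch_add(1, std::memory_order_relaxed);
    }

Whether such an operation is then performed near (in the L1) or far is a
microarchitectural decision, not something the C++ source controls.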
scott@slp53.sl.home (Scott Lurndal) writes:

> [...]
> you send the atomic op itself, and it gets handled atomically at
> the far end (e.g. the LLC, a PCI Express device, an SoC coprocessor),
> saving interconnect (mesh, ring, whatever) bandwidth and
> the round-trip time between L1 and LLC, and reducing contention
> for the line.

In case you need a public reference for a shipping processor:

https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system
On 12/7/2021 11:25 AM, Scott Lurndal wrote:

> [...]
> In case you need a public reference for a shipping processor:
>
> https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system

You have encountered the rabbit hole of Bonita! I have proved her/it
wrong several times. No good, goes nowhere.
On 08.12.2021 at 09:59, Chris M. Thomasson wrote:

> You have encountered the rabbit hole of Bonita! I have proved her/it
> wrong several times. No good, goes nowhere.

What he links isn't proof of what he says.
The above CPU doesn't implement the mentioned interconnect.
Bonita Montero <Bonita.Montero@gmail.com> writes:

> On 08.12.2021 at 09:59, Chris M. Thomasson wrote:
>> You have encountered the rabbit hole of Bonita! I have proved her/it
>> wrong several times. No good, goes nowhere.
>
> What he links isn't proof of what he says.

As you note, Chris, Christof/Bonita cannot admit he
was wrong.

> The above CPU doesn't implement the mentioned interconnect.

Of course not; ARM doesn't make CPUs. They provide the IP
used to make real CPUs, for example the Amazon AWS Graviton 2 and 3.
Yet ARM does provide interconnect IP which fully supports
near and far atomics.
> You have encountered the rabbit hole of Bonita! I have proved her/it
> wrong several times. No good, goes nowhere.

But that is what comp.lang.c++ is. Whenever you come here, BM is
present, wrong (or not even wrong), desperately trying to shadow
it by snipping out of context, removing attributions, misrepresenting
what others wrote, moving goalposts, etc. Keeping hearth and home
warm. ;-D