EricP wrote:
MitchAlsup wrote:
If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.
If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the stores can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the event fails and none of the stores appear to have been
performed.
In any event, if the thread performing the event cannot be completed
due to transfer of control to a more privileged operation, the event fails,
control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.
Yes, my HTM has some similarities.
Yes, I see lots of similarities--most of the differences are down in
the minutia.
MitchAlsup wrote:
EricP wrote:
MitchAlsup wrote:
If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.
If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the stores can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the event fails and none of the stores appear to have been
performed.
In any event, if the thread performing the event cannot be completed
due to transfer of control to a more privileged operation, the event fails,
control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.
Yes, my HTM has some similarities.
Yes, I see lots of similarities--most of the differences are down in
the minutia.
Not too many similarities - my latest ATX design has diverged quite a bit from your original ASF proposal in 2010 that got me thinking about HTM.
For example, I think we both switched from the ASF/RTM approach viewing
a transaction abort as an exception where registers are rolled back to the starting state, to one which views an abort like a branch that preserves
the register values so that data can be passed from inside a transaction
to outside.
And following on from that, I think I adopted your idea of allowing
reads and writes to non-transactional memory while other transaction
member memory is protected. Again this is to allow values to be passed
from inside a transaction to outside.
But both of those changes are based on the problems people encountered
trying to use RTM and finding there was no way to get transaction
management information from inside the transaction to outside.
My latest ATX instructions are completely different from ASF and ESM.
ATX uses dynamically defined guard byte address ranges - guard for read
and guard for write. Once guard byte ranges are established, LD's and ST's inside the guard ranges are transactionally protected, those outside are not. Guard byte ranges can be dynamically added, protection raised and lowered
or released as the transaction proceeds.
I believe under the hood my implementation is mostly different.
My ATX has transaction management distributed to all nodes,
your ESM is centralized. ATX negotiates the transaction guard range collision winner dynamically as the transaction proceeds so that if there
is contention only the winner makes it to a COMMIT and losers abort,
whereas ESM collects all the changes and makes a bulk decision at the end
on whether there was interference and who should win or lose.
The commonality in implementation is buffering the updates outside the cache so the transaction is not sensitive to cache associativity evicts
as RTM was. ESM uses fully assoc miss buffers; I was thinking ATX would
have a fully assoc index but borrow line space from L1 to allow a larger transaction member line set (I wanted 16 lines as a minimum).
EricP wrote:
MitchAlsup wrote:
EricP wrote:
MitchAlsup wrote:
If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.
If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the stores can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the event fails and none of the stores appear to have been
performed.
In any event, if the thread performing the event cannot be completed
due to transfer of control to a more privileged operation, the event fails,
control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.
Yes, my HTM has some similarities.
Yes, I see lots of similarities--most of the differences are down in
the minutia.
Not too many similarities - my latest ATX design has diverged quite a bit
from your original ASF proposal in 2010 that got me thinking about HTM.
Make that 2005±
For example, I think we both switched from the ASF/RTM approach viewing
a transaction abort as an exception where registers are rolled back to
the
starting state, to one which views an abort like a branch that preserves
the register values so that data can be passed from inside a transaction
to outside.
No, I did not do it that way. I chose not to restore the registers, and
made the compiler have to forget the now stale variables from the event.
I did this mostly because my implementation does not count on branches so there may not be a checkpoint to assist in backup. Control transfer to the control point is not considered a branch--because it is automagic.
And following on from that, I think I adopted your idea of allowing
reads and writes to non-transactional memory while other transaction
member memory is protected. Again this is to allow values to be passed
from inside a transaction to outside.
I do allow this. AND this is why each participant has to announce itself. {{That is: there is not something that starts an event and another thing
that ends an event and everything inside is participating in the event.}}
But both of those changes are based on the problems people encountered
trying to use RTM and finding there was no way to get transaction
management information from inside the transaction to outside.
That and debugging (but perhaps that is what you meant.)
My latest ATX instructions are completely different from ASF and ESM.
ATX uses dynamically defined guard byte address ranges - guard for read
and guard for write. Once guard byte ranges are established, LD's and
ST's
inside the guard ranges are transactionally protected, those outside
are not.
Guard byte ranges can be dynamically added, protection raised and lowered
or released as the transaction proceeds.
I believe under the hood my implementation is mostly different.
My ATX has transaction management distributed to all nodes,
your ESM is centralized. ATX negotiates the transaction guard range
collision winner dynamically as the transaction proceeds so that if there
is contention only the winner makes it to a COMMIT and losers abort,
whereas ESM collects all the changes and makes a bulk decision at the end
on whether there was interference and who should win or lose.
The commonality in implementation is buffering the updates outside the
cache so the transaction is not sensitive to cache associativity evicts
as RTM was. ESM uses fully assoc miss buffers; I was thinking ATX would
have a fully assoc index but borrow line space from L1 to allow a larger
transaction member line set (I wanted 16 lines as a minimum).
16 lines but only 1 read set (start:end) and 1 write set (start:end) ??
MitchAlsup wrote:
EricP wrote:
MitchAlsup wrote:
EricP wrote:
MitchAlsup wrote:
If an exception occurs in the store (manifestation) section of an ESM
ATOMIC event, the event fails and none of the stores appears to have
been performed.
If an interrupt occurs in the store (manifestation) section of an ESM
ATOMIC event, if all the stores can be committed they will be and the
event succeeds, and if they cannot (or cannot be determined that they
can) the event fails and none of the stores appear to have been
performed.
In any event, if the thread performing the event cannot be completed
due to transfer of control to a more privileged operation, the event fails,
control appears to have been transferred to the event control point, and
then control is transferred to the more privileged thread.
Yes, my HTM has some similarities.
Yes, I see lots of similarities--most of the differences are down in
the minutia.
Not too many similarities - my latest ATX design has diverged quite a bit
from your original ASF proposal in 2010 that got me thinking about HTM.
Make that 2005±
For example, I think we both switched from the ASF/RTM approach viewing
a transaction abort as an exception where registers are rolled back to the
starting state, to one which views an abort like a branch that preserves
the register values so that data can be passed from inside a transaction
to outside.
No, I did not do it that way. I chose not to restore the registers, and
made the compiler have to forget the now stale variables from the event.
I did this mostly because my implementation does not count on branches so
there may not be a checkpoint to assist in backup. Control transfer to the
control point is not considered a branch--because it is automagic.
Ok, bad analogy but the result is the same: the registers are not restored.
And following on from that, I think I adopted your idea of allowing
reads and writes to non-transactional memory while other transaction
member memory is protected. Again this is to allow values to be passed
from inside a transaction to outside.
I do allow this. AND this is why each participant has to announce itself.
{{That is: there is not something that starts an event and another thing
that ends an event and everything inside is participating in the event.}}
But both of those changes are based on the problems people encountered
trying to use RTM and finding there was no way to get transaction
management information from inside the transaction to outside.
That and debugging (but perhaps that is what you meant.)
Debugging too but I was thinking that a transaction might want to use
a register to hold an internal counter indicating how far it made it
into the transaction when it aborted. That might help the abort code
avoid a subsequent collision.
My latest ATX instructions are completely different from ASF and ESM.
ATX uses dynamically defined guard byte address ranges - guard for read
and guard for write. Once guard byte ranges are established, LD's and
ST's
inside the guard ranges are transactionally protected, those outside
are not.
Guard byte ranges can be dynamically added, protection raised and lowered
or released as the transaction proceeds.
I believe under the hood my implementation is mostly different.
My ATX has transaction management distributed to all nodes,
your ESM is centralized. ATX negotiates the transaction guard range
collision winner dynamically as the transaction proceeds so that if there
is contention only the winner makes it to a COMMIT and losers abort,
whereas ESM collects all the changes and makes a bulk decision at the end
on whether there was interference and who should win or lose.
The commonality in implementation is buffering the updates outside the
cache so the transaction is not sensitive to cache associativity evicts
as RTM was. ESM uses fully assoc miss buffers; I was thinking ATX would
have a fully assoc index but borrow line space from L1 to allow a larger
transaction member line set (I wanted 16 lines as a minimum).
16 lines but only 1 read set (start:end) and 1 write set (start:end) ??
There can be as many guard byte ranges as you want and can straddle
multiple cache line boundaries as long as the total bytes under
transaction guard protection is 16 cache lines (of 64-bytes each).
The intent is that you issue guards giving the object address and size,
do some protected loads and stores, then add new guards as new objects
join the transaction, do more protected loads and stores, and so on.
Then commit all the updates and release the guards.
You can request a Read Guard on an object byte range, evaluate an object, then upgrade that range to a Write Guard and make changes,
or release the guards on a range to remove an object from the transaction.
The number 16 comes from wanting up to 8 smallish objects in a transaction, with each object possibly straddling two cache lines.
Eg an AVL tree node with left, right, parent pointers and depth count.
(I didn't want users to have to worry about whether their objects
straddle cache line boundaries.)
That 16 sets the size of the CAM and number of cache line buffers holding pending byte updates plus other structures in the transaction manager.
The transaction manager breaks each guard range request into a series
of up to 16 cache line byte ranges.
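A rough sketch of that split in C (assuming 64-byte lines; the structure
and function names here are invented, not the actual ATM tables):

#include <stdint.h>

#define LINE_BYTES      64u
#define MAX_GUARD_LINES 16u

struct line_range { uint64_t line_addr; uint8_t first_byte, last_byte; };

/* Break one guard request (addr, count) into per-cache-line byte ranges.
   Returns the number of lines used, or -1 if the 16-line budget is blown. */
static int split_guard(uint64_t addr, uint32_t count,
                       struct line_range out[MAX_GUARD_LINES])
{
    uint64_t end = addr + count;              /* one past last guarded byte */
    int n = 0;
    while (addr < end) {
        uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);
        uint64_t stop = line + LINE_BYTES < end ? line + LINE_BYTES : end;
        if (n == MAX_GUARD_LINES)
            return -1;
        out[n].line_addr  = line;
        out[n].first_byte = (uint8_t)(addr - line);
        out[n].last_byte  = (uint8_t)(stop - 1 - line);
        n++;
        addr = stop;
    }
    return n;
}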
My Atomic Transaction instructions are:
// Start a transaction attempt, remember abort RIP
// Option is to be notified after collision winner commits
ATSTART abort_offset [,options]
// Guard a byte range for read
// Options are synchronous or asynchronous
// Synchronous blocks LD and ST instructions in the guard range from
// reading a cache line until the guard grant has been negotiated.
// Asynchronous does not block LD and ST but may cause ping-pongs.
ATGRDR address, byte_count [,options]
// Guard a byte range for write
// Options are synchronous or asynchronous
ATGRDW address, byte_count [,options]
// Release a guard a range from the transaction
ATGREL address, byte_count
// Commit transaction updates and release all guards
ATCOMMIT status_reg
// Cancel transaction, toss write-guarded updates, release guards
ATCANCEL status_reg
// Trigger an abort, toss write-guarded updates, jump to abort address
// Can pass an immediate byte value to the transaction status
ATABORT #imm8
// Read status of current transaction, if any.
// After an abort the status contains info on reason for the abort.
ATSTATUS status_reg
// Wait for commit notify from winner of a collision
ATWAITNFY
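To make the flow concrete, here is a rough usage sketch in C with made-up
intrinsic wrappers named after the instructions above (not a real API);
assume atstart() returns 0 on initial entry and nonzero when control
arrives back at the abort point, and atcommit() returns 0 on success:

#include <stddef.h>

struct avl_node { struct avl_node *left, *right, *parent; int depth; };

/* Hypothetical wrappers for ATSTART/ATGRDR/ATGRDW/ATCOMMIT/ATSTATUS. */
extern int  atstart(unsigned options);        /* 0 = entry, nonzero = abort */
extern void atgrdr(const void *addr, size_t bytes, unsigned options);
extern void atgrdw(void *addr, size_t bytes, unsigned options);
extern long atcommit(void);                   /* 0 = committed */
extern long atstatus(void);

void bump_depth(struct avl_node *n)
{
    for (;;) {
        if (atstart(0) != 0) {        /* aborted: registers keep their values */
            (void)atstatus();         /* reason for the abort, if we care */
            continue;                 /* naive retry policy */
        }
        atgrdr(n, sizeof *n, 0);      /* read-guard the whole node, inspect it */
        if (n->depth < 100) {
            atgrdw(&n->depth, sizeof n->depth, 0);  /* upgrade just that field */
            n->depth++;               /* store is intercepted and held aside */
        }
        if (atcommit() == 0)          /* publish updates, release all guards */
            return;
    }
}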
EricP wrote:
There can be as many guard byte ranges as you want and can straddle
multiple cache line boundaries as long as the total bytes under
transaction guard protection is 16 cache lines (of 64-bytes each).
The intent is that you issue guards giving the object address and size,
do some protected loads and stores, then add new guards as new objects
join the transaction, do more protected loads and stores, and so on.
Then commit all the updates and release the guards.
You can request a Read Guard on an object byte range, evaluate an object,
then upgrade that range to a Write Guard and make changes,
or release the guards on a range to remove an object from the
transaction.
The number 16 comes from wanting up to 8 smallish objects in a
transaction,
with each object possibly straddling two cache lines.
Eg an AVL tree node with left, right, parent pointers and depth count.
(I didn't want users to have to worry about whether their objects
straddle cache line boundaries.)
That 16 sets the size of the CAM and number of cache line buffers holding
pending byte updates plus other structures in the transaction manager.
The transaction manager breaks each guard range request into a series
of up to 16 cache line byte ranges.
My Atomic Transaction instructions are:
// Start a transaction attempt, remember abort RIP
// Option is to be notified after collision winner commits
ATSTART abort_offset [,options]
// Guard a byte range for read
// Options are synchronous or asynchronous
// Synchronous blocks LD and ST instructions in the guard range from
// reading a cache line until the guard grant has been negotiated.
// Asynchronous does not block LD and ST but may cause ping-pongs.
ATGRDR address, byte_count [,options]
// Guard a byte range for write
// Options are synchronous or asynchronous
ATGRDW address, byte_count [,options]
// Release a guard a range from the transaction
ATGREL address, byte_count
// Commit transaction updates and release all guards
ATCOMMIT status_reg
// Cancel transaction, toss write-guarded updates, release guards
ATCANCEL status_reg
// Trigger an abort, toss write-guarded updates, jump to abort address
// Can pass an immediate byte value to the transaction status
ATABORT #imm8
// Read status of current transaction, if any.
// After an abort the status contains info on reason for the abort.
ATSTATUS status_reg
// Wait for commit notify from winner of a collision
ATWAITNFY
I see, you are using an instruction to mark each state transition--
whereas I use edge-detection (side effect) of a standard instruction.
Does this not necessarily increase the minimum path length ??
MitchAlsup wrote:
EricP wrote:
There can be as many guard byte ranges as you want and can straddle
multiple cache line boundaries as long as the total bytes under
transaction guard protection is 16 cache lines (of 64-bytes each).
The intent is that you issue guards giving the object address and size,
do some protected loads and stores, then add new guards as new objects
join the transaction, do more protected loads and stores, and so on.
Then commit all the updates and release the guards.
You can request a Read Guard on an object byte range, evaluate an object,
then upgrade that range to a Write Guard and make changes,
or release the guards on a range to remove an object from the
transaction.
The number 16 comes from wanting up to 8 smallish objects in a
transaction,
with each object possibly straddling two cache lines.
Eg an AVL tree node with left, right, parent pointers and depth count.
(I didn't want users to have to worry about whether their objects
straddle cache line boundaries.)
That 16 sets the size of the CAM and number of cache line buffers holding
pending byte updates plus other structures in the transaction manager.
The transaction manager breaks each guard range request into a series
of up to 16 cache line byte ranges.
My Atomic Transaction instructions are:
// Start a transaction attempt, remember abort RIP
// Option is to be notified after collision winner commits
ATSTART abort_offset [,options]
// Guard a byte range for read
// Options are synchronous or asynchronous
// Synchronous blocks LD and ST instructions in the guard range from
// reading a cache line until the guard grant has been negotiated.
// Asynchronous does not block LD and ST but may cause ping-pongs.
ATGRDR address, byte_count [,options]
// Guard a byte range for write
// Options are synchronous or asynchronous
ATGRDW address, byte_count [,options]
// Release a guard a range from the transaction
ATGREL address, byte_count
// Commit transaction updates and release all guards
ATCOMMIT status_reg
// Cancel transaction, toss write-guarded updates, release guards
ATCANCEL status_reg
// Trigger an abort, toss write-guarded updates, jump to abort address
// Can pass an immediate byte value to the transaction status
ATABORT #imm8
// Read status of current transaction, if any.
// After an abort the status contains info on reason for the abort.
ATSTATUS status_reg
// Wait for commit notify from winner of a collision
ATWAITNFY
I see, you are using an instruction to mark each state transition--
whereas I use edge-detection (side effect) of a standard instruction.
Does this not necessarily increase the minimum path length ??
Not sure what you mean by minimum path length.
The number of instructions probably has the least effect on performance.
Often transactions are just moving memory locations about with little
or no calculations, so the majority of performance effects will be due
to coherence messaging, to negotiate guards and move cache lines about.
In some cases this can be overlapped, others not.
I was able to throw together a simulator to test the validity of the
guard protocol handshake and it does work. But that was in isolation.
To test ATX performance would require a full multi-core OoO simulator
with Load Store Queue, as my transaction manager interacts with LSQ,
and cache coherence message simulation, and I don't have that.
Some issues I see that could affect transaction performance:
1) I have validated the ATX protocol and it has one important optimization,
but have not given much thought to optimizations that a directory controller
might help with. Currently it assumes that guard request coherence messages
will be broadcast to all nodes in a system, and all will Grant/Deny reply.
This is intentional as it keeps the ATX coherence messages completely separate from cache coherence messages, and that is important because
it means you don't have to re-validate your coherence protocol
or change the cache subsystem or directory controller.
Since a directory controller knows which nodes have copies of lines
in what shared/exclusive state it might be able to optimize away much
of the ATX messaging. However that would require integrating ATX protocol with the directory controller to also track guard requests for lines.
2) Since the Atomic Transaction Manager (ATM) knows the guard range
and whether for read or write, it can optimize cache transfers to
upgrade a read_share cache line to read_exclusive, and eliminate the transitory line share state that occurs for some shared lines.
That would eliminate a whole set of handshake messages that may now occur
to transfer a line in a shared state, then another to make it exclusive.
3) I believe an OoO core will have to shut off conditional branch
speculation inside transactions as speculating guard requests
or loads or stores could cause unnecessary transaction aborts
or cache line ping-ponging. It may have to go beyond that and
shut off OoO all together so that the registers are written in
a predictable order if an asynchronous abort is triggered.
Possibly the amount of OoO allowed could be an option on the ATSTART instruction so the user can decide based on their algorithm.
4) Guard requests can be either synchronous or asynchronous.
The choice does not affect transaction validity, just performance.
Synchronous means the guard acts as a membar to following loads
and stores to the guarded bytes while the guard is negotiated.
Synchronous is intended to prevent you from too soon touching a
cache line and grabbing it away from its current owner-updater.
Asynchronous allows following ld/st to the guarded range to execute
concurrently while the guard request is pending, allowing the transfer
of cache lines to overlap with guard negotiation.
Asynchronous could cause cache lines to ping-pong.
The choice of synchronous or asynchronous is dependent on the algorithm
and can be different for different objects in a transaction.
EricP <ThatWouldBeTelling@thevillage.com> writes:
3) I believe an OoO core will have to shut off conditional branch
speculation inside transactions as speculating guard requests
or loads or stores could cause unnecessary transaction aborts
or cache line ping-ponging. It may have to go beyond that and
shut off OoO all together so that the registers are written in
a predictable order if an asynchronous abort is triggered.
I am not sure what scenario you have in mind, but it seems to involve requesting a cache line of a different core. Note that fixing Spectre requires that while such a request is speculative, it must not change
the state of a remote cache line; otherwise this would constitute a
side channel out of the speculative state. So you will certainly not
see cache ping-ponging from properly implemented speculative accesses, whether inside a transaction or not.
- anton
I'm noting that speculation and transactions may interact badly.
Implementations may need some mechanism to limit it,
but that can affect concurrency and performance.
I'm not overly concerned about Spectre. The people who are potentially
affected are time-share services (cloud servers) that do not control
what programs are running concurrently on their processors.
But that is an expensive overkill for
most situations.
I was looking for simpler and cheaper solutions that are optional and
cover most situations.
- The branch predictor tables be separated by user and super mode,
and that OS's be advised to purge the user mode tables on thread switch.
- Adding a conditional branch hint NoSpeculate which basically causes the
front end Dispatcher to stall and single step that one branch until the test
condition resolves. Stalling at Dispatch allows the front end to continue
to fill with the predicted code path but does not allow any to execute.
- I have an extensive set of conditional trap instructions intended for
bounds checks, asserts, etc. In OoO a load or store following a
bounds check might execute before the check exception was delivered.
These trap instructions could have an optional NoSpeculate flag
which again stalls the Dispatcher and essentially single steps just
that instruction until the test condition resolves.
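For example, a checked table load might use one of those traps like this
(a C sketch; trap_if_ge_u64 and TRAP_NOSPEC are invented names standing in
for the conditional trap instruction and its NoSpeculate flag):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical: traps if idx >= bound. With TRAP_NOSPEC the Dispatcher
   holds younger instructions until the comparison has resolved, so the
   load below cannot execute speculatively past the bounds check. */
extern void trap_if_ge_u64(uint64_t idx, uint64_t bound, unsigned flags);
#define TRAP_NOSPEC 1u

uint8_t load_checked(const uint8_t *table, size_t n, size_t i)
{
    trap_if_ge_u64(i, n, TRAP_NOSPEC);
    return table[i];    /* reached only when i < n is architecturally known */
}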
EricP <ThatWouldBeTelling@thevillage.com> writes:
I'm not overly concerned about Spectre. The people who are potentially
affected are time-share services (cloud servers) that do not control
what programs are running concurrently on their processors.
This widespread belief is what has caused CPU manufacturers to not work
on fixing Spectre.
But is it true? Has everybody disabled JavaScript and Webassembly in
their browser and their PDF viewer, and disabled macros in their
Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the network?
Given that what I have read claims that it is _not_ possible to completely
protect against Spectre and its variants without a significant degradation
of performance,
of performance,
my favored solution is to divide the CPU into two parts;
one made immune to Spectre, which runs Internet-facing code which might be
menaced by it, and another in which mitigations are not applied.
EricP <ThatWouldBeTelling@thevillage.com> writes:
I'm noting that speculation and transactions may interact badly.
Implementations may need some mechanism to limit it,
but that can affect concurrency and performance.
Even if you want to build a Spectre-vulnerable CPU that actually
changes cache states during speculation, I don't expect much effect on performance, because the branch predictor is usually right. And if it
isn't, and the speculative access actually interferes with a
transaction on another core, it will learn that that path was wrong
and will stop the wrong speculation after one or two tries.
OTOH, if you want to build a Spectre-immune CPU by delaying the state
change until the memory access is committed, that won't hurt the
performance much, either, because it's rare to need state changes, and because the waiting time is typically around 20 cycles or so, which is
small compared to the time needed to access and change the state of a
remote cache line.
I'm not overly concerned about Spectre. The people who are potentially
affected are time-share services (cloud servers) that do not control
what programs are running concurrently on their processors.
This widespread belief is what has caused CPU manufacturers to not
work on fixing Spectre.
But is it true? Has everybody disabled JavaScript and Webassembly in
their browser and their PDF viewer, and disabled macros in their
Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the
network?
But that is an expensive overkill for
most situations.
It would not be expensive (see below), and it's not overkill.
While software vulnerabilities may be plenty and relatively easy to
use, they can be fixed at any time, or the intended victim of an
attack may use a different software. Meanwhile, hardware
vulnerabilities like Spectre and Rowhammer are always there while the hardware is not replaced with fixed hardware (and the inaction of
hardware manufacturers ensures that no such replacement exists, and
even when it exists, it will take many years until most of the
hardware is replaced), so they are very attractive to attackers.
I was looking for simpler and cheaper solutions that are optional and
cover most situations.
- The branch predictor tables be separated by user and super mode,
and that OS's be advised to purge the user mode tables on thread switch.
Now that's an expensive approach in both silicon and performance: The
branch predictor, one of the biggest parts of a core would need to
become twice as big to get the same accuracy for a program that spends
almost all of its time in user mode, or almost all of its time in
system mode. And a user-level program would still be slowed down a
lot every time there is a thread switch.
While being expensive, this approach does not make the CPU
Spectre-immune. A thread can be attacked from within itself (e.g.,
from a JavaScript program running in the same thread), and the
attacker can train the branch predictor to do the attacker's bidding
by passing the appropriate data to system calls or as input data to a user-level processing program.
- Adding a conditional branch hint NoSpeculate which basically causes the
front end Dispatcher to stall and single step that one branch until the test
condition resolves. Stalling at Dispatch allows the front end to continue
to fill with the predicted code path but does not allow any to execute.
Yes, if you disable speculation by using that for all branches, it
will help against Spectre. But the slowdown will be huge, slowing the
CPU down almost to in-order levels (e.g., a Cortex-A55 or Bonnell).
OTOH, if you want to apply this slowdown selectively, the question is
where, and if the remaining speculation does not leave the window open
for an attacker to use Spectre. The Linux kernel is trying to use
selective software mitigation, and of course a remaining hole was
found (and used by a security researcher for demonstrating not just
that hole, but something else, that's how I learned of it), so this
approach is everything but watertight.
- I have an extensive set of conditional trap instructions intended for
bounds checks, asserts, etc. In OoO a load or store following a
bounds check might execute before the check exception was delivered.
These trap instructions could have an optional NoSpeculate flag
which again stalls the Dispatcher and essentially single steps just
that instruction until the test condition resolves.
Also very expensive for programs (programming languages) that use
these instructions; speculative load hardening (SLH), which prevents
Spectre v1 by turning the control dependence on the bounds check into
a data dependence, costs a factor of 2.3-2.5 (depending on the SLH
variant) on the SPEC programs, and I expect your bounds-check trap instructions to be at least as expensive.
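For reference, SLH roughly does the following (a hand-written C sketch;
real SLH keeps the mask in a register that is updated branchlessly, e.g.
with cmov, so it stays correct even while the branch is mispredicted):

#include <stddef.h>
#include <stdint.h>

uint8_t read_hardened(const uint8_t *a, size_t n, size_t i)
{
    /* The control dependence (i < n) becomes a data dependence: on the
       wrong speculative path the mask is 0, so the load address is
       forced to a harmless constant instead of the attacker-chosen a+i. */
    uintptr_t mask = (i < n) ? ~(uintptr_t)0 : (uintptr_t)0;
    const uint8_t *p = (const uint8_t *)(((uintptr_t)a + i) & mask);
    if (i < n)
        return *p;      /* architecturally p == a + i here */
    return 0;
}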
By contrast, a proper invisible-speculation fix would be much cheaper
in performance: papers on the memory-access part of invisible
speculation give slowdown factors up to 1.2 (with some papers giving
smaller slowdowns and even occasional speedups), and I think that the
memory-access part will have the biggest performance impact among the
changes necessary for a full-blown invisible-speculation fix.
- anton
On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Given that what I have read claims that it is _not_ possible to completely protect against Spectre and its variants without a significant degradation
of performance,
my favored solution is to divide the CPU into two parts;
one made immune to Spectre, which runs Internet-facing code which might be menaced by it, and another in which mitigations are not applied.
This solution, though, basically has not been considered, and that is for
an obvious reason: it is insecure. Once malicious software has found a way through some other vulnerability to insinuate itself into the "trusted"
code of the computer, then it won't be stopped from making use of Spectre
to further its progress.
So it's not enough to put the Internet in a sandbox, we need new and better ideas about how to put a secure wall around that sandbox - to securely
limit how it interacts with the rest of the computer.
John Savard
EricP wrote:
1) I have validated the ATX protocol and it has one important
optimization,
but not given much thought to optimizations that a directory controller
might help with. Currently it assumes that guard requests coherence
messages
will be broadcast to all nodes in a system, and all will Grant/Deny
reply.
This is intentional as it keeps the ATX coherence messages completely
separate from cache coherence messages, and that is important because
it means you don't have to re-validate your coherence protocol
or change the cache subsystem or directory controller.
All I added to the coherence protocol is NAK and its use is restricted
to ATOMIC events; then later I added priority compare so that higher
priority events are not NAKed by lower priority events. Thus, the
protocol is the same except for NAKs.
4) Guard requests can be either synchronous or asynchronous.
The choice does not affect transaction validity, just performance.
Do you envision mixing and matching synch and asynch guards ?
Synchronous means the guard acts as a membar to following loads
and stores to the guarded bytes while the guard is negotiated.
Synchronous is intended to prevent you from too soon touching a
cache line and grabbing it away from its current owner-updater.
Asynchronous allows following ld/st to the guarded range to execute
concurrent while the guard request is pending, allowing the transfer
of a cache lines to overlap with guard negotiation.
Asynchronous could cause cache lines to ping-pong.
So can too little associativity in your data cache.
The choice of synchronous or asynchronous is dependent on the algorithm
and can be different for different objects in a transaction.
So, yes to my question above.
Why not allow speculative branches to cover asynch accesses to guarded
lines ??
MitchAlsup wrote:
EricP wrote:
1) I have validated the ATX protocol and it has one important
optimization,
but not given much thought to optimizations that a directory controller
might help with. Currently it assumes that guard requests coherence
messages
will be broadcast to all nodes in a system, and all will Grant/Deny
reply.
This is intentional as it keeps the ATX coherence messages completely
separate from cache coherence messages, and that is important because
it means you don't have to re-validate your coherence protocol
or change the cache subsystem or directory controller.
All I added to the coherence protocol is NAK and its use is restricted
to ATOMIC events; then later I added priority compare so that higher
priority events are not NAKed by lower priority events. Thus, the
protocol is the same except for NAKs.
And this is where I think my ATX atomic transactions differs from ESM,
it is in how transactions are negotiated.
I also have Cache Coherence (CC) protocol managing the shared/exclusive/owned line state and transfer of whole lines into, out of, and between caches. However I don't need a NAK in CC because line movement is never denied.
My ATX coherence messages are a completely different protocol from CC.
ATX messages deal with *permission* to read and write *individual bytes*
in cache lines, and knows nothing about the cache line state or where
it is located. The Atomic Transaction Manager (ATM) uses ATX messages to talk to other peer ATM's about access to guard ranges, it intercepts
stores from the LSQ to guarded byte ranges and tucks them aside,
and triggers local aborts if a transaction permission is denied.
Only at commit does ATM send guarded line updates to its local cache,
which does its normal thing to check if line is present in an exclusive/modified/owned state, and if not does a read_exclusive.
And the cache controller knows nothing about transactions,
it just sees a burst of updates to local cache.
When cache lines move is a matter of performance optimization.
It can happen all at commit, or gradually and concurrently as
a transaction proceeds in anticipation of a commit.
This is also why my transactions are not sensitive to cache associativity
and conflict evicts - because it does not use cache evicts or invalidates
to trigger transactions aborts.
4) Guard requests can be either synchronous or asynchronous.
The choice does not affect transaction validity, just performance.
Do you envision mixing and matching synch and asynch guards ?
Synchronous means the guard acts as a membar to following loads
and stores to the guarded bytes while the guard is negotiated.
Synchronous is intended to prevent you from too soon touching a
cache line and grabbing it away from its current owner-updater.
Asynchronous allows following ld/st to the guarded range to execute
concurrent while the guard request is pending, allowing the transfer
of a cache lines to overlap with guard negotiation.
Asynchronous could cause cache lines to ping-pong.
So can too little associativity in your data cache.
Yes but at least it doesn't cause transaction aborts, as RTM does.
The choice of synchronous or asynchronous is dependent on the algorithm
and can be different for different objects in a transaction.
So, yes to my question above.
Why not allow speculative branches to cover asynch accesses to guarded
lines ??
Because I didn't think of it. So far I have been mostly concerned with
how to make it work at all.
How much extra complexity and hardware does it take to essentially
queue all internal state changes throughout a core and all its caches
until speculative branches resolve?
And how many stalls will that queuing latency introduce?
Remember that it can't handle a cache miss for any load that is
currently in the shadow of an unresolved conditional branch as that
would allow coherence traffic to escape to the rest of a system.
And if two cores each have their own D$L1 but a shared D$L2,
then they can't even speculate a cache miss from L1 to L2.
Or speculatively prefetch alternate path instructions because
that could change the state of a cache line from exclusive to shared.
And since I have no faith that this will actually fix the problem
but rather just move it someplace else, I see this as asking me to
spend a lot of time and money on a fix that isn't really a fix for
something that isn't (so far) actually a real world problem.
Quadibloc wrote:
Given that what I have read claims that it is _not_ possible to
completely protect against Spectre and its variants without a
significant degradation of performance,
I argue that one can design a processor that loses no performance and remains Spectre (and most other current attack strategies) immune.
Anton Ertl wrote:
But is it true? Has everybody disabled JavaScript and Webassembly in
their browser and their PDF viewer, and disabled macros in their
Spreadsheet, Word processor and presentation program? And, looking at
NetSpectre, has everybody disconnected their computers from the
network?
There are many thousands of academic papers on speculative execution attacks.
I cannot find a single example of even an attempt in the real world.
On the other hand, there are many successful phishing attacks each day.
In other words, if you could somehow fix all speculative execution
vulnerabilities, it would have zero impact on the actual successful
security breaches.
Rowhammer is different - it's a memory corruption hardware error.
I'm not convinced that speculative execution leaks can all be fixed.
They are finding new mechanisms every day.
(uOp caches now need to be flushed on thread switch,
function unit or register port contention as a side channel,
speculative load forwarding attacks).
I think it will be an ongoing game of whack-a-mole.
Just get rid of the low hanging fruit - the retention of branch predictors
across security domain thread/process switches.
I was looking for simpler and cheaper solutions that are optional and
cover most situations.
- The branch predictor tables be separated by user and super mode,
and that OS's be advised to purge the user mode tables on thread switch.
Now that's an expensive approach in both silicon and performance: The
branch predictor, one of the biggest parts of a core would need to
become twice as big to get the same accuracy for a program that spends
almost all of its time in user mode, or almost all of its time in
system mode. And a user-level program would still be slowed down a
lot every time there is a thread switch.
Why would the branch predictions from a different thread/process
be helpful to your thread?
Retaining predictions across security domains *IS* the problem
because it allows an attacker to influence/control a victim.
The side channel leaks, while also important, are just a display mechanism.
But with no control over a victim an attacker can make no use of
side channels to display secrets.
Javascript is not a HW security domain.
It is the responsibility of the Javascript VM peddlers to ensure their
runtime environment is secure, as they appear to have done.
- I have an extensive set of conditional trap instructions intended for
bounds checks, asserts, etc. In OoO a load or store following a
bounds check might execute before the check exception was delivered.
These trap instructions could have an optional NoSpeculate flag
which again stalls the Dispatcher and essentially single steps just
that instruction until the test condition resolves.
Also very expensive for programs (programming languages) that use
these instructions; speculative load hardening (SLH), which prevents
Spectre v1 by turning the control dependence on the bounds check into
a data dependence, costs a factor of 2.3-2.5 (depending on the SLH
variant) on the SPEC programs, and I expect your bounds-check trap
instructions to be at least as expensive.
I'm not sure - it depends on the frequency of occurrence and
the latency between Dispatch and branch condition resolution.
Based on nothing, I'm assuming both to be small :-)
By contrast, a proper invisible-speculation fix would be much cheaper
in performance: papers on the memory-access part of invisible
speculation give slowdown factors up to 1.2 (with some papers giving
smaller slowdowns and even occasional speedups), and I think that the
memory-access part will have the biggest performance impact among the
changes necessary for a full-blown invisible-speculation fix.
- anton
How much extra complexity and hardware does it take to essentially
queue all internal state changes throughout a core and all its caches
until speculative branches resolve?
And how many stalls will that queuing latency introduce?
Remember that it can't handle a cache miss for any load that is
currently in the shadow of an unresolved conditional branch as that
would allow coherence traffic to escape to the rest of a system.
And if two cores each have their own D$L1 but a shared D$L2,
then they can't even speculate a cache miss from L1 to L2.
Or speculatively prefetch alternate path instructions because
that could change the state of a cache line from exclusive to shared.
And since I have no faith that this will actually fix the problem
but rather just move it someplace else, I see this as asking me to
spend a lot of time and money on a fix that isn't really a fix for
something that isn't (so far) actually a real world problem.
Quadibloc wrote:
On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Given that what I have read claims that it is _not_ possible to
completely protect against Spectre and its variants without a
significant degradation of performance,
I argue that one can design a processor that loses no performance
and remains Spectre (and most other current attack strategies) immune.
my favored solution is to divide the CPU into two
parts; one made immune to Spectre, which runs Internet-facing code
which might be menaced by it, and another in which mitigations are
not applied.
This solution, though, basically has not been considered, and that
is for an obvious reason: it is insecure. Once malicious software
has found a way through some other vulnerability to insinuate
itself into the "trusted" code of the computer, then it won't be
stopped from making use of Spectre to further its progress.
So it's not enough to put the Internet in a sandbox, we need new
and better ideas about how to put a secure wall around that sandbox
- to securely limit how it interacts with the rest of the computer.
John Savard
Quadibloc wrote:
On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Given that what I have read claims that it is _not_ possible to
completely protect against Spectre and its variants without a
significant degradation of performance,
I argue that one can design a processor that loses no performance
and remains Spectre (and most other current attack strategies) immune.
EricP wrote:
And this is where I think my ATX atomic transactions differs from ESM,
it is in how transactions are negotiated.
I also have Cache Coherence (CC) protocol managing the
shared/exclusive/owned
line state and transfer of whole lines into, out of, and between caches.
However I don't need a NAK in CC because line movement is never denied.
My ATX coherence messages are a completely different protocol from CC.
ATX messages deal with *permission* to read and write *individual bytes*
in cache lines, and knows nothing about the cache line state or where
it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
talks to other peer ATM's about access to guard ranges, it intercepts
stores from the LSQ to guarded byte ranges and tucks them aside,
and triggers local aborts if a transaction permission is denied.
Can you speculate on the latency for a core to talk to an ATM and receive
a response back ?? {{I am assuming all cores on a "chip" can use the same STM.}}
MitchAlsup wrote:
EricP wrote:
And this is where I think my ATX atomic transactions differs from ESM,
it is in how transactions are negotiated.
I also have Cache Coherence (CC) protocol managing the
shared/exclusive/owned
line state and transfer of whole lines into, out of, and between caches.
However I don't need a NAK in CC because line movement is never denied.
My ATX coherence messages are a completely different protocol from CC.
ATX messages deal with *permission* to read and write *individual bytes*
in cache lines, and knows nothing about the cache line state or where
it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
talk to other peer ATM's about access to guard ranges, it intercepts
stores from the LSQ to guarded byte ranges and tucks them aside,
and triggers local aborts if a transaction permission is denied.
Can you speculate on the latency for a core to talk to an ATM and receive
a response back ?? {{I am assuming all cores on a "chip" can use the same
STM.}}
It's difficult because there are so many options and possible optimizations.
In the base design (no Directory Controller optimization) a guard range
for bytes in a single line which does not have any shared bytes
(that is, no read-read shared or adjacent false shared bytes)
requires one request and one reply message to each peer core in a system.
For C cores that's 2*(C-1) msgs.
For unshared lines new guard requests in the same line require
no new messages.
If a line is shared between two different transactions, either both
read share the same bytes, or read-write or write-write adjacent bytes,
then each new guard range instructions requires a request and reply
BUT only with the cores that are sharing lines.
This would likely only happen if there were multiple objects that just
happen to reside in the same cache line and are accessed by two or more transactions.
The messages are sent from ATM to ATM over the coherence network.
Because they do not interact with caches they do not need to travel
down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
all those cache comms queues and go directly between the ATM and network. This eliminates all the queuing that cache coherence messages must transit.
When an ATX message arrives at a ATM, processing them requires no external information as it is all in the local lookup tables indexed by a CAM on
the line physical address. The CAM and tables likely require at least
2R2W ports, one for new local commands and one for inbound ATX messages. Processing ATX messages and sending a reply should be at least 1 per clock (and more table ports gets more messages per clock).
At the instruction level a core can have multiple guard requests outstanding at once, and can optionally overlap this with probing the local cache for hits and fetching missed line data either read_share or read_excl.
The cost that is most difficult to estimate is the collision rate.
The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
For N contenders each transaction can worst case fail N-1 times then succeed. So N contenders trying N times is worst case O(N^2) cost.
But it also depends on how far each is into the transaction and how much investment in that transaction when it collides, loses, and retries.
So best case, a transaction has perfect overlap for all its guard range requests plus fetching all its cache lines and no shared lines.
The latency could be about the time to fetch a single line.
Worst case, the guard ranges are serialized, line fetches are serialized, every line is shared, every transaction collides and retries max times. Because of FIFO ordering that number is bounded, but it might be big.
MitchAlsup wrote:
Can you speculate on the latency for a core to talk to an ATM and receive
a response back ?? {{I am assuming all cores on a "chip" can use the same
STM.}}
The cost that is most difficult to estimate is the collision rate.
The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
For N contenders each transaction can worst case fail N-1 times then
succeed.
So N contenders trying N times is worst case O(N^2) cost.
But it also depends on how far each is into the transaction and how much investment in that transaction when it collides, loses, and retries.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
- The branch predictor tables be separated by user and super mode,
and that OS's be advised to purge the user mode tables on thread switch.
Now that's an expensive approach in both silicon and performance: The
branch predictor, one of the biggest parts of a core, would need to
become twice as big to get the same accuracy for a program that spends
almost all of its time in user mode, or almost all of its time in
system mode. And a user-level program would still be slowed down a
lot every time there is a thread switch.
Why would the branch predictions from a different thread/process
be helpful to your thread?
The typical scenario where a thread can benefit from not flushing the
branch predictor is when there is a switch to a different thread for a
short while and then a switch back.
However, there are also scenarios where threads benefit from the
branch predictions collected in a different thread:
* if the thread is in the same process, and processes the same code or
the same data.
* if the thread is in a different process, and executes a common
library (e.g., libc), or works on the same data (e.g., in a pipe).
Retaining predictions across security domains *IS* the problem
because it allows an attacker to influence/control a victim.
The side channel leaks, while also important, are just a display mechanism.
But with no control over a victim an attacker can make no use of
side channels to display secrets.
Spectre can be fixed by either preventing the side channel from the speculative to the committed state (the approach I suggest), or by
preventing speculation (what the people who want to turn off
speculation suggest).
You suggest that erasing the branch predictor on thread switches is
just as good as preventing speculation. But it isn't. Even without training, as long as there is speculation and the side channel from
the speculative to the committed world, some data will be leaked. Ok,
you may be tempted to rely on your luck that it's not sensitive data,
but that does not appear to be a very trustworthy approach.
And that's especially the case because the attacker may be able to
help luck in the attacker's direction by passing data to the victim
process that results in training the branch predictor of the victim in
a specific way. E.g., a PDF document processed by a browser will
result in a lot of branches being taken in a certain way, which will
train the branch predictor in a certain way.
- I have an extensive set of conditional trap instructions intended for
bounds checks, asserts, etc. In OoO a load or store following a
bounds check might execute before the check exception was delivered.
These trap instructions could have an optional NoSpeculate flag
which again stalls the Dispatcher and essentially single steps just
that instruction until the test condition resolves.
Also very expensive for programs (programming languages) that use
these instructions; speculative load hardening (SLH), which prevents
Spectre v1 by turning the control dependence on the bounds check into
a data dependence, costs a factor of 2.3-2.5 (depending on the SLH
variant) on the SPEC programs, and I expect your bounds-check trap
instructions to be at least as expensive.
I'm not sure - it depends on the frequency of occurrence and
the latency between Dispatch and branch condition resolution.
Based on nothing, I'm assuming both to be small :-)
The main reason why OoO so vastly outperforms in-order for
general-purpose code is that execution does not have to wait for the
branches to be resolved; this is reflected in the size of the
outstanding branches, which is, e.g., 128 for the Golden Cove < https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/#gracemont-s-out-of-order-engine>
(look for "Branch Order Buffer"). If you throw in one NoSpeculate
branch, this reduces this number to 0 for this branch. And given the
number of loads in a program, you probably have to make most branches "NoSpeculate" to be safe. So you fall back to close to in-order
performance.
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
- The branch predictor tables be separated by user and super mode,
and that OS's be advised to purge the user mode tables on thread switch.
Now that's an expensive approach in both silicon and performance: The
branch predictor, one of the biggest parts of a core, would need to
become twice as big to get the same accuracy for a program that spends
almost all of its time in user mode, or almost all of its time in
system mode. And a user-level program would still be slowed down a
lot every time there is a thread switch.
That was the simplest form. A more sophisticated version could have
a 2 or 3 bit tag like an ASID on each branch predictor entry.
Tag 0 is for super mode, others are for the most recent 3 or 7 processes
run on this core. If a lookup hits on an entry with a different tag
then the entry is set to its uninitialized state.
Though I'm not sure how this could work with the return stack predictor. Probably have to keep 4 or 8 copies of it.
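A sketch of that lookup behaviour in C (table size, index hash, and field
names are invented; only the tag-mismatch reset is the point):

#include <stdint.h>

#define BP_ENTRIES 4096u

struct bp_entry {
    uint8_t ctr : 2;   /* 2-bit saturating taken counter */
    uint8_t tag : 3;   /* 0 = super mode, 1..7 = recent user processes */
};

static struct bp_entry bp[BP_ENTRIES];

/* Predict for the current security domain; an entry trained by a
   different domain is treated as uninitialized rather than reused. */
int bp_predict(uint64_t pc, unsigned domain_tag)
{
    struct bp_entry *e = &bp[(pc >> 2) & (BP_ENTRIES - 1)];
    if (e->tag != domain_tag) {
        e->tag = (uint8_t)domain_tag;
        e->ctr = 1;                   /* reset to weakly not-taken */
    }
    return e->ctr >= 2;               /* 1 = predict taken */
}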
Why would the branch predictions from a different thread/process
be helpful to your thread?
The typical scenario where a thread can benefit from not flushing the
branch predictor is when there is a switch to a different thread for a
short while and then a switch back.
However, there are also scenarios where threads benefit from the
branch predictions collected in a different thread:
* if the thread is in the same process, and processes the same code or
the same data.
Yes, only flush if the new thread is in a different process.
* if the thread is in a different process, and executes a common
library (e.g., libc), or works on the same data (e.g., in a pipe).
IMO this "optimization" is not worth the security hole.
Retaining predictions across security domains *IS* the problem
because it allows an attacker to influence/control a victim.
The side channel leaks, while also important, are just a display mechanism.
But with no control over a victim an attacker can make no use of
side channels to display secrets.
Spectre can be fixed by either preventing the side channel from the
speculative to the committed state (the approach I suggest), or by
preventing speculation (what the people who want to turn off
speculation suggest).
You suggest that erasing the branch predictor on thread switches is
just as good as preventing speculation. But it isn't. Even without
training, as long as there is speculation and the side channel from
the speculative to the committed world, some data will be leaked. Ok,
you may be tempted to rely on your luck that it's not sensitive data,
but that does not appear to be a very trustworthy approach.
I would also rely on the NoSpeculate branch hint to stall branches
that check array bounds.
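For what it's worth, here is a toy C model of the per-entry tag scheme
described above; the table size, tag width, and reset-to-weakly-not-taken
choice are placeholders of mine, not a worked-out design.

#include <stdint.h>
#include <stdbool.h>

#define PHT_ENTRIES 4096

struct pht_entry {
    unsigned tag : 3;   /* 0 = supervisor, 1..7 = recently run user processes */
    unsigned ctr : 2;   /* 2-bit saturating counter; >= 2 predicts taken */
};

static struct pht_entry pht[PHT_ENTRIES];

bool predict_taken(uintptr_t pc, unsigned my_tag)
{
    struct pht_entry *e = &pht[(pc >> 2) % PHT_ENTRIES];

    if (e->tag != my_tag) {
        /* hit on another domain's entry: discard its history so one
           process cannot pre-train branch behaviour for another */
        e->tag = my_tag;
        e->ctr = 1;     /* reset to "uninitialized" (weakly not-taken) */
    }
    return e->ctr >= 2;
}

Note that the 3-bit tag costs more storage per entry than the 2-bit
counter it protects, which is essentially the storage objection raised
below.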
EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
- The branch predictor tables be separated by user and super mode,
and that OS's be advised to purge the user mode tables on thread switch.
Now that's an expensive approach in both silicon and performance: The
branch predictor, one of the biggest parts of a core, would need to
become twice as big to get the same accuracy for a program that spends
almost all of its time in user mode, or almost all of its time in
system mode. And a user-level program would still be slowed down a
lot every time there is a thread switch.
That was the simplest form. A more sophisticated version could have
a 2 or 3 bit tag like an ASID on each branch predictor entry.
Which halves the number of entries you can store or worse. Remember
branch prediction states are un-tagged 2-bit saturating counters.
Tag 0 is for super mode, others are for the most recent 3 or 7 processes
run on this core. If a lookup hits on an entry with a different tag
then the entry is set to its uninitialized state.
Though I'm not sure how this could work with the return stack predictor.
Probably have to keep 4 or 8 copies of it.
Why would the branch predictions from a different thread/process
be helpful to your thread?
The typical scenario where a thread can benefit from not flushing the
branch predictor is when there is a switch to a different thread for a
short while and then a switch back.
However, there are also scenarios where threads benefit from the
branch predictions collected in a different thread:
* if the thread is in the same process, and processes the same code or
the same data.
Yes, only flush if the new thread is in a different process.
* if the thread is in a different process, and executes a common
library (e.g., libc), or works on the same data (e.g., in a pipe).
IMO this "optimization" is not worth the security hole.
Retaining predictions across security domains *IS* the problem
because it allows an attacker to influence/control a victim.
The side channel leaks, while also important, are just a display
mechanism.
But with no control over a victim an attacker can make no use of
side channels to display secrets.
Spectre can be fixed by either preventing the side channel from the
speculative to the committed state (the approach I suggest), or by
preventing speculation (what the people who want to turn off
speculation suggest).
You suggest that erasing the branch predictor on thread switches is
just as good as preventing speculation. But it isn't. Even without
training, as long as there is speculation and the side channel from
the speculative to the committed world, some data will be leaked. Ok,
you may be tempted to rely on your luck that it's not sensitive data,
but that does not appear to be a very trustworthy approach.
I would also rely on the NoSpeculate branch hint to stall branches
that check array bounds.
My 66000 PREDication does not use the branch prediction tables.
MitchAlsup wrote:
I would also rely on the NoSpeculate branch hint to stall branches
that check array bounds.
My 66000 PREDication does not use the branch prediction tables.
There has been research on interaction between conditional branches and predicated code, mostly from around the Itanium time. Basically, when you move some execution from conditional to predicated it changes the stats
for the branches that remain.
I have also seen mention of "predication predictors"; I think the idea was
to elide the predicted-false predicated instructions from the stream.
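As a concrete (if trivial) example of what moving work from a conditional
branch to predication means, in plain C rather than My 66000 PRED syntax:

int abs_branchy(int x)
{
    if (x < 0)          /* conditional branch: occupies a predictor entry
                           and can be mispredicted */
        x = -x;
    return x;
}

int abs_predicated(int x)
{
    /* typically lowered to a conditional move/select, i.e. predicated
       execution of the negation: no branch remains, so the branch
       predictor sees nothing, and the statistics of the branches that
       do remain in the program shift accordingly */
    return (x < 0) ? -x : x;
}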
MitchAlsup wrote:
EricP wrote:
And this is where I think my ATX atomic transactions differ from ESM:
it is in how transactions are negotiated.
I also have a Cache Coherence (CC) protocol managing the
shared/exclusive/owned line state and the transfer of whole lines into,
out of, and between caches. However I don't need a NAK in CC because
line movement is never denied.
My ATX coherence messages are a completely different protocol from CC.
ATX messages deal with *permission* to read and write *individual bytes*
in cache lines, and know nothing about the cache line state or where
it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
talk to other peer ATM's about access to guard ranges; it intercepts
stores from the LSQ to guarded byte ranges and tucks them aside,
and triggers local aborts if a transaction permission is denied.
The messages are sent from ATM to ATM over the coherence network.
Because they do not interact with caches they do not need to travel
down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
all those cache comms queues and go directly between the ATM and network. This eliminates all the queuing that cache coherence messages must transit.
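To make the store-intercept step concrete, here is a purely illustrative
software model; every structure and name below is my own invention for
exposition, not EricP's actual ATM design, and the peer-permission
negotiation is reduced to a single flag.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct guard_range {                  /* a write-guarded byte range */
    uintptr_t base;
    size_t    len;
};

struct atm_state {
    struct guard_range wr_guards[8];
    int  n_wr_guards;
    bool aborted;
};

/* Called for each store leaving the LSQ. Returns true if the store hits
   a guarded range and is buffered ("tucked aside") until COMMIT; a denied
   permission from a peer ATM instead triggers a local abort. */
bool atm_intercept_store(struct atm_state *atm, uintptr_t addr, size_t len,
                         bool peer_permission_granted)
{
    for (int i = 0; i < atm->n_wr_guards; i++) {
        struct guard_range *g = &atm->wr_guards[i];
        bool overlaps = addr < g->base + g->len && g->base < addr + len;
        if (overlaps) {
            if (!peer_permission_granted) {
                atm->aborted = true;  /* transaction loses, abort locally */
                return false;
            }
            return true;              /* buffer store data until COMMIT */
        }
    }
    return false;                     /* non-transactional store: goes to cache */
}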
EricP wrote:
MitchAlsup wrote:
EricP wrote:
And this is where I think my ATX atomic transactions differ from ESM:
it is in how transactions are negotiated.
I also have a Cache Coherence (CC) protocol managing the
shared/exclusive/owned line state and the transfer of whole lines into,
out of, and between caches. However I don't need a NAK in CC because
line movement is never denied.
My ATX coherence messages are a completely different protocol from CC.
ATX messages deal with *permission* to read and write *individual bytes*
in cache lines, and know nothing about the cache line state or where
it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
talk to other peer ATM's about access to guard ranges; it intercepts
stores from the LSQ to guarded byte ranges and tucks them aside,
and triggers local aborts if a transaction permission is denied.
The messages are sent from ATM to ATM over the coherence network.
Because they do not interact with caches they do not need to travel
down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
all those cache comms queues and go directly between the ATM and network.
This eliminates all the queuing that cache coherence messages must transit.
I had something of an epiphany last night that I thought I'd pass on.
The cache coherence protocol (CCP) supports communication between
cache coherency managers (CCM) which they currently use to negotiate
cache line ownership. The various L1, L2, L3 level managers pass CCP
messages up and down the hierarchy between themselves over comms queues,
and out over the inter-core network.
I had been thinking that my Atomic Transaction Manager (ATM) would be
located at the end of the Load Store Queue just before the cache and CCM.
The ATM can intercept LSQ commands to the cache and modify them,
to tuck aside a store in a transaction, or send commands to the
LSQ itself, such as to command the LSQ dependency matrix to stall
all memory ops to a particular cache line address.
Since the ATX messages do not directly interact with the local cache
they can bypass the level comms queues and flow directly between
the ATM and the coherence network.
       Core-0                 Core-1
LSQ<=>ATM<=>L1_CCM      L1_CCM<=>ATM<=>LSQ
       ^      |            |       ^
       |    L2_CCM      L2_CCM     |
       |      |            |       |
       |    L3_CCM      L3_CCM     |
       |      |            |       |
       v      v            v       v
      network<------------->network
Instead of thinking of the ATM as an independent unit attached to the LSQ,
what if I see it as a sub-unit of the LSQ? That would make the ATX messages
used for negotiating transactions just an example of a general concept of
*messages sent between LSQ's to coordinate their operation*,
just as CCP messages are sent between CCM's.
In short, messaging directly between the LSQ's managers on different
cores is potentially a whole new class of coherence and control.
So the question is: besides my atomic transactions,
what else might LSQ's want to say directly to each other?
And remember the LSQ has other resources, the ordered queue of LD/ST ops,
the address CAM's and op dependency matrix, pending store data, the TLB.
For example, TLB shootdown is one that's already available on some cores
but now could be seen as part of this general class of LSQ messaging.
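To give the idea some shape, a hypothetical sketch of what such a message
class could look like; only the ATX guard negotiation and TLB shootdown
kinds come from the discussion above, and every name and field here is
invented.

#include <stdint.h>

enum lsq_msg_kind {
    LSQ_ATX_GUARD_REQUEST,   /* ask peer ATMs for permission on a guard range */
    LSQ_ATX_GUARD_DENY,      /* denial -> remote transaction aborts */
    LSQ_TLB_SHOOTDOWN,       /* invalidate a translation in the remote TLB */
    LSQ_TLB_SHOOTDOWN_ACK,
};

struct lsq_msg {
    enum lsq_msg_kind kind;
    uint16_t  src_core;
    uint16_t  dst_core;
    uintptr_t addr;          /* guard range base, or virtual page address */
    uint32_t  len_or_asid;   /* guard range length, or ASID for shootdowns */
};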