• Hardware Transactional Memory approaches (was Superior architecture style)

    From EricP@21:1/5 to MitchAlsup on Sat Dec 30 13:46:33 2023
    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit
    from your original ASF proposal in 2010 that got me thinking about HTM.

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    your ESM is centralized. ATX negotiates the transaction guard range
    collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality in implementation is buffering the updates outside the
    cache so the transaction is not sensitive to cache associativity evictions
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger transaction member line set (I wanted 16 lines as a minimum).

  • From MitchAlsup@21:1/5 to EricP on Sat Dec 30 19:57:22 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit from your original ASF proposal in 2010 that got me thinking about HTM.

    Make that 2005±

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    No, I did not do it that way. I chose not to restore the registers, and
    made the compiler have to forget the now stale variables from the event.
    I did this mostly because my implementation does not count on branches so
    there may not be a checkpoint to assist in backup. Control transfer to the control point is not considered a branch--because it is automagic.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    I do allow this. AND this is why each participant has to announce itself. {{That is: there is not something that starts an event and another thing
    that ends an event and everything inside is participating in the event.}}

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    That and debugging (but perhaps that is what you meant.)

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    yours ESM is centralized. ATX negotiates the transaction guard range collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality on implementation is on buffering the updates outside the cache so the transaction is not sensitive to cache associativity evicts
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger transaction member line set (I wanted 16 lines as a minimum).

    16 lines but only 1 read set (start:end) and 1 write set (start:end) ??

  • From EricP@21:1/5 to MitchAlsup on Sat Dec 30 17:08:00 2023
    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit
    from your original ASF proposal in 2010 that got me thinking about HTM.

    Make that 2005±

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the
    starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    No, I did not do it that way. I chose not to restore the registers, and
    made the compiler have to forget the now stale variables from the event.
    I did this mostly because my implementation does not count on branches so there may not be a checkpoint to assist in backup. Control transfer to the control point is not considered a branch--because it is automagic.

    Ok, bad analogy but the result is the same: the registers are not restored.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    I do allow this. AND this is why each participant has to announce itself. {{That is: there is not something that starts an event and another thing
    that ends an event and everything inside is participating in the event.}}

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    That and debugging (but perhaps that is what you meant.)

    Debugging too but I was thinking that a transaction might want to use
    a register to hold an internal counter indicating how far it made it
    into the transaction when it aborted. That might help the abort code
    avoid a subsequent collision.

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    yours ESM is centralized. ATX negotiates the transaction guard range
    collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality on implementation is on buffering the updates outside the
    cache so the transaction is not sensitive to cache associativity evicts
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger
    transaction member line set (I wanted 16 lines as a minimum).

    16 lines but only 1 read set (start:end) and 1 write set (start:end) ??

    There can be as many guard byte ranges as you want, and they can straddle
    multiple cache line boundaries, as long as the total bytes under
    transaction guard protection fit within 16 cache lines (of 64 bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object,
    then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the transaction.

    The number 16 comes from wanting up to 8 smallish objects in a transaction,
    with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)
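
    To make the arithmetic concrete, here is a sketch in C of the kind of node
    being described; the field names, the extra key field, and 64-bit pointer
    sizes are my assumptions for illustration:

    #include <stdint.h>

    /* Hypothetical AVL node: three pointers plus a depth count.
       On a 64-bit machine this is ~40 bytes, but a heap allocator gives
       no 64-byte alignment guarantee, so the node may start near the end
       of one cache line and spill into the next -- hence the budget of
       two lines per object, 8 objects, 16 lines. */
    struct avl_node {
        struct avl_node *left;
        struct avl_node *right;
        struct avl_node *parent;
        int64_t          depth;
        int64_t          key;     /* assumed payload */
    };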

    That 16 sets the size of the CAM and number of cache line buffers holding
    pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges.
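
    A minimal sketch of that splitting step, under the stated assumptions of
    64-byte lines and a 16-line budget (the struct layout and function name
    here are mine, not part of the ATX definition):

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SIZE 64u
    #define MAX_LINES 16u

    struct line_range {
        uintptr_t line_addr;     /* line-aligned base address */
        uint8_t   first, last;   /* guarded bytes within that line */
    };

    /* Split one guard request (addr, count) into per-cache-line byte
       ranges. Returns the number of ranges, or -1 if this request alone
       would exceed the 16-line budget. */
    static int split_guard(uintptr_t addr, size_t count,
                           struct line_range out[MAX_LINES])
    {
        int n = 0;
        uintptr_t end = addr + count;              /* one past the last byte */
        while (addr < end) {
            uintptr_t line = addr & ~(uintptr_t)(LINE_SIZE - 1);
            uintptr_t stop = (line + LINE_SIZE < end) ? line + LINE_SIZE : end;
            if (n == (int)MAX_LINES)
                return -1;                         /* would overflow the CAM */
            out[n].line_addr = line;
            out[n].first = (uint8_t)(addr - line);
            out[n].last  = (uint8_t)(stop - line - 1);
            n++;
            addr = stop;
        }
        return n;
    }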

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY
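
    To show how these might be strung together, here is a hedged sketch in C
    using hypothetical intrinsic wrappers for the instructions above. The
    setjmp-style return convention for ATSTART, the wrapper names, and the
    zero option values are illustration-only assumptions, not part of the
    proposal:

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed C bindings: atstart() returns 0 when the transaction starts
       and returns the abort status when control re-arrives at the abort
       point (the real instruction takes an abort_offset). */
    extern long atstart(unsigned options);
    extern void atgrdr(const void *addr, size_t bytes, unsigned options);
    extern void atgrdw(const void *addr, size_t bytes, unsigned options);
    extern long atcommit(void);

    struct node { struct node *prev, *next; long payload; };

    /* Unlink *n from a doubly linked list under ATX protection. */
    long list_unlink(struct node *n)
    {
        for (;;) {
            long status = atstart(0);
            if (status == 0) {                 /* transaction path */
                /* Read-guard the node, then its neighbours. */
                atgrdr(n, sizeof *n, 0);
                atgrdr(n->prev, sizeof *n, 0);
                atgrdr(n->next, sizeof *n, 0);
                /* Upgrade the neighbours to write guards and relink. */
                atgrdw(n->prev, sizeof *n, 0);
                atgrdw(n->next, sizeof *n, 0);
                n->prev->next = n->next;
                n->next->prev = n->prev;
                return atcommit();             /* publish both stores */
            }
            /* Abort path: registers are preserved, so 'status' can carry
               the reason for the loss; back off and retry. */
        }
    }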

  • From MitchAlsup@21:1/5 to EricP on Sun Dec 31 00:27:52 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit
    from your original ASF proposal in 2010 that got me thinking about HTM.

    Make that 2005±

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the
    starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    No, I did not do it that way. I chose not to restore the registers, and
    made the compiler have to forget the now stale variables from the event.
    I did this mostly because my implementation does not count on branches so
    there may not be a checkpoint to assist in backup. Control transfer to the
    control point is not considered a branch--because it is automagic.

    Ok, bad analogy but the result is the same: the registers are not restored.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    I do allow this. AND this is why each participant has to announce itself.
    {{That is: there is not something that starts an event and another thing
    that ends an event and everything inside is participating in the event.}}

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    That and debugging (but perhaps that is what you meant.)

    Debugging too but I was thinking that a transaction might want to use
    a register to hold an internal counter indicating how far it made it
    into the transaction when it aborted. That might help the abort code
    avoid a subsequent collision.

    You could use a counter, or you could dump a bunch of intermediate state
    into a buffer and print it so you can see the instantaneous state of the process while the event transpired.

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    your ESM is centralized. ATX negotiates the transaction guard range
    collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality in implementation is buffering the updates outside the
    cache so the transaction is not sensitive to cache associativity evictions
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger
    transaction member line set (I wanted 16 lines as a minimum).

    16 lines but only 1 read set (start:end) and 1 write set (start:end) ??

    There can be as many guard byte ranges as you want and can straddle
    multiple cache line boundaries as long as the total bytes under
    transaction guard protection is 16 cache lines (of 64-bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object, then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the transaction.

    The number 16 comes from wanting up to 8 smallish objects in a transaction, with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)

    That 16 sets the size of the CAM and number of cache line buffers holding pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY


    I see, you are using an instruction to mark each state transition--
    whereas I use edge-detection (side effect) of a standard instruction.

    Does this not necessarily increase the minimum path length ??

  • From EricP@21:1/5 to MitchAlsup on Sun Dec 31 11:26:23 2023
    MitchAlsup wrote:
    EricP wrote:

    There can be as many guard byte ranges as you want and can straddle
    multiple cache line boundaries as long as the total bytes under
    transaction guard protection is 16 cache lines (of 64-bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object,
    then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the
    transaction.

    The number 16 comes from wanting up to 8 smallish objects in a
    transaction,
    with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)

    That 16 sets the size of the CAM and number of cache line buffers holding
    pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY


    I see, you are using an instruction to mark each state transition--
    whereas I use edge-detection (side effect) of a standard instruction.

    Does this not necessarily increase the minimum path length ??

    Not sure what you mean by minimum path length.
    The number of instructions probably has the least effect on performance.

    Often transactions are just moving memory locations about with little
    or no calculation, so the majority of performance effects will be due
    to coherence messaging to negotiate guards and move cache lines about.
    In some cases this can be overlapped, in others not.

    I was able to throw together a simulator to test the validity of the
    guard protocol handshake and it does work. But that was in isolation.
    To test ATX performance would require a full multi-core OoO simulator
    with a Load Store Queue, as my transaction manager interacts with the LSQ,
    and cache coherence message simulation, and I don't have that.

    Some issues I see that could affect transaction performance:

    1) I have validated the ATX protocol and it has one important optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence messages
    will be broadcast to all nodes in a system, and all will reply Grant/Deny.

    This is intentional as it keeps the ATX coherence messages completely
    separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    Since a directory controller knows which nodes have copies of lines
    in what shared/exclusive state it might be able to optimize away much
    of the ATX messaging. However that would require integrating ATX protocol
    with the directory controller to also track guard requests for lines.

    2) Since the Atomic Transaction Manager (ATM) knows the guard range
    and whether it is for read or write, it can optimize cache transfers to
    upgrade a read_share cache line to read_exclusive, and eliminate the
    transitory line share state that occurs for some shared lines.

    That would eliminate a whole set of handshake messages that may now occur
    to transfer a line in a shared state, then another to make it exclusive.

    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO altogether so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    Possibly the amount of OoO allowed could be an option on the ATSTART instruction so the user can decide based on their algorithm.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from touching a cache line
    too soon and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute
    concurrently while the guard request is pending, allowing the transfer
    of cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.
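
    Continuing the hypothetical C wrappers from the earlier sketch, one
    plausible mix for a tree walk (the ATX_SYNC/ATX_ASYNC option encodings
    are assumed; the real option bits are not specified here):

    #include <stddef.h>

    enum { ATX_SYNC = 0, ATX_ASYNC = 1 };            /* assumed encodings */
    extern void atgrdr(const void *addr, size_t bytes, unsigned options);

    struct tree_node { struct tree_node *left, *right; long key; };
    struct tree      { struct tree_node *root; };

    static void guard_for_lookup(struct tree *t, struct tree_node *next_hop)
    {
        /* Hot, contended root pointer: synchronous, so following loads wait
           for the grant and we do not yank the line away from its current
           owner-updater too soon. */
        atgrdr(&t->root, sizeof t->root, ATX_SYNC);

        /* A node we will only inspect a little later: asynchronous, letting
           the line transfer overlap with the guard negotiation. */
        atgrdr(next_hop, sizeof *next_hop, ATX_ASYNC);
    }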

  • From MitchAlsup@21:1/5 to EricP on Sun Dec 31 17:10:04 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    There can be as many guard byte ranges as you want and can straddle
    multiple cache line boundaries as long as the total bytes under
    transaction guard protection is 16 cache lines (of 64-bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object,
    then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the
    transaction.

    The number 16 comes from wanting up to 8 smallish objects in a
    transaction,
    with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)

    That 16 sets the size of the CAM and number of cache line buffers holding
    pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY


    I see, you are using an instruction to mark each state transition--
    whereas I use edge-detection (side effect) of a standard instruction.

    Does this not necessarily increase the minimum path length ??

    Not sure what you mean by minimum path length.
    The number of instructions probably has the least effect on performance.

    Often transactions are just moving memory locations about with little
    or no calculations, so the majority of performance effects will be due
    to coherence messaging, to negotiate guards and move cache lines about.
    In some cases this can be overlapped, others not.

    Another cause of delay is the conversion between causal consistency outside
    of an event and sequential consistency within an event. {Should you
    choose to do this}

    I was able to throw together a simulator to test the validity of the
    guard protocol handshake and it does work. But that was in isolation.
    To test ATX performance would require a full multi-core OoO simulator
    with Load Store Queue, as my transaction manager interacts with LSQ,
    and cache coherence message simulation, and I don't have that.

    Some issues I see that could affect transaction performance:

    1) I have validated the ATX protocol and it has one important optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence messages
    will be broadcast to all nodes in a system, and all will reply Grant/Deny.

    This is intentional as it keeps the ATX coherence messages completely separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    All I added to the coherence protocol is NAK and its use is restricted
    to ATOMIC events; then later I added priority compare so that higher
    priority events are not NAKed by lower priority events. Thus, the
    protocol is the same except for NAKs.

    Since a directory controller knows which nodes have copies of lines
    in what shared/exclusive state it might be able to optimize away much
    of the ATX messaging. However that would require integrating ATX protocol with the directory controller to also track guard requests for lines.

    2) Since the Atomic Transaction Manager (ATM) knows the guard range
    and whether for read or write, it can optimize cache transfers to
    upgrade read_share cache line to a read_exclusive, and eliminate the transitory line share state that occurs for some shared lines.

    Yes, I do some of this too, with a change in flavor: the first pass
    through an ATOMIC event the locked LDs are sent out with Intent to Modify.
    Should interference occur and fail the event, the subsequent locked LDs
    are sent out without intent, and when the data does arrive a coherent
    invalidate is sent out. This mimics test_and_test_and_set() without doing
    any more than test_and_set().
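
    For reference, the software pattern being mimicked looks like this in
    C11 (generic background code, not ESM itself):

    #include <stdatomic.h>

    static atomic_int lock = 0;            /* 0 = free, 1 = held */

    /* test_and_test_and_set: spin on a plain load (the line can stay shared
       among all waiters) and only attempt the exchange -- which needs the
       line exclusive -- once the lock looks free. A bare test_and_set loop
       would bounce the line between waiters on every iteration. */
    static void tatas_acquire(void)
    {
        for (;;) {
            while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
                ;                                          /* read-only spin */
            if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
                return;                                    /* won the lock */
        }
    }

    static void tatas_release(void)
    {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }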

    That would eliminate a whole set of handshake messages that may now occur
    to transfer a line in a shared state, then another to make it exclusive.

    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO all together so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    If you allow speculative branches in an event, you will need a way to
    request a cache line and then not use it if that request has become
    OoO with respect to the sequentially consistent memory order produced
    by this processor. I figured out how to solve this circa 1991 so I
    don't consider it a stumbling block.

    Possibly the amount of OoO allowed could be an option on the ATSTART instruction so the user can decide based on their algorithm.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Do you envision mixing and matching synch and asynch guards ?

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from too soon touching a
    cache line and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute concurrent while the guard request is pending, allowing the transfer
    of a cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    So can too little associativity in your data cache.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.

    So, yes to my question above.

    Why not allow speculative branches to cover asynch accesses to guarded
    lines ??

  • From Anton Ertl@21:1/5 to EricP on Sun Dec 31 18:02:36 2023
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO all together so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    I am not sure what scenario you have in mind, but it seems to involve
    requesting a cache line of a different core. Note that fixing Spectre
    requires that while such a request is speculative, it must not change
    the state of a remote cache line; otherwise this would constitute a
    side channel out of the speculative state. So you will certainly not
    see cache ping-ponging from properly implemented speculative accesses,
    whether inside a transaction or not.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Anton Ertl on Sun Dec 31 14:54:47 2023
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO all together so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    I am not sure what scenario you have in mind, but it seems to involve
    requesting a cache line of a different core. Note that fixing Spectre
    requires that while such a request is speculative, it must not change
    the state of a remote cache line; otherwise this would constitute a
    side channel out of the speculative state. So you will certainly not
    see cache ping-ponging from properly implemented speculative accesses,
    whether inside a transaction or not.

    - anton

    I'm noting that speculation and transactions may interact badly. Implementations may need some mechanism to limit it,
    but that can affect concurrency and performance.

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    If one wants to play in that market, yes full Spectre protection could be
    a sales feature for a cpu model. But that is an expensive overkill for
    most situations.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables are separated by user and super mode,
    and OSes are advised to purge the user-mode tables on thread switch.

    - Adding a conditional branch hint NoSpeculate which basically causes the
    front end Dispatcher to stall and single step that one branch until the test
    condition resolves. Stalling at Dispatch allows the front end to continue
    to fill with the predicted code path but does not allow any of it to execute.

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    If I had the NoSpeculate branch hints then transaction users would
    be advised to use them. Otherwise the transaction mechanism would
    have to automatically shut off all speculation.

  • From Anton Ertl@21:1/5 to EricP on Mon Jan 1 08:05:52 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I'm noting that speculation and transactions may interact badly.
    Implementations may need some mechanism to limit it,
    but that can affect concurrency and performance.

    Even if you want to build a Spectre-vulnerable CPU that actually
    changes cache states during speculation, I don't expect much effect on performance, because the branch predictor is usually right. And if it
    isn't, and the speculative access actually interferes with a
    transaction on another core, it will learn that that path was wrong
    and will stop the wrong speculation after one or two tries.

    OTOH, if you want to build a Spectre-immune CPU by delaying the state
    change until the memory access is committed, that won't hurt the
    performance much, either, because it's rare to need state changes, and
    because the waiting time is typically around 20 cycles or so, which is
    small compared to the time needed to access and change the state of a
    remote cache line.

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    This widespread belief is what has caused CPU manufacturers to not
    work on fixing Spectre.

    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the
    network?

    But that is an expensive overkill for
    most situations.

    It would not be expensive (see below), and it's not overkill.

    While software vulnerabilities may be plenty and relatively easy to
    use, they can be fixed at any time, or the intended victim of an
    attack may use a different software. Meanwhile, hardware
    vulnerabilities like Spectre and Rowhammer are always there while the
    hardware is not replaced with fixed hardware (and the inaction of
    hardware manufacturers ensures that no such replacement exists, and
    even when it exists, it will take many years until most of the
    hardware is replaced), so they are very attractive to attackers.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    While being expensive, this approach does not make the CPU
    Spectre-immune. A thread can be attacked from within itself (e.g.,
    from a JavaScript program running in the same thread), and the
    attacker can train the branch predictor to do the attacker's bidding
    by passing the appropriate data to system calls or as input data to a user-level processing program.

    - Adding a conditional branch hint NoSpeculate which basically causes the
    front end Dispatcher to stall and single step that one branch until the test
    condition resolves. Stalling at Dispatch allows the front end to continue
    to fill with the predicted code path but does not allow any to execute.

    Yes, if you disable speculation by using that for all branches, it
    will help against Spectre. But the slowdown will be huge, slowing the
    CPU down almost to in-order levels (e.g., a Cortex-A55 or Bonnell).

    OTOH, if you want to apply this slowdown selectively, the question is
    where, and whether the remaining speculation leaves the window open
    for an attacker to use Spectre. The Linux kernel is trying to use
    selective software mitigation, and of course a remaining hole was
    found (and used by a security researcher for demonstrating not just
    that hole, but something else; that's how I learned of it), so this
    approach is anything but watertight.

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence, costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap
    instructions to be at least as expensive.
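
    As a generic illustration of the transform being priced here (a sketch of
    the SLH idea, not the exact code any particular compiler emits): the
    bounds check is recomputed as data and used to mask the index, so a
    mispredicted speculative path can only load a benign address.

    #include <stddef.h>
    #include <stdint.h>

    /* Unhardened: under branch misprediction the load may execute
       speculatively with an out-of-bounds i, leaving a cache footprint. */
    uint8_t load_checked(const uint8_t *a, size_t len, size_t i)
    {
        if (i < len)
            return a[i];
        return 0;
    }

    /* SLH-style sketch: the comparison result also feeds a mask, turning
       the control dependence into a data dependence; if the branch was
       mispredicted the mask is zero and only a[0] can be touched. */
    uint8_t load_hardened(const uint8_t *a, size_t len, size_t i)
    {
        if (i < len) {
            size_t mask = (size_t)0 - (size_t)(i < len); /* all-ones if in bounds */
            return a[i & mask];
        }
        return 0;
    }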

    By contrast, a proper invisible-speculation fix would be much cheaper
    in performance: papers on the memory access part of invisible
    speculation give slowdown factors of up to 1.2 (with some papers giving
    smaller slowdowns and even occasional speedups), and I think that the
    memory-access part will have the biggest performance impact among the
    changes necessary for a full-blown invisible-speculation fix.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Quadibloc@21:1/5 to Anton Ertl on Mon Jan 1 14:19:19 2024
    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    This widespread belief is what has caused CPU manufacturers to not work
    on fixing Spectre.

    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the network?

    Here, you are completely correct.

    Spectre, and its cousins like Rowhammer, are hardware vulnerabilities
    that can't be eradicated only in software, which makes them very serious.

    In the early days of computing, computer viruses were something you
    could get by running and installing software, and so it was relatively
    easy to practice hygienic computing.

    Today, though, our web browsers, E-mail clients, word processors, and
    numerous other programs invisibly execute code. As well, the common
    buffer overflow vulnerability has allowed computers to be taken over
    through Internet-facing applications that do not execute externally
    supplied code, and which, therefore, are not such as to normally be
    suspected of being dangerous.

    Given that what I have read claims that it is _not_ possible to completely
    protect against Spectre and its variants without a significant degradation
    of performance, my favored solution is to divide the CPU into two parts:
    one made immune to Spectre, which runs Internet-facing code that might be
    menaced by it, and another in which mitigations are not applied.

    This solution, though, basically has not been considered, and that is for
    an obvious reason: it is insecure. Once malicious software has found a way through some other vulnerability to insinuate itself into the "trusted"
    code of the computer, then it won't be stopped from making use of Spectre
    to further its progress.

    So it's not enough to put the Internet in a sandbox, we need new and better ideas about how to put a secure wall around that sandbox - to securely
    limit how it interacts with the rest of the computer.

    John Savard

  • From Anton Ertl@21:1/5 to Quadibloc on Mon Jan 1 15:15:27 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    Given that what I have read claims that it is _not_ possible to completely
    protect against Spectre and its variants without a significant degradation
    of performance,

    Depends on what you mean by "protect" and "significant". The work on
    invisible speculation (a proper fix) reports slowdowns like (for the
    memory access component, which appears to have the biggest influence
    on performance) a factor 1.2 for a slower variant, or, for a faster
    variant, IIRC between a slowdown by 1.06 and a speedup of 1.04. By
    contrast, a software mitigation like speculative load hardening
    produces a slowdown factor 2.3-2.5, and that protects only against
    Spectre v1.

    my favored solution is to divide the CPU into two parts;
    one made immune to Spectre, which runs Internet-facing code which might be
    menaced by it, and another in which mitigations are not applied.

    You can do that now, by buying an RK3588-based SBC (e.g., a Radxa
    Rock5B), which has 4 Cortex-A76 (OoO) and 4 Cortex-A55 (in-order)
    cores, and then run something like QubesOS on that, and use only the
    A55 for the Internet-facing stuff. Except that QubesOS for now only
    works on AMD64.

    Note that the A55 is more than three times slower than the A76
    (numbers are times in seconds):

    - Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105
    - Rock 5B (2257MHz A76) Debian 11 (texlive-latex-recommended) 0.638

    However, differentiating between what is "internet-facing" and what is
    not is something you don't want to task a layman with. It's too easy
    to falsely classify something as "not internet-facing" when in reality
    it's a file that came from the 'net and might have been tampered with
    by an attacker.

    Anyway, we have not seen a surge in such systems. In particular, the
    Raspi4 and Raspi5, which probably could have gone for such a
    big.LITTLE design, with the LITTLE component being Spectre-immune,
    both went for big-only designs. And on the OS side, we have not seen
    attempts at isolation beyond QubesOS, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Anton Ertl on Mon Jan 1 12:43:45 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I'm noting that speculation and transactions may interact badly.
    Implementations may need some mechanism to limit it,
    but that can affect concurrency and performance.

    Even if you want to build a Spectre-vulnerable CPU that actually
    changes cache states during speculation, I don't expect much effect on performance, because the branch predictor is usually right. And if it
    isn't, and the speculative access actually interferes with a
    transaction on another core, it will learn that that path was wrong
    and will stop the wrong speculation after one or two tries.

    OTOH, if you want to build a Spectre-immune CPU by delaying the state
    change until the memory access is committed, that won't hurt the
    performance much, either, because it's rare to need state changes, and because the waiting time is typically around 20 cycles or so, which is
    small compared to the time needed to access and change the state of a
    remote cache line.

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    This widespread belief is what has caused CPU manufacturers to not
    work on fixing Spectre.

    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the
    network?

    There are many thousands of academic papers on speculative execution attacks.
    I cannot find a single example of even an attempt in the real world.

    On the other hand, there are many successful phishing attacks each day.

    In other words, if you could somehow fix all speculative execution vulnerabilities, it would have zero impact on the actual successful
    security breaches.

    But that is an expensive overkill for
    most situations.

    It would not be expensive (see below), and it's not overkill.

    While software vulnerabilities may be plenty and relatively easy to
    use, they can be fixed at any time, or the intended victim of an
    attack may use a different software. Meanwhile, hardware
    vulnerabilities like Spectre and Rowhammer are always there while the hardware is not replaced with fixed hardware (and the inaction of
    hardware manufacturers ensures that no such replacement exists, and
    even when it exists, it will take many years until most of the
    hardware is replaced), so they are very attractive to attackers.

    Rowhammer is different - it's a memory corruption hardware error.

    I'm not convinced that speculative execution leaks can all be fixed.
    They are finding new mechanisms every day.
    (uOp caches now need to be flushed on thread switch,
    function unit or register port contention as a side channel,
    speculative load forwarding attacks).

    I think it will be an ongoing game of whack-a-mole.

    Just get rid of the low hanging fruit - the retention of branch predictors across security domain thread/process switches.

    And as far as I can tell these S.E. attacks are not attractive at all,
    in that I can find no reports of them at all in the real world.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism.
    But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    While being expensive, this approach does not make the CPU
    Spectre-immune. A thread can be attacked from within itself (e.g.,
    from a JavaScript program running in the same thread), and the
    attacker can train the branch predictor to do the attacker's bidding
    by passing the appropriate data to system calls or as input data to a user-level processing program.

    Javascript is not a HW security domain.
    It is the responsibility of the Javascript VM peddlers to ensure their
    runtime environment is secure, as they appear to have done.

    - Adding a conditional branch hint NoSpeculate which basically causes the
    front end Dispatcher to stall and single step that one branch until the test condition resolves. Stalling at Dispatch allows the front end to continue
    to fill with the predicted code path but does not allow any to execute.
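
    For concreteness, here is a rough sketch of where such a hint would sit;
    the NOSPEC marker is purely illustrative (there is no such C construct),
    it just labels the one branch the Dispatcher would single step:

        #include <stddef.h>

        /* Illustrative only: "NOSPEC" stands in for the proposed NoSpeculate
           branch hint.  The hinted branch stalls at Dispatch until its
           condition resolves, so the dependent load below cannot issue
           speculatively, while every other branch keeps speculating. */
        int checked_read(const int *array, size_t len, size_t i)
        {
            if (i < len) {          /* NOSPEC: nothing past this branch executes
                                       until the compare has resolved */
                return array[i];    /* no longer usable as a Spectre v1 gadget */
            }
            return -1;
        }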

    Yes, if you disable speculation by using that for all branches, it
    will help against Spectre. But the slowdown will be huge, slowing the
    CPU down almost to in-order levels (e.g., a Cortex-A55 or Bonnell).

    OTOH, if you want to apply this slowdown selectively, the question is
    where, and if the remaining speculation does not leave the window open
    for an attacker to use Spectre. The Linux kernel is trying to use
    selective software mitigation, and of course a remaining hole was
    found (and used by a security researcher for demonstrating not just
    that hole, but something else, that's how I learned of it), so this
    approach is anything but watertight.

    The question is how many IF statements are guarding leak vulnerable
    code pathways *when the attackers branch predictor controls are removed*?
    Can these IF's be automatically identified?

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap instructions to be at least as expensive.
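
    For reference, the SLH transformation looks roughly like this at the
    source level (a hand sketch only: real SLH is a compiler pass, e.g.
    LLVM's, precisely so the recomputed compare and mask cannot be
    optimized away):

        #include <stddef.h>
        #include <stdint.h>

        uint8_t probe[256 * 64];   /* stand-in for whatever the gadget touches */

        /* The bounds check is kept, but the index is also ANDed with a mask
           that is all-ones only when the check really passed; on the
           misspeculated path the mask is 0, so the load reads index 0
           instead of attacker-chosen memory. */
        uint8_t slh_read(const uint8_t *array, size_t len, size_t i)
        {
            if (i < len) {
                uintptr_t mask = -(uintptr_t)(i < len);  /* ~0 if in bounds, else 0 */
                size_t safe_i = i & mask;                /* 0 under misspeculation */
                return probe[array[safe_i] * 64];        /* would-be Spectre v1 gadget */
            }
            return 0;
        }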

    I'm not sure - it depends on the frequency of occurrence and
    the latency between Dispatch and branch condition resolution.

    Based on nothing, I'm assuming both to be small :-)

    By contrast, a proper invisible-speculation fix would be much cheaper
    in performance (papers on the memory access part of invisible
    speculation give slowdown factors up to 1.2 (with some papers giving
    smaller slowdowns and even occasional speedups), and I think that the memory-access part will have the biggest performance impact among the
    changes necessary for a full-blown invisible-speculation fix.

    - anton

    How much extra complexity and hardware does it take to essentially
    queue all internal state changes throughout a core and all its caches
    until speculative branches resolve?

    And how many stalls will that queuing latency introduce?
    Remember that it can't handle a cache miss for any load that is
    currently in the shadow of an unresolved conditional branch as that
    would allow coherence traffic to escape to the rest of a system.
    And if two cores each have their own D$L1 but a shared D$L2,
    then they can't even speculate a cache miss from L1 to L2.
    Or speculatively prefetch alternate path instructions because
    that could change the state of a cache line from exclusive to shared.

    And since I have no faith that this will actually fix the problem
    but rather just move it someplace else, I see this as asking me to
    spend a lot of time and money on a fix that isn't really a fix for
    something that isn't (so far) actually a real world problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Jan 1 19:39:07 2024
    Quadibloc wrote:

    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    Given that what I have read claims that it is _not_ possible to completely protect against Spectre and its variants without a significant degradation
    of performance,

    I argue that one can design a processor that loses no performance and remains Spectré (and most other current attack strategies) immune.

    my favored solution is to divide the CPU into two parts;
    one made immune to Spectre, which runs Internet-facing code which might be menaced by it, and another in which mitigations are not applied.

    This solution, though, basically has not been considered, and that is for
    an obvious reason: it is insecure. Once malicious software has found a way through some other vulnerability to insinuate itself into the "trusted"
    code of the computer, then it won't be stopped from making use of Spectre
    to further its progress.

    So it's not enough to put the Internet in a sandbox, we need new and better ideas about how to put a secure wall around that sandbox - to securely
    limit how it interacts with the rest of the computer.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Mon Jan 1 14:17:36 2024
    MitchAlsup wrote:
    EricP wrote:

    1) I have validated the ATX protocol and it has one important
    optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence
    messages
    will be broadcast to all nodes in a system, and all will Grant/Deny
    reply.

    This is intentional as it keeps the ATX coherence messages completely
    separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    All I added to the coherence protocol is NAK and its use is restricted
    to ATOMIC events; then later I added priority compare so that higher
    priority events are not NAKed by lower priority events. Thus, the
    protocol is the same except for NAKs.

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the shared/exclusive/owned line state and transfer of whole lines into, out of, and between caches. However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talks to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Only at commit does ATM send guarded line updates to its local cache,
    which does its normal thing to check if line is present in an exclusive/modified/owned state, and if not does a read_exclusive.
    And the cache controller knows nothing about transactions,
    it just sees a burst of updates to local cache.

    When cache lines move is a matter of performance optimization.
    It can happen all at commit, or gradually and concurrently as
    a transaction proceeds in anticipation of a commit.

    This is also why my transactions are not sensitive to cache associativity
    and conflict evicts - because it does not use cache evicts or invalidates
    to trigger transaction aborts.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Do you envision mixing and matching synch and asynch guards ?

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from too soon touching a
    cache line and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute
    concurrently while the guard request is pending, allowing the transfer
    of cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    So can too little associativity in your data cache.

    Yes but at least it doesn't cause transaction aborts, as RTM does.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.
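
    For example (atx_guard()/atx_commit() below are made-up stand-ins for
    the ATX instructions, and the stubs exist only so the sketch compiles),
    a stack push might guard the contended head pointer synchronously and
    the still-private node asynchronously:

        #include <stddef.h>
        #include <stdbool.h>

        /* Hypothetical intrinsics for the guard instructions described above;
           these stubs only make the sketch compile - the real work would be
           done by the Atomic Transaction Manager in hardware. */
        enum { ATX_READ = 1, ATX_WRITE = 2, ATX_SYNC = 4, ATX_ASYNC = 8 };
        static inline void atx_guard(volatile void *a, size_t n, int flags)
        { (void)a; (void)n; (void)flags; }
        static inline bool atx_commit(void) { return true; }

        typedef struct node { struct node *next; int val; } node_t;

        /* The contended head pointer gets a synchronous write guard (acts as
           a membar: don't touch the line until the grant arrives), while the
           node we are inserting gets an asynchronous guard so its line fetch
           can overlap with the guard negotiation. */
        bool push(node_t *volatile *head, node_t *n)
        {
            atx_guard((volatile void *)head, sizeof *head, ATX_WRITE | ATX_SYNC);
            atx_guard(&n->next, sizeof n->next, ATX_WRITE | ATX_ASYNC);
            n->next = *head;
            *head = n;
            return atx_commit();
        }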

    So, yes to my question above.

    Why not allow speculative branches to cover asynch accesses to guarded
    lines ??

    Because I didn't think of it. So far I have been mostly concerned with
    how to make it work at all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Jan 1 19:51:09 2024
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    1) I have validated the ATX protocol and it has one important
    optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence
    messages
    will be broadcast to all nodes in a system, and all will Grant/Deny
    reply.

    This is intentional as it keeps the ATX coherence messages completely
    separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    All I added to the coherence protocol is NAK and its use is restricted
    to ATOMIC events; then later I added priority compare so that higher
    priority events are not NAKed by lower priority events. Thus, the
    protocol is the same except for NAKs.

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the shared/exclusive/owned line state and transfer of whole lines into, out of, and between caches. However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to talk to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same STM.}}

    Only at commit does ATM send guarded line updates to its local cache,
    which does its normal thing to check if line is present in an exclusive/modified/owned state, and if not does a read_exclusive.
    And the cache controller knows nothing about transactions,
    it just sees a burst of updates to local cache.

    When cache lines move is a matter of performance optimization.
    It can happen all at commit, or gradually and concurrently as
    a transaction proceeds in anticipation of a commit.

    This is also why my transactions are not sensitive to cache associativity
    and conflict evicts - because it does not use cache evicts or invalidates
    to trigger transaction aborts.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Do you envision mixing and matching synch and asynch guards ?

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from too soon touching a
    cache line and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute
    concurrently while the guard request is pending, allowing the transfer
    of cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    So can too little associativity in your data cache.

    Yes but at least it doesn't cause transaction aborts, as RTM does.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.

    So, yes to my question above.

    Why not allow speculative branches to cover asynch accesses to guarded
    lines ??

    Because I didn't think of it. So far I have been mostly concerned with
    how to make it work at all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Jan 1 20:10:11 2024
    EricP wrote:

    How much extra complexity and hardware does it take to essentially
    queue all internal state changes throughout a core and all its caches
    until speculative branches resolve?

    For an Opteron-scale processor:: Not That Much--you end up using
    the ReOrder Buffer for little tweaks to CPU state, and you use the
    Miss Buffer for latent changes to TLB and Caches, leaving only the
    branch predictor stuff. Since these change more slowly than instruction
    state you can probably add a few bits to RoB checkpointing to
    cover the branch predictor updates.

    And how many stalls will that queuing latency introduce?

    None (<1%) if the buffering is done correctly.

    Remember that it can't handle a cache miss for any load that is
    currently in the shadow of an unresolved conditional branch as that
    would allow coherence traffic to escape to the rest of a system.

    That is a requirement for Sequential Consistency but not for Causal Consistency.

    And if two cores each have their own D$L1 but a shared D$L2,
    then they can't even speculate a cache miss from L1 to L2.

    Careful, here. It is possible to speculatively access another CPU's
    cache; and before the data arrives, branch recovery cancels the reason
    why that line is showing up. Here, you CAN send the line back to the provider
    or on to DRAM even if you cannot deposit it in your cache.
    {{The OWNED state (MOESI) creates this requirement.}}

    So, under CC you CAN speculatively do this; what you cannot do is simply
    forget, as one does when repairing mispredictions, and you need strategies
    for each thing you may have inbound that will not ultimately be used
    due to misspeculation.

    Or speculatively prefetch alternate path instructions because
    that could change the state of a cache line from exclusive to shared.

    You can't put HW prefetches in the CPU caches, primarily because you
    cannot lose the line (and its state) being reallocated for the arriving line.
    But you COULD hold onto that line in something like the Miss Buffer
    until retirement.

    You can put SW prefetches in a CPU cache after the prefetch instruction retires.

    And since I have no faith that this will actually fix the problem
    but rather just move it someplace else, I see this as asking me to
    spend a lot of time and money on a fix that isn't really a fix for
    something that isn't (so far) actually a real world problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Tue Jan 2 04:29:00 2024
    On Mon, 01 Jan 2024 19:39:07 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Given that what I have read claims that it is _not_ possible to
    completely protect against Spectre and its variants without a
    significant degradation of performance,

    I argue that one can design a processor that loses no performance and remains Spectré (and most other current attack strategies) immune.

    You may be right; I'm not going to argue with that claim. I'm only
    saying that it is _generally believed_ that it isn't possible to do
    that well in dealing with Spectre and related vulnerabilities.

    If you're right, then all we need to do is make CPUs secure after
    the fashion you recommend.

    But if the naysayers are right?

    In that depressing scenario, I think that I have finally come up
    with what is needed.

    A general purpose computer, that needs to be able to execute
    software locally with high performance, but which can also
    access the Internet?

    I think that what it needs is a _double_ sandbox.

    You have the primary computer, which is built for speed. Only
    security measures with next to no overhead are incorporated in
    it. But the primary computer can't talk to the Internet.

    Instead, connecting to the Internet is the job of a secondary
    computer. This computer is permanently hardwired to be unable
    to write to the only portion of memory that it is allowed to
    execute code from - like the early Bell Telephone Electronic
    Switching System.

    So it doesn't load programs into memory from disk. That is done
    by the primary computer, which _can_ write to the memory that the
    secondary computer can execute code from.

    The secondary computer has as its major security feature that it
    can't ever alter any executable code. (It also has no direct
    access to the hard disk either.) It can be a high-performance
    computer as well. The web browser, and other Internet-facing applications
    run on the secondary computer.

    Except: since it can't load or modify executable code, it cannot
    do just-in-time compilation, the only efficient way to run an
    interpreter. So any executable content from web sites and so on,
    like JavaScript, goes somewhere else.

    And the sandbox for _that_ is the *ternary* computer. Not ternary
    like the SETUN, but just that it is the _third_ computer, after the
    primary and secondary ones.

    This computer is secured against Spectre and Rowhammer... by being
    built from 486-era technology. The CPU is in-order, so no Spectre.
    The clock rate is slow enough so that the memory is not vulnerable
    to Rowhammer-style attacks.

    When the secondary computer finds executable content in web sites,
    it doesn't pass it up to the primary computer, it passes it down
    to the ternary computer, which serves as a high-quality sandbox,
    being physically separate, and having secure hardware.

    So a malicious web site trying to use JavaScript to do bad things
    to this kind of computer... finds the JavaScript is running on a
    processor without the seemingly unavoidable Spectre and Rowhammer vulnerabilities. It *did* avoid them, completely.

    And if it finds some other way to influence the computer above the
    sandbox? That computer is a desert for attackers, as it contains no
    means of altering any code which it executes. So almost any attack
    is impossible there!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Tue Jan 2 07:59:37 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at
    NetSpectre, has everybody disconnected their computers from the
    network?

    There are many thousands of academic papers on speculative execution attacks. I cannot find a single example of even an attempt in the real world.

    Maybe you were not looking? E.g., read <https://dustri.org/b/spectre-exploits-in-the-wild.html>.

    But, you might say, the publication of this exploit does not show that
    Spectre has been used in the wild.

    But attackers usually do not announce how they got at your secret
    keys.

    On the other hand, there are many successful phishing attacks each day.

    Yes, so?

    In other words, if you could somehow fix all speculative execution vulnerabilities, it would have zero impact on the actual successful
    security breaches.

    How do you know that? If the intended victim of a phishing attack
    does not bite, and does not use the software that the attacker has
    exploits for, what does the attacker do? Let's say the attacker uses
    a Spectre exploit to get at a private key. The victim may not even
    notice this; but if the attacker later performs a ransomware attack
    that makes use of the secret key, how will the attack be explained?
    Given the myth that there are no real-world Spectre attacks, the
    investigation will settle on an unknown security breach, or, because
    "unknown" is so uncomfortable, on PEBCAK (e.g., a phishing attack).

    Rowhammer is different - it's a memory corruption hardware error.

    The commonalities with Spectre are that both are hardware errors, and
    fixed hardware for both could have been released since their
    discovery, but that has not happened.

    I'm not convinced that speculative execution leaks can all be fixed.
    They are finding new mechanisms every day.
    (uOp caches now need to be flushed on thread switch,
    function unit or register port contention as a side channel,
    speculative load forwarding attacks).

    uOp caches are microarchitectural state, speculative load forwarding
    is another speculation variant.

    Whatever speculation is used (branches, exception, memory aliasing
    (speculative load forwarding), etc.), the hardware people have managed
    for three decades to avoid leaking speculative architectural state to
    committed architectural state on misspeculation (Zenbleed is the
    exception that proves the rule). They just need to apply the same
    discipline to microarchitectural state, and they will prevent Spectre
    exploits that work through microarchitectural state, whether it's
    cache, uop cache, branch predictor, etc.

    Yes, hardware also needs to avoid revealing through resource
    contention side channels what's going on in the speculative world.
    There is work on that, too. I don't have such a nice argument for why
    the hardware people will succeed with that as I have for
    microarchitectural state, but I expect that, if they put their minds
    to it, they will succeed. OTOH, the widespread idea that Spectre and especially Spectre using resource-contention side channels is
    irrelevant in the real world may prevent them from putting their mind
    to it.

    The fact that we still have Rowhammer because one "cannot find a
    single example of even an attempt in the real world", despite it being relatively easy to fix if the memory controller people accepted it as
    their responsibility to fix it, speaks for the scenario where the
    hardware people will not put their minds to fixing Spectre, not the resource-contention side channel, and not even the
    microarchitectural-state side channel.

    I think it will be an ongoing game of whack-a-mole.

    The current approach of leaving Spectre to software mitigations is
    certainly going to be that. A pervasive hardware fix covers all
    variants, because it eliminates all side channels from speculative
    state to committed state.

    Just get rid of the low hanging fruit - the retention of branch predictors across security domain thread/process switches.

    As mentioned, Spectre can be exploited even in the scenario you have
    outlined.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism. But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the
    speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without
    training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    And that's especially the case because the attacker may be able to
    help luck in the attacker's direction by passing data to the victim
    process that results in training the branch predictor of the victim in
    a specific way. E.g., a PDF document processed by a browser will
    result in a lot of branches being taken in a certain way, which will
    train the branch predictor in a certain way.

    Javascript is not a HW security domain.
    It is the responsibility of the Javascript VM peddlers to ensure their runtime environment is secure, as they appear to have done.

    The architecture manual defines what happens when certain instructions
    are executed. E.g., when you perform an architectural bounds check,
    the access does not happen. If the microarchitecture does it
    differently, it's the responsibility of the hardware people to ensure
    that this microarchitectural stuff does not leak data through side
    channels if it can be prevented; and it can.

    The security of at least one commercial operating system (for
    Burroughs large systems) is based on this concept.

    If you cannot rely on branch instructions doing what the architecture
    manual says, why should you rely on other hardware mechanisms (that
    you may or may not consider to be "HW security domain") to do what the architecture manual says? And indeed, e.g., the page-protection
    mechanism does not prevent speculative access and revealing that
    through a side channel on a lot of hardware, either (the original
    Meltdown).

    Do the software mitigations applied by the JavaScript implementors
    prevent Spectre completely? I have my doubts. As you write, there
    are a large number of Spectre variants that they have to protect
    against, including stuff like Inception that predicts branches that
    are not there.

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap
    instructions to be at least as expensive.

    I'm not sure - it depends on the frequency of occurrence and
    the latency between Dispatch and branch condition resolution.

    Based on nothing, I'm assuming both to be small :-)

    The main reason why OoO so vastly outperforms in-order for
    general-purpose code is that execution does not have to wait for the
    branches to be resolved; this is reflected in the size of the
    outstanding branches, which is, e.g., 128 for the Golden Cove <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/#gracemont-s-out-of-order-engine>
    (look for "Branch Order Buffer"). If you throw in one NoSpeculate
    branch, this reduces this number to 0 for this branch. And given the
    number of loads in a program, you probably have to make most branches "NoSpeculate" to be safe. So you fall back to close to in-order
    performance.

    By contrast, a proper invisible-speculation fix would be much cheaper
    in performance (papers on the memory access part of invisible
    speculation give slowdown factors up to 1.2 (with some papers giving
    smaller slowdowns and even occasional speedups), and I think that the
    memory-access part will have the biggest performance impact among the
    changes necessary for a full-blown invisible-speculation fix.

    - anton

    How much extra complexity and hardware does it take to essentially
    queue all internal state changes throughout a core and all its caches
    until speculative branches resolve?

    For the caches, Mitch Alsup tells us that we just need the load
    buffers that we have anyway. In any case, it's at most one cache line
    for each outstanding load. But you also get a benefit: this works
    like an L1 that is that much larger. You can reduce the buffering
    necessary by keeping only management data (a few bits) for the loads
    that hit the L1 caches. For speculative stores, that has been in the
    store buffers since the beginning of modern OoO.

    For conditional branches, you just have to remember the outcome (1
    bit) and feed it to the branch predictor when the branch commits.
    For indirect branches, you just have to remember the target address
    (64 bits).
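
    A sketch of the per-branch record this implies (field and table names
    invented here), with the predictor tables touched only at commit:

        #include <stdint.h>
        #include <stdbool.h>

        static uint8_t  bp_table[4096];   /* 2-bit counters, updated at commit only */
        static uint64_t btb_table[1024];  /* indirect targets, updated at commit only */

        typedef struct {
            uint32_t index;        /* predictor/BTB index computed at fetch */
            bool     is_indirect;  /* needs a 64-bit target, not just 1 bit */
            bool     taken;        /* outcome of a conditional branch       */
            uint64_t target;       /* target of an indirect branch          */
        } pending_bp_update_t;

        /* Called only when the branch commits; a squashed branch's record is
           simply discarded, so misspeculation leaves no trace in the predictor. */
        void commit_bp_update(const pending_bp_update_t *u)
        {
            if (u->is_indirect) {
                btb_table[u->index % 1024] = u->target;
            } else {
                uint8_t *c = &bp_table[u->index % 4096];
                if (u->taken)  { if (*c < 3) (*c)++; }
                else           { if (*c > 0) (*c)--; }
            }
        }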

    And how many stalls will that queuing latency introduce?
    Remember that it can't handle a cache miss for any load that is
    currently in the shadow of an unresolved conditional branch as that
    would allow coherence traffic to escape to the rest of a system.

    Above you argued that NoSpeculate branches are cheap. I am not
    convinced of that, but if they are, making state-changing memory accesses NoSpeculate is certainly relatively cheap. That's because a
    state-changing memory access is relatively expensive anyway, so adding
    maybe 20 cycles until the memory access is no longer speculative does not
    make it much more expensive. Also, such memory accesses are quite
    rare, with, as you say, cache misses probably being the most common
    cause.

    And if two cores each have their own D$L1 but a shared D$L2,
    then they can't even speculate a cache miss from L1 to L2.

    As long as the cache miss does not change the state of the L2 cache
    line and the resource-contention side channel is eliminated, you
    certainly can. But, yes, bigger private caches are likely to help.
    Now the trend in recent years has been towards bigger private caches
    for the P-cores; e.g., Raptor Lake has 2MB of private L2 cache.

    Actually, if it turns out to be a big issue, you can perform
    a speculative load from, e.g., a line that is exclusive to a different
    core without changing the state. That load then speculates on the
    content of the line not changing, and this speculation has to be
    verified (and the line changed to shared) when the load is about to be committed.

    Or speculatively prefetch alternate path instructions because
    that could change the state of a cache line from exclusive to shared.

    Please elaborate. What "alternative path"?

    The hardware prefetcher must only be trained on architectural memory
    accesses (to avoid Spectre), and any software prefetch instructions
    have the same speculation restrictions as the actual loads. What a
    software prefetch instruction would achieve here is to become
    architectural many cycles before the actual load and switch the cache
    line to shared and actually put the cache line into the local cache at
    that time.

    And since I have no faith that this will actually fix the problem
    but rather just move it someplace else, I see this as asking me to
    spend a lot of time and money on a fix that isn't really a fix for
    something that isn't (so far) actually a real world problem.

    A software guy can point to phishing and say the same thing about
    fixing software vulnerabilities before the vulnerability is proven to
    be exploited by real-world attackers.

    The typical approach taken in the software world is to not take claims
    of security problems seriously without a demonstrated exploit, but to
    act when an exploit has been presented.

    In the case of Spectre, a lot of exploits have been presented (you
    wrote "many thousand"), and the hardware people should have started
    fixing the hardware in 2017, when they learned about Spectre, given
    the lead time of hardware designs. Instead, they sit on their hands
    and let the users tell each other that it's Somebody Else's Problem.

    And no, it's not too difficult. The pieces have been published years
    ago.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to MitchAlsup on Tue Jan 2 16:14:57 2024
    On Mon, 1 Jan 2024 19:39:07 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

    Quadibloc wrote:

    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    Given that what I have read claims that it is _not_ possible to
    completely protect against Spectre and its variants without a
    significant degradation of performance,

    I argue that one can design a processor that loses no performance
    and remains Spectré (and most other current attack strategies) immune.

    my favored solution is to divide the CPU into two
    parts; one made immune to Spectre, which runs Internet-facing code
    which might be menaced by it, and another in which mitigations are
    not applied.

    This solution, though, basically has not been considered, and that
    is for an obvious reason: it is insecure. Once malicious software
    has found a way through some other vulnerability to insinuate
    itself into the "trusted" code of the computer, then it won't be
    stopped from making use of Spectre to further its progress.

    So it's not enough to put the Internet in a sandbox, we need new
    and better ideas about how to put a secure wall around that sandbox
    - to securely limit how it interacts with the rest of the computer.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to MitchAlsup on Tue Jan 2 16:24:52 2024
    On Mon, 1 Jan 2024 19:39:07 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

    Quadibloc wrote:

    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    Given that what I have read claims that it is _not_ possible to
    completely protect against Spectre and its variants without a
    significant degradation of performance,

    I argue that one can design a processor that loses no performance
    and remains Spectré (and most other current attack strategies) immune.


    The most important 'current attack strategies' are the same as they were 30
    years ago. Side channels, row hammers etc. are good to write papers
    about and to scare a few people. As far as real-world threats go, they
    are of very low importance.
    If an attacker found a way to run an arbitrary binary on my computer, at my
    normal non-root privilege, he has 1000 easier ways (than side channels)
    to achieve his goals.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Wed Jan 3 10:38:58 2024
    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the
    shared/exclusive/owned
    line state and transfer of whole lines into, out of, and between caches.
    However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same STM.}}

    It's difficult because there are so many options and possible optimizations.

    In the base design (no Directory Controller optimization) a guard range
    for bytes in a single line which does not have any shared bytes
    (that is, no read-read shared or adjacent false shared bytes)
    requires one request and one reply message to each peer core in a system.
    For C cores that's 2*(C-1) msgs.

    For unshared lines new guard requests in the same line require
    no new messages.

    If a line is shared between two different transactions, either both
    read share the same bytes, or read-write or write-write adjacent bytes,
    then each new guard range instruction requires a request and reply
    BUT only with the cores that are sharing lines.
    This would likely only happen if there were multiple objects that just
    happen to reside in the same cache line and are accessed by two or more transactions.

    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network.
    This eliminates all the queuing that cache coherence messages must transit.

    When an ATX message arrives at an ATM, processing it requires no external information as it is all in the local lookup tables indexed by a CAM on
    the line physical address. The CAM and tables likely require at least
    2R2W ports, one for new local commands and one for inbound ATX messages. Processing ATX messages and sending a reply should be at least 1 per clock
    (and more table ports get more messages per clock).

    At the instruction level a core can have multiple guard requests outstanding
    at once, and can optionally overlap this with probing the local cache for
    hits and fetching missed line data either read_share or read_excl.

    The cost that is most difficult to estimate is the collision rate.
    The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
    For N contenders each transaction can worst case fail N-1 times then succeed. So N contenders trying N times is worst case O(N^2) cost.
    But it also depends on how far each is into the transaction and how much investment it has in that transaction when it collides, loses, and retries.
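
    One way to put numbers on that bound (reading the FIFO ordering as: the
    k-th oldest contender loses at most to the k-1 older ones before it
    wins, so the system-wide total is 1 + 2 + ... + N = N*(N+1)/2 attempts):

        #include <stdio.h>

        /* Worst-case attempt count for N FIFO-ordered contenders: O(N^2)
           system-wide even though each transaction is bounded at N tries. */
        int main(void)
        {
            for (int n = 2; n <= 64; n *= 2) {
                long attempts = (long)n * (n + 1) / 2;
                printf("%2d contenders: at most %4ld total attempts\n", n, attempts);
            }
            return 0;
        }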

    So best case, a transaction has perfect overlap for all its guard range requests plus fetching all its cache lines and no shared lines.
    The latency could be about the time to fetch a single line.

    Worst case, the guard ranges are serialized, line fetches are serialized,
    every line is shared, every transaction collides and retries max times.
    Because of FIFO ordering that number is bounded, but it might be big.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Jan 3 17:13:59 2024
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the
    shared/exclusive/owned
    line state and transfer of whole lines into, out of, and between caches. However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same
    STM.}}

    It's difficult because there are so many options and possible optimizations.

    In the base design (no Directory Controller optimization) a guard range
    for bytes in a single line which does not have any shared bytes
    (that is, no read-read shared or adjacent false shared bytes)
    requires one request and one reply message to each peer core in a system.
    For C cores that's 2*(C-1) msgs.

    For unshared lines new guard requests in the same line require
    no new messages.

    If a line is shared between two different transactions, either both
    read share the same bytes, or read-write or write-write adjacent bytes,
    then each new guard range instruction requires a request and reply
    BUT only with the cores that are sharing lines.
    This would likely only happen if there were multiple objects that just
    happen to reside in the same cache line and are accessed by two or more transactions.

    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network. This eliminates all the queuing that cache coherence messages must transit.

    When an ATX message arrives at an ATM, processing it requires no external information as it is all in the local lookup tables indexed by a CAM on
    the line physical address. The CAM and tables likely require at least
    2R2W ports, one for new local commands and one for inbound ATX messages. Processing ATX messages and sending a reply should be at least 1 per clock (and more table ports get more messages per clock).

    At the instruction level a core can have multiple guard requests outstanding at once, and can optionally overlap this with probing the local cache for hits and fetching missed line data either read_share or read_excl.

    The cost that is most difficult to estimate is the collision rate.
    The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
    For N contenders each transaction can worst case fail N-1 times then succeed. So N contenders trying N times is worst case O(N^2) cost.
    But it also depends on how far each is into the transaction and how much investment it has in that transaction when it collides, loses, and retries.

    So best case, a transaction has perfect overlap for all its guard range requests plus fetching all its cache lines and no shared lines.
    The latency could be about the time to fetch a single line.

    Worst case, the guard ranges are serialized, line fetches are serialized, every line is shared, every transaction collides and retries max times. Because of FIFO ordering that number is bounded, but it might be big.


    Thanks for the lucid explanation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Jan 3 19:59:27 2024
    EricP wrote:
    MitchAlsup wrote:

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same
    STM.}}

    The cost that is most difficult to estimate is the collision rate.
    The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
    For N contenders each transaction can worst case fail N-1 times then
    succeed.
    So N contenders trying N times is worst case O(N^2) cost.
    But it also depends on how far each is into the transaction and how much investment it has in that transaction when it collides, loses, and retries.

    Another consideration is the failure retry loop latency.
    If the loser in a collision immediately loops around and tries again, that
    might flood the coherence network with messages. If the loser backs off
    for a period of time, it can spend too much time in the back-off.

    The ATSTART instruction has an option to request a notification message
    after a collision that causes an abort.

    When there is a collision for access to bytes between two transactions
    the transaction number assigned at the start is used to decide who wins,
    the lower number is the older and has priority.

    The ATX guard message includes the 64-bit transaction number for deciding
    the winner, plus a 16-bit "try" number, the guard access read or write,
    a bit vector of which line bytes this applies to,
    and a bit requesting a notify message if access denied.

    If a notify is requested the winner in a collision remembers the
    transaction and try numbers of the loser and sends a Denied reply.
    The Denied causes the loser to abort and jump to its fail address where it executes an ATWAITNFY instruction to await a notification message with its transaction and try number (similar to the x86 MWAIT instruction).

    After the winner eventually commits, for each losing collider that
    requested notification it sends a message with the transaction and try
    number to the loser to wake it up as soon as possible. The ATWAITNFY instruction matches the transaction and try numbers and continues.

    It looks useful in theory but needs to be tested.
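
    In code the intended retry shape is roughly this (atx_start/atx_commit/
    atx_waitnfy are made-up stand-ins for ATSTART/COMMIT/ATWAITNFY, the
    abort-and-jump-to-the-fail-address is modeled as atx_commit() returning
    false, and the stubs exist only so the sketch compiles):

        /* Hypothetical stand-ins; a loser that requested notification does
           not spin or blindly back off, it parks in atx_waitnfy() until the
           winner commits and sends the wake-up carrying its try number. */
        #define ATX_NOTIFY_ON_DENY 1
        static void atx_start(int flags, int try_no) { (void)flags; (void)try_no; }
        static int  atx_commit(void)                 { return 1; }
        static void atx_waitnfy(int try_no)          { (void)try_no; }

        void run_transaction(void (*body)(void))
        {
            for (int try_no = 0; ; try_no++) {
                atx_start(ATX_NOTIFY_ON_DENY, try_no); /* request notify on deny */
                body();                                /* guards, loads, stores ... */
                if (atx_commit())
                    return;                            /* success */
                /* fail path: we lost a collision and asked for a notification */
                atx_waitnfy(try_no);                   /* sleep until the winner commits */
            }
        }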

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Fri Jan 5 15:23:31 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch. Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    That was the simplest form. A more sophisticated version could have
    a 2 or 3 bit tag like an ASID on each branch predictor entry.
    Tag 0 is for super mode, others are for the most recent 3 or 7 processes
    run on this core. If a lookup hits on an entry with a different tag
    then the entry is set to its uninitialized state.

    Though I'm not sure how this could work with the return stack predictor. Probably have to keep 4 or 8 copies of it.
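
    A toy software model of the tagged-entry idea (field names are invented,
    and a real design would CAM on the tag so a whole domain can be
    invalidated at once rather than lazily on lookup):

        #include <stdint.h>

        #define BP_ENTRIES 4096

        /* Each 2-bit counter carries a small domain tag (0 = super mode,
           1..7 = recent user processes).  A lookup that hits an entry with a
           foreign tag treats it as uninitialized and re-claims it, so one
           domain's training never steers another domain's predictions. */
        typedef struct {
            unsigned ctr   : 2;   /* 2-bit saturating counter */
            unsigned tag   : 3;   /* security-domain tag      */
            unsigned valid : 1;
        } bp_entry_t;

        static bp_entry_t bp[BP_ENTRIES];

        int bp_predict(uint32_t index, unsigned domain_tag)
        {
            bp_entry_t *e = &bp[index % BP_ENTRIES];
            if (!e->valid || e->tag != (domain_tag & 7)) {  /* empty or foreign: reset */
                e->ctr = 1;                                 /* weakly not-taken        */
                e->tag = domain_tag & 7;
                e->valid = 1;
            }
            return e->ctr >= 2;                             /* predict taken           */
        }

        void bp_update(uint32_t index, unsigned domain_tag, int taken)
        {
            bp_entry_t *e = &bp[index % BP_ENTRIES];
            if (e->valid && e->tag == (domain_tag & 7)) {
                if (taken  && e->ctr < 3) e->ctr++;
                if (!taken && e->ctr > 0) e->ctr--;
            }
        }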

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    Yes, only flush if the new thread is in a different process.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    IMO this "optimization" is not worth the security hole.

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism. But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    And that's especially the case because the attacker may be able to
    help luck in the attacker's direction by passing data to the victim
    process that results in training the branch predictor of the victim in
    a specific way. E.g., a PDF document processed by a browser will
    result in a lot of branches being taken in a certain way, which will
    train the branch predictor in a certain way.


    - I have an extensive set of conditional trap instructions intended for bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.
    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap
    instructions to be at least as expensive.
    I'm not sure - it depends on the frequency of occurrence and
    the latency between Dispatch and branch condition resolution.

    Based on nothing, I'm assuming both to be small :-)

    The main reason why OoO so vastly outperforms in-order for
    general-purpose code is that execution does not have to wait for the
    branches to be resolved; this is reflected in the size of the
    outstanding branches, which is, e.g., 128 for the Golden Cove <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/#gracemont-s-out-of-order-engine>
    (look for "Branch Order Buffer"). If you throw in one NoSpeculate
    branch, this reduces this number to 0 for this branch. And given the
    number of loads in a program, you probably have to make most branches "NoSpeculate" to be safe. So you fall back to close to in-order
    performance.

    BTW near as I can tell the "Branch Order Buffer" is just Intel's name
    for what others call the register file rename checkpoint.

    As I saw it, the NoSpeculate hint only stalls to wait for its branch,
    not all pending branches. The idea being that most branches are not for
    array bounds checks so don't need guarding, and for those that are bounds checks the array access LD's and ST's likely immediately follow.
    But it could also be used to eliminate timing variations in crypto loops.

    But yes, it does stall all following instructions.
    This was intended as a simple option that user or compilers can apply.
    To guard just a subset of the following instructions would require
    a mechanism like full predication which would be much more expensive.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Fri Jan 5 20:57:37 2024
    EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch. Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    That was the simplest form. A more sophisticated version could have
    a 2 or 3 bit tag like an ASID on each branch predictor entry.

    Which halves the number of entries you can store or worse. Remember
    branch prediction states are un-tagged 2-bit saturating counters.

    Tag 0 is for super mode, others are for the most recent 3 or 7 processes
    run on this core. If a lookup hits on an entry with a different tag
    then the entry is set to its uninitialized state.

    Though I'm not sure how this could work with the return stack predictor. Probably have to keep 4 or 8 copies of it.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    Yes, only flush if the new thread is in a different process.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    IMO this "optimization" is not worth the security hole.

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism. But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the
    speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without
    training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    My 66000 PREDication does not use the branch prediction tables.

  • From EricP@21:1/5 to MitchAlsup on Sat Jan 6 13:16:03 2024
    MitchAlsup wrote:
    EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core, would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    That was the simplest form. A more sophisticated version could have
    a 2 or 3 bit tag like an ASID on each branch predictor entry.

    Which halves the number of entries you can store or worse. Remember
    branch prediction states are un-tagged 2-bit saturating counters.

    I am under the impression that modern conditional branch predictors
    use many more bits so they can track loop patterns.
    Many BP's already use CAM address tags to improve accuracy.

    A Survey of Techniques for Dynamic Branch Prediction, 2018 https://arxiv.org/abs/1804.00261

    But even if it just costs 2 bits then so be it, that's the cost.
    This 2-bit ASID tag needs to be a CAM so you can invalidate
    all entries of a specific tag for recycle in one clock.

    The branch target buffers would also need CAM tags.
    Return stack predictor would need separate tables.
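
    For illustration only, here is a minimal C model of the scheme described
    above, with made-up sizes and names: a 2-bit counter table where each
    entry carries a small ASID-like tag, a lookup under a different tag
    reinitializes the entry, and a tag value can be bulk-recycled. In hardware
    the tag match and flash invalidate would be done by a CAM in one clock,
    not by the sweep loop shown here.

        #include <stdint.h>
        #include <stdbool.h>

        #define BP_ENTRIES 4096            /* illustrative table size */
        #define TAG_SUPER  0               /* tag 0 reserved for super mode */

        struct bp_entry {
            unsigned tag : 3;              /* super mode + up to 7 recent processes */
            unsigned ctr : 2;              /* 2-bit saturating counter */
        };

        static struct bp_entry bp[BP_ENTRIES];

        /* Predict taken/not-taken; a hit under a different tag is treated as
           uninitialized and reset to weakly not-taken for the current tag. */
        bool bp_predict(uint32_t index, unsigned cur_tag)
        {
            struct bp_entry *e = &bp[index % BP_ENTRIES];
            if (e->tag != cur_tag) {
                e->tag = cur_tag;
                e->ctr = 1;                /* weakly not-taken */
            }
            return e->ctr >= 2;
        }

        void bp_update(uint32_t index, unsigned cur_tag, bool taken)
        {
            struct bp_entry *e = &bp[index % BP_ENTRIES];
            if (e->tag != cur_tag)
                return;                    /* entry lost to another tag */
            if (taken  && e->ctr < 3) e->ctr++;
            if (!taken && e->ctr > 0) e->ctr--;
        }

        /* Recycle a tag for a new process: hardware would flash-invalidate
           all matching entries via the CAM in one clock. */
        void bp_recycle_tag(unsigned tag)
        {
            for (int i = 0; i < BP_ENTRIES; i++)
                if (bp[i].tag == tag)
                    bp[i] = (struct bp_entry){ .tag = tag, .ctr = 1 };
        }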

    Tag 0 is for super mode, others are for the most recent 3 or 7 processes
    run on this core. If a lookup hits on an entry with a different tag
    then the entry is set to its uninitialized state.

    Though I'm not sure how this could work with the return stack predictor.
    Probably have to keep 4 or 8 copies of it.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    Yes, only flush if the new thread is in a different process.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    IMO this "optimization" is not worth the security hole.

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display
    mechanism.
    But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the
    speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without
    training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    My 66000 PREDication does not use the branch prediction tables.

    There has been research on interaction between conditional branches and predicated code, mostly from around the Itanium time. Basically, when you
    move some execution from conditional to predicated it changes the stats
    for the branches that remain.

    I have also seen mention of "predication predictors"; I think the idea was to
    elide instructions whose predicates are predicted false from the stream.

  • From MitchAlsup@21:1/5 to EricP on Sat Jan 6 18:58:51 2024
    EricP wrote:

    MitchAlsup wrote:

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    My 66000 PREDication does not use the branch prediction tables.

    There has been research on interaction between conditional branches and predicated code, mostly from around the Itanium time. Basically, when you move some execution from conditional to predicated it changes the stats
    for the branches that remain.

    One would expect that. In My 66000, predication is <now> done clause by clause.
    Under predication, one expects to have fetched at least to the last instruction
    of the else-clause by the time the branch condition is known. So fetch
    redirection is unwarranted (no need to branch, just skip the then-clause
    or skip the else-clause).
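
    As a rough software analogy (C, not My 66000 code), if-conversion keeps
    both clauses in the straight-line instruction stream and lets the
    condition select the result, so no fetch redirect is needed:

        /* Branchy form: fetch is redirected around whichever clause is skipped. */
        int branchy(int x, int a, int b)
        {
            int y;
            if (x > 0)
                y = a + b;      /* then-clause */
            else
                y = a - b;      /* else-clause */
            return y;
        }

        /* If-converted form: both clauses sit in the straight-line stream and
           the predicate merely selects (or, in hardware, nullifies) one of them. */
        int predicated(int x, int a, int b)
        {
            int t_then = a + b;
            int t_else = a - b;
            return (x > 0) ? t_then : t_else;   /* typically a conditional select */
        }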

    I have also seen mention of "predication predictors"; I think the idea was to
    elide instructions whose predicates are predicted false from the stream.

    Sooner or later this will help, but I don't see a need at 6-wide, yet.

  • From EricP@21:1/5 to EricP on Sun Jan 7 12:23:23 2024
    EricP wrote:
    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differ from ESM:
    in how transactions are negotiated.

    I also have a Cache Coherence (CC) protocol managing the shared/exclusive/owned
    line state and the transfer of whole lines into, out of, and between caches.
    However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and know nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges; it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.


    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network. This eliminates all the queuing that cache coherence messages must transit.

    I had something of an epiphany last night that I thought I'd pass on.

    The cache coherence protocol (CCP) supports communication between
    cache coherency managers (CCM) which they currently use to negotiate
    cache line ownership. The various L1, L2, L3 level managers pass CCP
    messages up and down the hierarchy between themselves over comms queues,
    and out over the inter-core network.

    I had been thinking that my Atomic Transaction Manager (ATM) would be
    located at the end of the Load Store Queue just before the cache and CCM.
    The ATM can intercept LSQ commands to the cache and modify them,
    to tuck aside a store in a transaction, or send commands to the
    LSQ itself, such as to command the LSQ dependency matrix to stall
    all memory ops to a particular cache line address.

    Since the ATX messages do not directly interact with the local cache
    they can bypass the level comms queues and flow directly between
    the ATM and the coherence network.

          Core-0                      Core-1
    LSQ<=>ATM<=>L1_CCM          L1_CCM<=>ATM<=>LSQ
           ^      |                |      ^
           |    L2_CCM          L2_CCM    |
           |      |                |      |
           |    L3_CCM          L3_CCM    |
           |      |                |      |
           v      v                v      v
           network<---------------->network


    Instead of thinking of the ATM as an independent unit attached to the LSQ,
    what if I see it as a sub-unit of the LSQ? That would make the ATX messages
    used for negotiating transactions just an example of a general concept of
    *messages sent between LSQ's to coordinate their operation*,
    just as CCP messages are sent between CCM's.

    In short, messaging directly between the LSQ's managers on different
    cores is potentially a whole new class of coherence and control.

    So the question is: besides my atomic transactions,
    what else might LSQ's want to say directly to each other?
    And remember the LSQ has other resources, the ordered queue of LD/ST ops,
    the address CAM's and op dependency matrix, pending store data, the TLB.

    For example, TLB shootdown is one that's already available on some cores
    but now could be seen as part of this general class of LSQ messaging.
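
    As a purely illustrative sketch (every name and field below is made up,
    not part of ATX as described), a C definition of such an LSQ-to-LSQ
    message family might look like this, with guard-range negotiation and
    TLB shootdown as two members of the same class:

        #include <stdint.h>

        /* Hypothetical LSQ-to-LSQ message classes carried on the coherence
           network, bypassing the L1/L2/L3 cache comms queues. */
        enum lsq_msg_kind {
            ATX_GUARD_READ_REQ,    /* request read permission on a guard byte range */
            ATX_GUARD_WRITE_REQ,   /* request write permission on a guard byte range */
            ATX_GUARD_GRANT,       /* peer ATM grants the requested permission */
            ATX_GUARD_DENY,        /* contention lost: requester must abort */
            ATX_GUARD_RELEASE,     /* transaction committed or aborted, drop guards */
            TLB_SHOOTDOWN,         /* invalidate a translation in the peer's TLB */
        };

        struct lsq_msg {
            enum lsq_msg_kind kind;
            uint16_t src_core;     /* sending LSQ/ATM */
            uint16_t dst_core;     /* receiving LSQ/ATM (or broadcast) */
            uint64_t addr;         /* start of guard byte range, or page address */
            uint32_t len;          /* length in bytes of the guarded range */
            uint32_t txn_id;       /* transaction identity for win/lose arbitration */
        };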

  • From MitchAlsup@21:1/5 to EricP on Sun Jan 7 19:52:42 2024
    EricP wrote:

    EricP wrote:
    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differ from ESM:
    in how transactions are negotiated.

    I also have a Cache Coherence (CC) protocol managing the shared/exclusive/owned
    line state and the transfer of whole lines into, out of, and between caches.
    However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and know nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges; it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.


    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network.
    This eliminates all the queuing that cache coherence messages must transit.

    I had something of an epiphany last night that I thought I'd pass on.

    The cache coherence protocol (CCP) supports communication between
    cache coherency managers (CCM) which they currently use to negotiate
    cache line ownership. The various L1, L2, L3 level managers pass CCP
    messages up and down the hierarchy between themselves over comms queues,
    and out over the inter-core network.

    I had been thinking that my Atomic Transaction Manager (ATM) would be
    located at the end of the Load Store Queue just before the cache and CCM.
    The ATM can intercept LSQ commands to the cache and modify them,
    to tuck aside a store in a transaction, or send commands to the
    LSQ itself, such as to command the LSQ dependency matrix to stall
    all memory ops to a particular cache line address.

    Since the ATX messages do not directly interact with the local cache
    they can bypass the level comms queues and flow directly between
    the ATM and the coherence network.

          Core-0                      Core-1
    LSQ<=>ATM<=>L1_CCM          L1_CCM<=>ATM<=>LSQ
           ^      |                |      ^
           |    L2_CCM          L2_CCM    |
           |      |                |      |
           |    L3_CCM          L3_CCM    |
           |      |                |      |
           v      v                v      v
           network<---------------->network


    Instead of thinking of the ATM as an independent unit attached to the LSQ,
    what if I see it as a sub-unit of the LSQ? That would make the ATX messages
    used for negotiating transactions just an example of a general concept of
    *messages sent between LSQ's to coordinate their operation*,
    just as CCP messages are sent between CCM's.

    The only thing I would add is that the delay from ATM to network may be
    multiple cycles, since L2 and L3 are each larger in diameter than the
    distance a signal can travel in wire per clock.

    In short, messaging directly between the LSQ's managers on different
    cores is potentially a whole new class of coherence and control.

    So the question is: besides my atomic transactions,
    what else might LSQ's want to say directly to each other?
    And remember the LSQ has other resources, the ordered queue of LD/ST ops,
    the address CAM's and op dependency matrix, pending store data, the TLB.

    You probably only want the translated PAs in the LSQ, not the TLB itself.

    For example, TLB shootdown is one that's already available on some cores
    but now could be seen as part of this general class of LSQ messaging.

    I found it better to just define the TLB as coherent--and get rid of
    TLB shootdowns entirely (no IPIs for PTE downgrades).
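
    For contrast, here is a hedged C sketch of the conventional IPI-based
    shootdown path that a coherent TLB would eliminate. The helper names
    (send_ipi, local_tlb_invalidate, shootdown_acked) are hypothetical
    stand-ins, not any particular OS's API.

        #include <stdint.h>
        #include <stdbool.h>

        #define MAX_CORES 64

        /* Hypothetical platform hooks -- stand-ins, not a real OS API. */
        extern void send_ipi(int core);                    /* interrupt a remote core */
        extern void local_tlb_invalidate(uint64_t vaddr);  /* drop one local translation */
        extern bool shootdown_acked(int core);             /* remote handler finished? */

        /* Conventional shootdown after downgrading a PTE: every core that might
           cache the old translation is interrupted, invalidates it, and the
           initiator spins until all of them acknowledge.  With a coherent TLB
           the PTE downgrade propagates through the coherence fabric and this
           whole routine disappears. */
        void tlb_shootdown(uint64_t vaddr, const bool active[MAX_CORES], int self)
        {
            local_tlb_invalidate(vaddr);

            for (int c = 0; c < MAX_CORES; c++)
                if (active[c] && c != self)
                    send_ipi(c);             /* remote handler invalidates vaddr */

            for (int c = 0; c < MAX_CORES; c++)
                while (active[c] && c != self && !shootdown_acked(c))
                    ;                        /* wait for acknowledgements */
        }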
