• Hardware Transactional Memory approaches (was Superior architecture style)

    From EricP@21:1/5 to MitchAlsup on Sat Dec 30 13:46:33 2023
    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit
    from your original ASF proposal in 2010 that got me thinking about HTM.

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    your ESM is centralized. ATX negotiates the transaction guard range
    collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality in implementation is buffering the updates outside the
    cache so the transaction is not sensitive to cache associativity evictions
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger transaction member line set (I wanted 16 lines as a minimum).

  • From MitchAlsup@21:1/5 to EricP on Sat Dec 30 19:57:22 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit from your original ASF proposal in 2010 that got me thinking about HTM.

    Make that 2005±

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    No, I did not do it that way. I chose not to restore the registers, and
    made the compiler have to forget the now stale variables from the event.
    I did this mostly because my implementation does not count on branches so
    there may not be a checkpoint to assist in backup. Control transfer to the control point is not considered a branch--because it is automagic.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    I do allow this. AND this is why each participant has to announce itself. {{That is: there is not something that starts an event and another thing
    that ends an event and everything inside is participating in the event.}}

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    That and debugging (but perhaps that is what you meant.)

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    yours ESM is centralized. ATX negotiates the transaction guard range collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality on implementation is on buffering the updates outside the cache so the transaction is not sensitive to cache associativity evicts
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger transaction member line set (I wanted 16 lines as a minimum).

    16 lines but only 1 read set (start:end) and 1 write set (start:end) ??

  • From EricP@21:1/5 to MitchAlsup on Sat Dec 30 17:08:00 2023
    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit
    from your original ASF proposal in 2010 that got me thinking about HTM.

    Make that 2005±

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the
    starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    No, I did not do it that way. I chose not to restore the registers, and
    made the compiler have to forget the now stale variables from the event.
    I did this mostly because my implementation does not count on branches so there may not be a checkpoint to assist in backup. Control transfer to the control point is not considered a branch--because it is automagic.

    Ok, bad analogy but the result is the same: the registers are not restored.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    I do allow this. AND this is why each participant has to announce itself. {{That is: there is not something that starts an event and another thing
    that ends an event and everything inside is participating in the event.}}

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    That and debugging (but perhaps that is what you meant.)

    Debugging too but I was thinking that a transaction might want to use
    a register to hold an internal counter indicating how far it made it
    into the transaction when it aborted. That might help the abort code
    avoid a subsequent collision.

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    yours ESM is centralized. ATX negotiates the transaction guard range
    collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality on implementation is on buffering the updates outside the
    cache so the transaction is not sensitive to cache associativity evicts
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger
    transaction member line set (I wanted 16 lines as a minimum).

    16 lines but only 1 read set (start:end) and 1 write set (start:end) ??

    There can be as many guard byte ranges as you want, and they can straddle
    multiple cache line boundaries, as long as the total bytes under
    transaction guard protection fit within 16 cache lines (of 64 bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object,
    then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the transaction.

    The number 16 comes from wanting up to 8 smallish objects in a transaction,
    with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)
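
    To make the arithmetic concrete, here is a sketch in C of the kind of node
    being described; the field names, the extra key field, and 64-bit pointer
    sizes are my assumptions for illustration:

    #include <stdint.h>

    /* Hypothetical AVL node: three pointers plus a depth count.
       On a 64-bit machine this is ~40 bytes, but a heap allocator gives
       no 64-byte alignment guarantee, so the node may start near the end
       of one cache line and spill into the next -- hence the budget of
       two lines per object, 8 objects, 16 lines. */
    struct avl_node {
        struct avl_node *left;
        struct avl_node *right;
        struct avl_node *parent;
        int64_t          depth;
        int64_t          key;     /* assumed payload */
    };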

    That 16 sets the size of the CAM and number of cache line buffers holding
    pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges.
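
    A minimal sketch of that splitting step, under the stated assumptions of
    64-byte lines and a 16-line budget (the struct layout and function name
    here are mine, not part of the ATX definition):

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SIZE 64u
    #define MAX_LINES 16u

    struct line_range {
        uintptr_t line_addr;     /* line-aligned base address */
        uint8_t   first, last;   /* guarded bytes within that line */
    };

    /* Split one guard request (addr, count) into per-cache-line byte
       ranges. Returns the number of ranges, or -1 if this request alone
       would exceed the 16-line budget. */
    static int split_guard(uintptr_t addr, size_t count,
                           struct line_range out[MAX_LINES])
    {
        int n = 0;
        uintptr_t end = addr + count;              /* one past the last byte */
        while (addr < end) {
            uintptr_t line = addr & ~(uintptr_t)(LINE_SIZE - 1);
            uintptr_t stop = (line + LINE_SIZE < end) ? line + LINE_SIZE : end;
            if (n == (int)MAX_LINES)
                return -1;                         /* would overflow the CAM */
            out[n].line_addr = line;
            out[n].first = (uint8_t)(addr - line);
            out[n].last  = (uint8_t)(stop - line - 1);
            n++;
            addr = stop;
        }
        return n;
    }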

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY
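
    To show how these might be strung together, here is a hedged sketch in C
    using hypothetical intrinsic wrappers for the instructions above. The
    setjmp-style return convention for ATSTART, the wrapper names, and the
    zero option values are illustration-only assumptions, not part of the
    proposal:

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed C bindings: atstart() returns 0 when the transaction starts
       and returns the abort status when control re-arrives at the abort
       point (the real instruction takes an abort_offset). */
    extern long atstart(unsigned options);
    extern void atgrdr(const void *addr, size_t bytes, unsigned options);
    extern void atgrdw(const void *addr, size_t bytes, unsigned options);
    extern long atcommit(void);

    struct node { struct node *prev, *next; long payload; };

    /* Unlink *n from a doubly linked list under ATX protection. */
    long list_unlink(struct node *n)
    {
        for (;;) {
            long status = atstart(0);
            if (status == 0) {                 /* transaction path */
                /* Read-guard the node, then its neighbours. */
                atgrdr(n, sizeof *n, 0);
                atgrdr(n->prev, sizeof *n, 0);
                atgrdr(n->next, sizeof *n, 0);
                /* Upgrade the neighbours to write guards and relink. */
                atgrdw(n->prev, sizeof *n, 0);
                atgrdw(n->next, sizeof *n, 0);
                n->prev->next = n->next;
                n->next->prev = n->prev;
                return atcommit();             /* publish both stores */
            }
            /* Abort path: registers are preserved, so 'status' can carry
               the reason for the loss; back off and retry. */
        }
    }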

  • From MitchAlsup@21:1/5 to EricP on Sun Dec 31 00:27:52 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:
    MitchAlsup wrote:

    If an exception occurs in the store (manifestation) section of an ESM
    ATOMIC event, the event fails and none of the stores appear to have
    been performed.

    If an interrupt occurs in the store (manifestation) section of an ESM
    ATOMIC event, if all the stores can be committed they will be and the
    event succeeds, and if they cannot (or it cannot be determined that they
    can) the event fails and none of the stores appear to have been
    performed.

    In any event, if the thread performing the event cannot be completed,
    due to a transfer of control to a more privileged operation, the event fails,
    control appears to have been transferred to the event control point, and
    then control is transferred to the more privileged thread.

    Yes, my HTM has some similarities.

    Yes, I see lots of similarities--most of the differences are down in
    the minutia.

    Not too many similarities - my latest ATX design has diverged quite a bit
    from your original ASF proposal in 2010 that got me thinking about HTM.

    Make that 2005±

    For example, I think we both switched from the ASF/RTM approach viewing
    a transaction abort as an exception where registers are rolled back to the
    starting state, to one which views an abort like a branch that preserves
    the register values so that data can be passed from inside a transaction
    to outside.

    No, I did not do it that way. I chose not to restore the registers, and
    made the compiler have to forget the now stale variables from the event.
    I did this mostly because my implementation does not count on branches so
    there may not be a checkpoint to assist in backup. Control transfer to the
    control point is not considered a branch--because it is automagic.

    Ok, bad analogy but the result is the same: the registers are not restored.

    And following on from that, I think I adopted your idea of allowing
    reads and writes to non-transactional memory while other transaction
    member memory is protected. Again this is to allow values to be passed
    from inside a transaction to outside.

    I do allow this. AND this is why each participant has to announce itself.
    {{That is: there is not something that starts an event and another thing
    that ends an event and everything inside is participating in the event.}}

    But both of those changes are based on the problems people encountered
    trying to use RTM and finding there was no way to get transaction
    management information from inside the transaction to outside.

    That and debugging (but perhaps that is what you meant.)

    Debugging too but I was thinking that a transaction might want to use
    a register to hold an internal counter indicating how far it made it
    into the transaction when it aborted. That might help the abort code
    avoid a subsequent collision.

    You could use a counter, or you could dump a bunch of intermediate state
    into a buffer and print it so you can see the instantaneous state of the process while the event transpired.

    My latest ATX instructions are completely different from ASF and ESM.
    ATX uses dynamically defined guard byte address ranges - guard for read
    and guard for write. Once guard byte ranges are established, LD's and ST's
    inside the guard ranges are transactionally protected, those outside are not.
    Guard byte ranges can be dynamically added, protection raised and lowered
    or released as the transaction proceeds.

    I believe under the hood my implementation is mostly different.
    My ATX has transaction management distributed to all nodes,
    your ESM is centralized. ATX negotiates the transaction guard range
    collision winner dynamically as the transaction proceeds so that if there
    is contention only the winner makes it to a COMMIT and losers abort,
    whereas ESM collects all the changes and makes a bulk decision at the end
    on whether there was interference and who should win or lose.

    The commonality in implementation is buffering the updates outside the
    cache so the transaction is not sensitive to cache associativity evictions
    as RTM was. ESM uses fully assoc miss buffers, I was thinking ATX would
    have a fully assoc index but borrow line space from L1 to allow a larger
    transaction member line set (I wanted 16 lines as a minimum).

    16 lines but only 1 read set (start:end) and 1 write set (start:end) ??

    There can be as many guard byte ranges as you want and can straddle
    multiple cache line boundaries as long as the total bytes under
    transaction guard protection is 16 cache lines (of 64-bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object, then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the transaction.

    The number 16 comes from wanting up to 8 smallish objects in a transaction, with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)

    That 16 sets the size of the CAM and number of cache line buffers holding pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY


    I see, you are using an instruction to mark each state transition--
    whereas I use edge-detection (side effect) of a standard instruction.

    Does this not necessarily increase the minimum path length ??

  • From EricP@21:1/5 to MitchAlsup on Sun Dec 31 11:26:23 2023
    MitchAlsup wrote:
    EricP wrote:

    There can be as many guard byte ranges as you want and can straddle
    multiple cache line boundaries as long as the total bytes under
    transaction guard protection is 16 cache lines (of 64-bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object,
    then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the
    transaction.

    The number 16 comes from wanting up to 8 smallish objects in a
    transaction,
    with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)

    That 16 sets the size of the CAM and number of cache line buffers holding
    pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY


    I see, you are using an instruction to mark each state transition--
    whereas I use edge-detection (side effect) of a standard instruction.

    Does this not necessarily increase the minimum path length ??

    Not sure what you mean by minimum path length.
    The number of instructions probably has the least effect on performance.

    Often transactions are just moving memory locations about with little
    or no calculation, so the majority of performance effects will be due
    to coherence messaging to negotiate guards and move cache lines about.
    In some cases this can be overlapped, in others not.

    I was able to throw together a simulator to test the validity of the
    guard protocol handshake and it does work. But that was in isolation.
    To test ATX performance would require a full multi-core OoO simulator
    with a Load Store Queue, as my transaction manager interacts with the LSQ,
    and cache coherence message simulation, and I don't have that.

    Some issues I see that could affect transaction performance:

    1) I have validated the ATX protocol and it has one important optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence messages
    will be broadcast to all nodes in a system, and all will reply Grant/Deny.

    This is intentional as it keeps the ATX coherence messages completely
    separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    Since a directory controller knows which nodes have copies of lines
    in what shared/exclusive state it might be able to optimize away much
    of the ATX messaging. However that would require integrating ATX protocol
    with the directory controller to also track guard requests for lines.

    2) Since the Atomic Transaction Manager (ATM) knows the guard range
    and whether it is for read or write, it can optimize cache transfers to
    upgrade a read_share cache line to read_exclusive, and eliminate the
    transitory line share state that occurs for some shared lines.

    That would eliminate a whole set of handshake messages that may now occur
    to transfer a line in a shared state, then another to make it exclusive.

    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO altogether so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    Possibly the amount of OoO allowed could be an option on the ATSTART instruction so the user can decide based on their algorithm.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from touching a cache line
    too soon and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute
    concurrently while the guard request is pending, allowing the transfer
    of cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.
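
    Continuing the hypothetical C wrappers from the earlier sketch, one
    plausible mix for a tree walk (the ATX_SYNC/ATX_ASYNC option encodings
    are assumed; the real option bits are not specified here):

    #include <stddef.h>

    enum { ATX_SYNC = 0, ATX_ASYNC = 1 };            /* assumed encodings */
    extern void atgrdr(const void *addr, size_t bytes, unsigned options);

    struct tree_node { struct tree_node *left, *right; long key; };
    struct tree      { struct tree_node *root; };

    static void guard_for_lookup(struct tree *t, struct tree_node *next_hop)
    {
        /* Hot, contended root pointer: synchronous, so following loads wait
           for the grant and we do not yank the line away from its current
           owner-updater too soon. */
        atgrdr(&t->root, sizeof t->root, ATX_SYNC);

        /* A node we will only inspect a little later: asynchronous, letting
           the line transfer overlap with the guard negotiation. */
        atgrdr(next_hop, sizeof *next_hop, ATX_ASYNC);
    }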

  • From MitchAlsup@21:1/5 to EricP on Sun Dec 31 17:10:04 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    There can be as many guard byte ranges as you want and can straddle
    multiple cache line boundaries as long as the total bytes under
    transaction guard protection is 16 cache lines (of 64-bytes each).

    The intent is that you issue guards giving the object address and size,
    do some protected loads and stores, then add new guards as new objects
    join the transaction, do more protected loads and stores, and so on.
    Then commit all the updates and release the guards.

    You can request a Read Guard on an object byte range, evaluate an object,
    then upgrade that range to a Write Guard and make changes,
    or release the guards on a range to remove an object from the
    transaction.

    The number 16 comes from wanting up to 8 smallish objects in a
    transaction,
    with each object possibly straddling two cache lines.
    Eg an AVL tree node with left, right, parent pointers and depth count.
    (I didn't want users to have to worry about whether their objects
    straddle cache line boundaries.)

    That 16 sets the size of the CAM and number of cache line buffers holding
    pending byte updates plus other structures in the transaction manager.

    The transaction manager breaks each guard range request into a series
    of up to 16 cache line byte ranges

    My Atomic Transaction instructions are:

    // Start a transaction attempt, remember abort RIP
    // Option is to be notified after collision winner commits
    ATSTART abort_offset [,options]

    // Guard a byte range for read
    // Options are synchronous or asynchronous
    // Synchronous blocks LD and ST instructions in the guard range from
    // reading a cache line until the guard grant has been negotiated.
    // Asynchronous does not block LD and ST but may cause ping-pongs.
    ATGRDR address, byte_count [,options]

    // Guard a byte range for write
    // Options are synchronous or asynchronous
    ATGRDW address, byte_count [,options]

    // Release a guarded byte range from the transaction
    ATGREL address, byte_count

    // Commit transaction updates and release all guards
    ATCOMMIT status_reg

    // Cancel transaction, toss write-guarded updates, release guards
    ATCANCEL status_reg

    // Trigger an abort, toss write-guarded updates, jump to abort address
    // Can pass an immediate byte value to the transaction status
    ATABORT #imm8

    // Read status of current transaction, if any.
    // After an abort the status contains info on reason for the abort.
    ATSTATUS status_reg

    // Wait for commit notify from winner of a collision
    ATWAITNFY


    I see, you are using an instruction to mark each state transition--
    whereas I use edge-detection (side effect) of a standard instruction.

    Does this not necessarily increase the minimum path length ??

    Not sure what you mean by minimum path length.
    The number of instructions probably has the least effect on performance.

    Often transactions are just moving memory locations about with little
    or no calculations, so the majority of performance effects will be due
    to coherence messaging, to negotiate guards and move cache lines about.
    In some cases this can be overlapped, others not.

    Another cause of delay is the conversion between causal consistency outside
    of an event and sequential consistency within an event. {Should you
    choose to do this}

    I was able to throw together a simulator to test the validity of the
    guard protocol handshake and it does work. But that was in isolation.
    To test ATX performance would require a full multi-core OoO simulator
    with Load Store Queue, as my transaction manager interacts with LSQ,
    and cache coherence message simulation, and I don't have that.

    Some issues I see that could affect transaction performance:

    1) I have validated the ATX protocol and it has one important optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence messages
    will be broadcast to all nodes in a system, and all will reply Grant/Deny.

    This is intentional as it keeps the ATX coherence messages completely separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    All I added to the coherence protocol is NAK and its use is restricted
    to ATOMIC events; then later I added priority compare so that higher
    priority events are not NAKed by lower priority events. Thus, the
    protocol is the same except for NAKs.

    Since a directory controller knows which nodes have copies of lines
    in what shared/exclusive state it might be able to optimize away much
    of the ATX messaging. However that would require integrating ATX protocol with the directory controller to also track guard requests for lines.

    2) Since the Atomic Transaction Manager (ATM) knows the guard range
    and whether for read or write, it can optimize cache transfers to
    upgrade read_share cache line to a read_exclusive, and eliminate the transitory line share state that occurs for some shared lines.

    Yes, I do some of this too, with a change in flavor: the first pass
    through an ATOMIC event the locked LDs are sent out with Intent to Modify.
    Should interference occur and fail the event, the subsequent locked LDs
    are sent out without intent, and when the data does arrive a coherent
    invalidate is sent out. This mimics test_and_test_and_set() without doing
    any more than test_and_set().
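
    For reference, the software pattern being mimicked looks like this in
    C11 (generic background code, not ESM itself):

    #include <stdatomic.h>

    static atomic_int lock = 0;            /* 0 = free, 1 = held */

    /* test_and_test_and_set: spin on a plain load (the line can stay shared
       among all waiters) and only attempt the exchange -- which needs the
       line exclusive -- once the lock looks free. A bare test_and_set loop
       would bounce the line between waiters on every iteration. */
    static void tatas_acquire(void)
    {
        for (;;) {
            while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
                ;                                          /* read-only spin */
            if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
                return;                                    /* won the lock */
        }
    }

    static void tatas_release(void)
    {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }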

    That would eliminate a whole set of handshake messages that may now occur
    to transfer a line in a shared state, then another to make it exclusive.

    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO all together so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    If you allow speculative branches in an event, you will need a way to
    request a cache line and then not use it if that request has become
    OoO with respect to the sequentially consistent memory order produced
    by this processor. I figured out how to solve this circa 1991 so I
    don't consider it a stumbling block.

    Possibly the amount of OoO allowed could be an option on the ATSTART instruction so the user can decide based on their algorithm.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Do you envision mixing and matching synch and asynch guards ?

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from too soon touching a
    cache line and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute concurrent while the guard request is pending, allowing the transfer
    of a cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    So can too little associativity in your data cache.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.

    So, yes to my question above.

    Why not allow speculative branches to cover asynch accesses to guarded
    lines ??

  • From Anton Ertl@21:1/5 to EricP on Sun Dec 31 18:02:36 2023
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO all together so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    I am not sure what scenario you have in mind, but it seems to involve
    requesting a cache line of a different core. Note that fixing Spectre
    requires that while such a request is speculative, it must not change
    the state of a remote cache line; otherwise this would constitute a
    side channel out of the speculative state. So you will certainly not
    see cache ping-ponging from properly implemented speculative accesses,
    whether inside a transaction or not.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Anton Ertl on Sun Dec 31 14:54:47 2023
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    3) I believe an OoO core will have to shut off conditional branch
    speculation inside transactions as speculating guard requests
    or loads or stores could cause unnecessary transaction aborts
    or cache line ping-ponging. It may have to go beyond that and
    shut off OoO all together so that the registers are written in
    a predictable order if an asynchronous abort is triggered.

    I am not sure what scenario you have in mind, but it seems to involve
    requesting a cache line of a different core. Note that fixing Spectre
    requires that while such a request is speculative, it must not change
    the state of a remote cache line; otherwise this would constitute a
    side channel out of the speculative state. So you will certainly not
    see cache ping-ponging from properly implemented speculative accesses,
    whether inside a transaction or not.

    - anton

    I'm noting that speculation and transactions may interact badly. Implementations may need some mechanism to limit it,
    but that can affect concurrency and performance.

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    If one wants to play in that market, yes full Spectre protection could be
    a sales feature for a cpu model. But that is an expensive overkill for
    most situations.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables are separated by user and super mode,
    and OSes are advised to purge the user-mode tables on thread switch.

    - Adding a conditional branch hint NoSpeculate which basically causes the
    front end Dispatcher to stall and single step that one branch until the test
    condition resolves. Stalling at Dispatch allows the front end to continue
    to fill with the predicted code path but does not allow any of it to execute.

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    If I had the NoSpeculate branch hints then transaction users would
    be advised to use them. Otherwise the transaction mechanism would
    have to automatically shut off all speculation.

  • From Anton Ertl@21:1/5 to EricP on Mon Jan 1 08:05:52 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I'm noting that speculation and transactions may interact badly.
    Implementations may need some mechanism to limit it,
    but that can affect concurrency and performance.

    Even if you want to build a Spectre-vulnerable CPU that actually
    changes cache states during speculation, I don't expect much effect on performance, because the branch predictor is usually right. And if it
    isn't, and the speculative access actually interferes with a
    transaction on another core, it will learn that that path was wrong
    and will stop the wrong speculation after one or two tries.

    OTOH, if you want to build a Spectre-immune CPU by delaying the state
    change until the memory access is committed, that won't hurt the
    performance much, either, because it's rare to need state changes, and
    because the waiting time is typically around 20 cycles or so, which is
    small compared to the time needed to access and change the state of a
    remote cache line.

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    This widespread belief is what has caused CPU manufacturers to not
    work on fixing Spectre.

    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the
    network?

    But that is an expensive overkill for
    most situations.

    It would not be expensive (see below), and it's not overkill.

    While software vulnerabilities may be plenty and relatively easy to
    use, they can be fixed at any time, or the intended victim of an
    attack may use a different software. Meanwhile, hardware
    vulnerabilities like Spectre and Rowhammer are always there while the
    hardware is not replaced with fixed hardware (and the inaction of
    hardware manufacturers ensures that no such replacement exists, and
    even when it exists, it will take many years until most of the
    hardware is replaced), so they are very attractive to attackers.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    While being expensive, this approach does not make the CPU
    Spectre-immune. A thread can be attacked from within itself (e.g.,
    from a JavaScript program running in the same thread), and the
    attacker can train the branch predictor to do the attacker's bidding
    by passing the appropriate data to system calls or as input data to a user-level processing program.

    - Adding a conditional branch hint NoSpeculate which basically causes the
    front end Dispatcher to stall and single step that one branch until the test
    condition resolves. Stalling at Dispatch allows the front end to continue
    to fill with the predicted code path but does not allow any to execute.

    Yes, if you disable speculation by using that for all branches, it
    will help against Spectre. But the slowdown will be huge, slowing the
    CPU down almost to in-order levels (e.g., a Cortex-A55 or Bonnell).

    OTOH, if you want to apply this slowdown selectively, the question is
    where, and whether the remaining speculation leaves the window open
    for an attacker to use Spectre. The Linux kernel is trying to use
    selective software mitigation, and of course a remaining hole was
    found (and used by a security researcher for demonstrating not just
    that hole, but something else; that's how I learned of it), so this
    approach is anything but watertight.

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence, costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap
    instructions to be at least as expensive.
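
    As a generic illustration of the transform being priced here (a sketch of
    the SLH idea, not the exact code any particular compiler emits): the
    bounds check is recomputed as data and used to mask the index, so a
    mispredicted speculative path can only load a benign address.

    #include <stddef.h>
    #include <stdint.h>

    /* Unhardened: under branch misprediction the load may execute
       speculatively with an out-of-bounds i, leaving a cache footprint. */
    uint8_t load_checked(const uint8_t *a, size_t len, size_t i)
    {
        if (i < len)
            return a[i];
        return 0;
    }

    /* SLH-style sketch: the comparison result also feeds a mask, turning
       the control dependence into a data dependence; if the branch was
       mispredicted the mask is zero and only a[0] can be touched. */
    uint8_t load_hardened(const uint8_t *a, size_t len, size_t i)
    {
        if (i < len) {
            size_t mask = (size_t)0 - (size_t)(i < len); /* all-ones if in bounds */
            return a[i & mask];
        }
        return 0;
    }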

    By contrast, a proper invisible-speculation fix would be much cheaper
    in performance: papers on the memory access part of invisible
    speculation give slowdown factors of up to 1.2 (with some papers giving
    smaller slowdowns and even occasional speedups), and I think that the
    memory-access part will have the biggest performance impact among the
    changes necessary for a full-blown invisible-speculation fix.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Quadibloc@21:1/5 to Anton Ertl on Mon Jan 1 14:19:19 2024
    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    This widespread belief is what has caused CPU manufacturers to not work
    on fixing Spectre.

    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the network?

    Here, you are completely correct.

    Spectre, and its cousins like Rowhammer, are hardware vulnerabilities
    that can't be eradicated only in software, which makes them very serious.

    In the early days of computing, computer viruses were something you
    could get by running and installing software, and so it was relatively
    easy to practice hygienic computing.

    Today, though, our web browsers, E-mail clients, word processors, and
    numerous other programs invisibly execute code. As well, the common
    buffer overflow vulnerability has allowed computers to be taken over
    through Internet-facing applications that do not execute externally
    supplied code, and which, therefore, are not such as to normally be
    suspected of being dangerous.

    Given that what I have read claims that it is _not_ possible to completely
    protect against Spectre and its variants without a significant degradation
    of performance, my favored solution is to divide the CPU into two parts:
    one made immune to Spectre, which runs Internet-facing code that might be
    menaced by it, and another in which mitigations are not applied.

    This solution, though, basically has not been considered, and that is for
    an obvious reason: it is insecure. Once malicious software has found a way through some other vulnerability to insinuate itself into the "trusted"
    code of the computer, then it won't be stopped from making use of Spectre
    to further its progress.

    So it's not enough to put the Internet in a sandbox, we need new and better ideas about how to put a secure wall around that sandbox - to securely
    limit how it interacts with the rest of the computer.

    John Savard

  • From Anton Ertl@21:1/5 to Quadibloc on Mon Jan 1 15:15:27 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    Given that what I have read claims that it is _not_ possible to completely
    protect against Spectre and its variants without a significant degradation
    of performance,

    Depends on what you mean by "protect" and "significant". The work on
    invisible speculation (a proper fix) reports slowdowns like (for the
    memory access component, which appears to have the biggest influence
    on performance) a factor 1.2 for a slower variant, or, for a faster
    variant, IIRC between a slowdown by 1.06 and a speedup of 1.04. By
    contrast, a software mitigation like speculative load hardening
    produces a slowdown factor 2.3-2.5, and that protects only against
    Spectre v1.

    my favored solution is to divide the CPU into two parts;
    one made immune to Spectre, which runs Internet-facing code which might be
    menaced by it, and another in which mitigations are not applied.

    You can do that now, by buying an RK3588-based SBC (e.g., a Radxa
    Rock5B), which has 4 Cortex-A76 (OoO) and 4 Cortex-A55 (in-order)
    cores, and then run something like QubesOS on that, and use only the
    A55 for the Internet-facing stuff. Except that QubesOS for now only
    works on AMD64.

    Note that the A55 is more than three times slower than the A76
    (numbers are times in seconds):

    - Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105
    - Rock 5B (2257MHz A76) Debian 11 (texlive-latex-recommended) 0.638

    However, differentiating between what is "internet-facing" and what is
    not is something you don't want to task a layman with. It's too easy
    to falsely classify something as "not internet-facing" when in reality
    it's a file that came from the 'net and might have been tampered with
    by an attacker.

    Anyway, we have not seen a surge in such systems. In particular, the
    Raspi4 and Raspi5, which probably could have gone for such a
    big.LITTLE design, with the LITTLE component being Spectre-immune,
    both went for big-only designs. And on the OS side, we have not seen
    attempts at isolation beyond QubesOS, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Anton Ertl on Mon Jan 1 12:43:45 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I'm noting that speculation and transactions may interact badly.
    Implementations may need some mechanism to limit it,
    but that can affect concurrency and performance.

    Even if you want to build a Spectre-vulnerable CPU that actually
    changes cache states during speculation, I don't expect much effect on performance, because the branch predictor is usually right. And if it
    isn't, and the speculative access actually interferes with a
    transaction on another core, it will learn that that path was wrong
    and will stop the wrong speculation after one or two tries.

    OTOH, if you want to build a Spectre-immune CPU by delaying the state
    change until the memory access is committed, that won't hurt the
    performance much, either, because it's rare to need state changes, and because the waiting time is typically around 20 cycles or so, which is
    small compared to the time needed to access and change the state of a
    remote cache line.

    I'm not overly concerned about Spectre. The people who are potentially
    affected are time-share services (cloud servers) that do not control
    what programs are running concurrently on their processors.

    This widespread belief is what has caused CPU manufacturers to not
    work on fixing Spectre.

    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at NetSpectre, has everybody disconnected their computers from the
    network?

    There are many thousands of academic papers on speculative execution attacks.
    I cannot find a single example of even an attempt in the real world.

    On the other hand, there are many successful phishing attacks each day.

    In other words, if you could somehow fix all speculative execution vulnerabilities, it would have zero impact on the actual successful
    security breaches.

    But that is an expensive overkill for
    most situations.

    It would not be expensive (see below), and it's not overkill.

    While software vulnerabilities may be plenty and relatively easy to
    use, they can be fixed at any time, or the intended victim of an
    attack may use a different software. Meanwhile, hardware
    vulnerabilities like Spectre and Rowhammer are always there while the hardware is not replaced with fixed hardware (and the inaction of
    hardware manufacturers ensures that no such replacement exists, and
    even when it exists, it will take many years until most of the
    hardware is replaced), so they are very attractive to attackers.

    Rowhammer is different - it's a memory corruption hardware error.

    I'm not convinced that speculative execution leaks can all be fixed.
    They are finding new mechanisms every day.
    (uOp caches now need to be flushed on thread switch,
    function unit or register port contention as a side channel,
    speculative load forwarding attacks).

    I think it will be an ongoing game of whack-a-mole.

    Just get rid of the low hanging fruit - the retention of branch predictors across security domain thread/process switches.

    And as far as I can tell these S.E. attacks are not attractive at all,
    in that I can find no reports of them at all in the real world.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism.
    But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    While being expensive, this approach does not make the CPU
    Spectre-immune. A thread can be attacked from within itself (e.g.,
    from a JavaScript program running in the same thread), and the
    attacker can train the branch predictor to do the attacker's bidding
    by passing the appropriate data to system calls or as input data to a user-level processing program.

    Javascript is not a HW security domain.
    It is the responsibility of the Javascript VM peddlers to ensure their
    runtime environment is secure, as they appear to have done.

    - Adding a conditional branch hint NoSpeculate which basically causes the
    front end Dispatcher to stall and single step that one branch until the test condition resolves. Stalling at Dispatch allows the front end to continue
    to fill with the predicted code path but does not allow any to execute.
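
    For concreteness, here is a rough sketch of where such a hint would sit;
    the NOSPEC marker is purely illustrative (there is no such C construct),
    it just labels the one branch the Dispatcher would single step:

        #include <stddef.h>

        /* Illustrative only: "NOSPEC" stands in for the proposed NoSpeculate
           branch hint.  The hinted branch stalls at Dispatch until its
           condition resolves, so the dependent load below cannot issue
           speculatively, while every other branch keeps speculating. */
        int checked_read(const int *array, size_t len, size_t i)
        {
            if (i < len) {          /* NOSPEC: nothing past this branch executes
                                       until the compare has resolved */
                return array[i];    /* no longer usable as a Spectre v1 gadget */
            }
            return -1;
        }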

    Yes, if you disable speculation by using that for all branches, it
    will help against Spectre. But the slowdown will be huge, slowing the
    CPU down almost to in-order levels (e.g., a Cortex-A55 or Bonnell).

    OTOH, if you want to apply this slowdown selectively, the question is
    where, and if the remaining speculation does not leave the window open
    for an attacker to use Spectre. The Linux kernel is trying to use
    selective software mitigation, and of course a remaining hole was
    found (and used by a security researcher for demonstrating not just
    that hole, but something else, that's how I learned of it), so this
    approach is anything but watertight.

    The question is how many IF statements are guarding leak vulnerable
    code pathways *when the attackers branch predictor controls are removed*?
    Can these IF's be automatically identified?

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap instructions to be at least as expensive.
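
    For reference, the SLH transformation looks roughly like this at the
    source level (a hand sketch only: real SLH is a compiler pass, e.g.
    LLVM's, precisely so the recomputed compare and mask cannot be
    optimized away):

        #include <stddef.h>
        #include <stdint.h>

        uint8_t probe[256 * 64];   /* stand-in for whatever the gadget touches */

        /* The bounds check is kept, but the index is also ANDed with a mask
           that is all-ones only when the check really passed; on the
           misspeculated path the mask is 0, so the load reads index 0
           instead of attacker-chosen memory. */
        uint8_t slh_read(const uint8_t *array, size_t len, size_t i)
        {
            if (i < len) {
                uintptr_t mask = -(uintptr_t)(i < len);  /* ~0 if in bounds, else 0 */
                size_t safe_i = i & mask;                /* 0 under misspeculation */
                return probe[array[safe_i] * 64];        /* would-be Spectre v1 gadget */
            }
            return 0;
        }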

    I'm not sure - it depends on the frequency of occurrence and
    the latency between Dispatch and branch condition resolution.

    Based on nothing, I'm assuming both to be small :-)

    By contrast, a proper invisible-speculation fix would be much cheaper
    in performance (papers on the memory access part of invisible
    speculation give slowdown factors up to 1.2 (with some papers giving
    smaller slowdowns and even occasional speedups), and I think that the memory-access part will have the biggest performance impact among the
    changes necessary for a full-blown invisible-speculation fix.

    - anton

    How much extra complexity and hardware does it take to essentially
    queue all internal state changes throughout a core and all its caches
    until speculative branches resolve?

    And how many stalls will that queuing latency introduce?
    Remember that it can't handle a cache miss for any load that is
    currently in the shadow of an unresolved conditional branch as that
    would allow coherence traffic to escape to the rest of a system.
    And if two cores each have their own D$L1 but a shared D$L2,
    then they can't even speculate a cache miss from L1 to L2.
    Or speculatively prefetch alternate path instructions because
    that could change the state of a cache line from exclusive to shared.

    And since I have no faith that this will actually fix the problem
    but rather just move it someplace else, I see this as asking me to
    spend a lot of time and money on a fix that isn't really a fix for
    something that isn't (so far) actually a real world problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Jan 1 19:39:07 2024
    Quadibloc wrote:

    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    Given that what I have read claims that it is _not_ possible to completely protect against Spectre and its variants without a significant degradation
    of performance,

    I argue that one can design a processor that loses no performance and remains Spectré (and most other current attack strategies) immune.

    my favored solution is to divide the CPU into two parts;
    one made immune to Spectre, which runs Internet-facing code which might be menaced by it, and another in which mitigations are not applied.

    This solution, though, basically has not been considered, and that is for
    an obvious reason: it is insecure. Once malicious software has found a way through some other vulnerability to insinuate itself into the "trusted"
    code of the computer, then it won't be stopped from making use of Spectre
    to further its progress.

    So it's not enough to put the Internet in a sandbox, we need new and better ideas about how to put a secure wall around that sandbox - to securely
    limit how it interacts with the rest of the computer.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Mon Jan 1 14:17:36 2024
    MitchAlsup wrote:
    EricP wrote:

    1) I have validated the ATX protocol and it has one important
    optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence
    messages
    will be broadcast to all nodes in a system, and all will Grant/Deny
    reply.

    This is intentional as it keeps the ATX coherence messages completely
    separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    All I added to the coherence protocol is NAK and its use is restricted
    to ATOMIC events; then later I added priority compare so that higher
    priority events are not NAKed by lower priority events. Thus, the
    protocol is the same except for NAKs.

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the shared/exclusive/owned line state and transfer of whole lines into, out of, and between caches. However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talks to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Only at commit does ATM send guarded line updates to its local cache,
    which does its normal thing to check if line is present in an exclusive/modified/owned state, and if not does a read_exclusive.
    And the cache controller knows nothing about transactions,
    it just sees a burst of updates to local cache.

    When cache lines move is a matter of performance optimization.
    It can happen all at commit, or gradually and concurrently as
    a transaction proceeds in anticipation of a commit.

    This is also why my transactions are not sensitive to cache associativity
    and conflict evicts - because it does not use cache evicts or invalidates
    to trigger transaction aborts.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Do you envision mixing and matching synch and asynch guards ?

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from too soon touching a
    cache line and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute
    concurrently while the guard request is pending, allowing the transfer
    of cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    So can too little associativity in your data cache.

    Yes but at least it doesn't cause transaction aborts, as RTM does.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.
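
    For example (atx_guard()/atx_commit() below are made-up stand-ins for
    the ATX instructions, and the stubs exist only so the sketch compiles),
    a stack push might guard the contended head pointer synchronously and
    the still-private node asynchronously:

        #include <stddef.h>
        #include <stdbool.h>

        /* Hypothetical intrinsics for the guard instructions described above;
           these stubs only make the sketch compile - the real work would be
           done by the Atomic Transaction Manager in hardware. */
        enum { ATX_READ = 1, ATX_WRITE = 2, ATX_SYNC = 4, ATX_ASYNC = 8 };
        static inline void atx_guard(volatile void *a, size_t n, int flags)
        { (void)a; (void)n; (void)flags; }
        static inline bool atx_commit(void) { return true; }

        typedef struct node { struct node *next; int val; } node_t;

        /* The contended head pointer gets a synchronous write guard (acts as
           a membar: don't touch the line until the grant arrives), while the
           node we are inserting gets an asynchronous guard so its line fetch
           can overlap with the guard negotiation. */
        bool push(node_t *volatile *head, node_t *n)
        {
            atx_guard((volatile void *)head, sizeof *head, ATX_WRITE | ATX_SYNC);
            atx_guard(&n->next, sizeof n->next, ATX_WRITE | ATX_ASYNC);
            n->next = *head;
            *head = n;
            return atx_commit();
        }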

    So, yes to my question above.

    Why not allow speculative branches to cover asynch accesses to guarded
    lines ??

    Because I didn't think of it. So far I have been mostly concerned with
    how to make it work at all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Jan 1 19:51:09 2024
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    1) I have validated the ATX protocol and it has one important
    optimization,
    but not given much thought to optimizations that a directory controller
    might help with. Currently it assumes that guard request coherence
    messages
    will be broadcast to all nodes in a system, and all will Grant/Deny
    reply.

    This is intentional as it keeps the ATX coherence messages completely
    separate from cache coherence messages, and that is important because
    it means you don't have to re-validate your coherence protocol
    or change the cache subsystem or directory controller.

    All I added to the coherence protocol is NAK and its use is restricted
    to ATOMIC events; then later I added priority compare so that higher
    priority events are not NAKed by lower priority events. Thus, the
    protocol is the same except for NAKs.

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the shared/exclusive/owned line state and transfer of whole lines into, out of, and between caches. However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to talk to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same STM.}}

    Only at commit does ATM send guarded line updates to its local cache,
    which does its normal thing to check if line is present in an exclusive/modified/owned state, and if not does a read_exclusive.
    And the cache controller knows nothing about transactions,
    it just sees a burst of updates to local cache.

    When cache lines move is a matter of performance optimization.
    It can happen all at commit, or gradually and concurrently as
    a transaction proceeds in anticipation of a commit.

    This is also why my transactions are not sensitive to cache associativity
    and conflict evicts - because it does not use cache evicts or invalidates
    to trigger transaction aborts.

    4) Guard requests can be either synchronous or asynchronous.
    The choice does not affect transaction validity, just performance.

    Do you envision mixing and matching synch and asynch guards ?

    Synchronous means the guard acts as a membar to following loads
    and stores to the guarded bytes while the guard is negotiated.
    Synchronous is intended to prevent you from too soon touching a
    cache line and grabbing it away from its current owner-updater.

    Asynchronous allows following ld/st to the guarded range to execute
    concurrently while the guard request is pending, allowing the transfer
    of cache lines to overlap with guard negotiation.
    Asynchronous could cause cache lines to ping-pong.

    So can too little associativity in your data cache.

    Yes but at least it doesn't cause transaction aborts, as RTM does.

    The choice of synchronous or asynchronous is dependent on the algorithm
    and can be different for different objects in a transaction.

    So, yes to my question above.

    Why not allow speculative branches to cover asynch accesses to guarded
    lines ??

    Because I didn't think of it. So far I have been mostly concerned with
    how to make it work at all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Jan 1 20:10:11 2024
    EricP wrote:

    How much extra complexity and hardware does it take to essentially
    queue all internal state changes throughout a core and all its caches
    until speculative branches resolve?

    For an Opteron-scale processor:: Not That Much--you end up using
    the ReOrder Buffer for little tweaks to CPU state, and you use the
    Miss Buffer for latent changes to TLB and Caches, leaving only the
    branch predictor stuff. Since these change more slowly than instruction
    state you can probably add a few bits to RoB checkpointing to
    cover the branch predictor updates.

    And how many stalls will that queuing latency introduce?

    None (<1%) if the buffering is done correctly.

    Remember that it can't handle a cache miss for any load that is
    currently in the shadow of an unresolved conditional branch as that
    would allow coherence traffic to escape to the rest of a system.

    That is a requirement for Sequential Consistency but not for Causal Consistency.

    And if two cores each have their own D$L1 but a shared D$L2,
    then they can't even speculate a cache miss from L1 to L2.

    Careful, here. It is possible to speculatively access another CPU's
    cache; and before the data arrives, branch recovery cancels the reason
    why that line is showing up. Here, you CAN send the line back to the provider
    or on to DRAM even if you cannot deposit it in your cache.
    {{The OWNED state (MOESI) creates this requirement.}}

    So, under CC you CAN speculatively do this; what you cannot do is simply
    forget, as one does when repairing mispredictions, and you need strategies
    for each thing you may have inbound that will not ultimately be used
    due to misspeculation.

    Or speculatively prefetch alternate path instructions because
    that could change the state of a cache line from exclusive to shared.

    You can't put HW prefetches in the CPU caches, primarily because you
    cannot lose the line (and its state) being reallocated for the arriving line.
    But you COULD hold onto that line in something like the Miss Buffer
    until retirement.

    You can put SW prefetches in a CPU cache after the prefetch instruction retires.

    And since I have no faith that this will actually fix the problem
    but rather just move it someplace else, I see this as asking me to
    spend a lot of time and money on a fix that isn't really a fix for
    something that isn't (so far) actually a real world problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Tue Jan 2 04:29:00 2024
    On Mon, 01 Jan 2024 19:39:07 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Given that what I have read claims that it is _not_ possible to
    completely protect against Spectre and its variants without a
    significant degradation of performance,

    I argue that one can design a processor that loses no performance and remains Spectré (and most other current attack strategies) immune.

    You may be right; I'm not going to argue with that claim. I'm only
    saying that it is _generally believed_ that it isn't possible to do
    that well in dealing with Spectre and related vulnerabilities.

    If you're right, then all we need to do is make CPUs secure after
    the fashion you recommend.

    But if the naysayers are right?

    In that depressing scenario, I think that I have finally come up
    with what is needed.

    A general purpose computer, that needs to be able to execute
    software locally with high performance, but which can also
    access the Internet?

    I think that what it needs is a _double_ sandbox.

    You have the primary computer, which is built for speed. Only
    security measures with next to no overhead are incorporated in
    it. But the primary computer can't talk to the Internet.

    Instead, connecting to the Internet is the job of a secondary
    computer. This computer is permanently hardwired to be unable
    to write to the only portion of memory that it is allowed to
    execute code from - like the early Bell Telephone Electronic
    Switching System.

    So it doesn't load programs into memory from disk. That is done
    by the primary computer, which _can_ write to the memory that the
    secondary computer can execute code from.

    The secondary computer has as its major security feature that it
    can't ever alter any executable code. (It also has no direct
    access to the hard disk either.) It can be a high-performance
    computer as well. The web browser, and other Internet-facing applications
    run on the secondary computer.

    Except: since it can't load or modify executable code, it cannot
    do just-in-time compilation, the only efficient way to run an
    interpreter. So any executable content from web sites and so on,
    like JavaScript, goes somewhere else.

    And the sandbox for _that_ is the *ternary* computer. Not ternary
    like the SETUN, but just that it is the _third_ computer, after the
    primary and secondary ones.

    This computer is secured against Spectre and Rowhammer... by being
    built from 486-era technology. The CPU is in-order, so no Spectre.
    The clock rate is slow enough so that the memory is not vulnerable
    to Rowhammer-style attacks.

    When the secondary computer finds executable content in web sites,
    it doesn't pass it up to the primary computer, it passes it down
    to the ternary computer, which serves as a high-quality sandbox,
    being physically separate, and having secure hardware.

    So a malicious web site trying to use JavaScript to do bad things
    to this kind of computer... finds the JavaScript is running on a
    processor without the seemingly unavoidable Spectre and Rowhammer vulnerabilities. It *did* avoid them, completely.

    And if it finds some other way to influence the computer above the
    sandbox? That computer is a desert for attackers, as it contains no
    means of altering any code which it executes. So almost any attack
    is impossible there!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Tue Jan 2 07:59:37 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    But is it true? Has everybody disabled JavaScript and Webassembly in
    their browser and their PDF viewer, and disabled macros in their
    Spreadsheet, Word processor and presentation program? And, looking at
    NetSpectre, has everybody disconnected their computers from the
    network?

    There are many thousands of academic papers on speculative execution attacks. I cannot find a single example of even an attempt in the real world.

    Maybe you were not looking? E.g., read <https://dustri.org/b/spectre-exploits-in-the-wild.html>.

    But, you might say, the publication of this exploit does not show that
    Spectre has been used in the wild.

    But attackers usually do not announce how they got at your secret
    keys.

    On the other hand, there are many successful phishing attacks each day.

    Yes, so?

    In other words, if you could somehow fix all speculative execution vulnerabilities, it would have zero impact on the actual successful
    security breaches.

    How do you know that? If the intended victim of a phishing attack
    does not bite, and does not use the software that the attacker has
    exploits for, what does the attacker do? Let's say the attacker uses
    a Spectre exploit to get at a private key. The victim may not even
    notice this; but if the attacker later performs a ransomware attack
    that makes use of the secret key, how will the attack be explained?
    Given the myth that there are no real-world Spectre attacks, the
    investigation will settle on an unknown security breach, or, because
    "unknown" is so uncomfortable, on PEBCAK (e.g., a phishing attack).

    Rowhammer is different - it's a memory corruption hardware error.

    The commonalities with Spectre are that both are hardware errors, and
    fixed hardware for both could have been released since their
    discovery, but that has not happened.

    I'm not convinced that speculative execution leaks can all be fixed.
    They are finding new mechanisms every day.
    (uOp caches now need to be flushed on thread switch,
    function unit or register port contention as a side channel,
    speculative load forwarding attacks).

    uOp caches are microarchitectural state, speculative load forwarding
    is another speculation variant.

    Whatever speculation is used (branches, exception, memory aliasing
    (speculative load forwarding), etc.), the hardware people have managed
    for three decades to avoid leaking speculative architectural state to
    committed architectural state on misspeculation (Zenbleed is the
    exception that proves the rule). They just need to apply the same
    discipline to microarchitectural state, and they will prevent Spectre
    exploits that work through microarchitectural state, whether it's
    cache, uop cache, branch predictor, etc.

    Yes, hardware also needs to avoid revealing through resource
    contention side channels what's going on in the speculative world.
    There is work on that, too. I don't have such a nice argument for why
    the hardware people will succeed with that as I have for
    microarchitectural state, but I expect that, if they put their minds
    to it, they will succeed. OTOH, the widespread idea that Spectre and especially Spectre using resource-contention side channels is
    irrelevant in the real world may prevent them from putting their mind
    to it.

    The fact that we still have Rowhammer because one "cannot find a
    single example of even an attempt in the real world", despite it being relatively easy to fix if the memory controller people accepted it as
    their responsibility to fix it, speaks for the scenario where the
    hardware people will not put their minds to fixing Spectre, not the resource-contention side channel, and not even the
    microarchitectural-state side channel.

    I think it will be an ongoing game of whack-a-mole.

    The current approach of leaving Spectre to software mitigations is
    certainly going to be that. A pervasive hardware fix covers all
    variants, because it eliminates all side channels from speculative
    state to committed state.

    Just get rid of the low hanging fruit - the retention of branch predictors across security domain thread/process switches.

    As mentioned, Spectre can be exploited even in the scenario you have
    outlined.

    I was looking for simpler and cheaper solutions that are optional and
    cover most situations.

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism. But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the
    speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without
    training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    And that's especially the case because the attacker may be able to
    help luck in the attacker's direction by passing data to the victim
    process that results in training the branch predictor of the victim in
    a specific way. E.g., a PDF document processed by a browser will
    result in a lot of branches being taken in a certain way, which will
    train the branch predictor in a certain way.

    Javascript is not a HW security domain.
    It is the responsibility of the Javascript VM peddlers to ensure their runtime environment is secure, as they appear to have done.

    The architecture manual defines what happens when certain instructions
    are executed. E.g., when you perform an architectural bounds check,
    the access does not happen. If the microarchitecture does it
    differently, it's the responsibility of the hardware people to ensure
    that this microarchitectural stuff does not leak data through side
    channels if it can be prevented; and it can.

    The security of at least one commercial operating system (for
    Burroughs large systems) is based on this concept.

    If you cannot rely on branch instructions doing what the architecture
    manual says, why should you rely on other hardware mechanisms (that
    you may or may not consider to be "HW security domain") to do what the architecture manual says? And indeed, e.g., the page-protection
    mechanism does not prevent speculative access and revealing that
    through a side channel on a lot of hardware, either (the original
    Meltdown).

    Do the software mitigations applied by the JavaScript implementors
    prevent Spectre completely? I have my doubts. As you write, there
    are a large number of Spectre variants that they have to protect
    against, including stuff like Inception that predicts branches that
    are not there.

    - I have an extensive set of conditional trap instructions intended for
    bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.

    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap
    instructions to be at least as expensive.

    I'm not sure - it depends on the frequency of occurrence and
    the latency between Dispatch and branch condition resolution.

    Based on nothing, I'm assuming both to be small :-)

    The main reason why OoO so vastly outperforms in-order for
    general-purpose code is that execution does not have to wait for the
    branches to be resolved; this is reflected in the size of the
    outstanding branches, which is, e.g., 128 for the Golden Cove <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/#gracemont-s-out-of-order-engine>
    (look for "Branch Order Buffer"). If you throw in one NoSpeculate
    branch, this reduces this number to 0 for this branch. And given the
    number of loads in a program, you probably have to make most branches "NoSpeculate" to be safe. So you fall back to close to in-order
    performance.

    By contrast, a proper invisible-speculation fix would be much cheaper
    in performance (papers on the memory access part of invisible
    speculation give slowdown factors up to 1.2 (with some papers giving
    smaller slowdowns and even occasional speedups), and I think that the
    memory-access part will have the biggest performance impact among the
    changes necessary for a full-blown invisible-speculation fix.

    - anton

    How much extra complexity and hardware does it take to essentially
    queue all internal state changes throughout a core and all its caches
    until speculative branches resolve?

    For the caches, Mitch Alsup tells us that we just need the load
    buffers that we have anyway. In any case, it's at most one cache line
    for each outstanding load. But you also get a benefit: this works
    like an L1 that is that much larger. You can reduce the buffering
    necessary by keeping only management data (a few bits) for the loads
    that hit the L1 caches. For speculative stores, that has been in the
    store buffers since the beginning of modern OoO.

    For conditional branches, you just have to remember the outcome (1
    bit) and feed it to the branch predictor when the branch commits.
    For indirect branches, you just have to remember the target address
    (64 bits).
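
    A sketch of the per-branch record this implies (field and table names
    invented here), with the predictor tables touched only at commit:

        #include <stdint.h>
        #include <stdbool.h>

        static uint8_t  bp_table[4096];   /* 2-bit counters, updated at commit only */
        static uint64_t btb_table[1024];  /* indirect targets, updated at commit only */

        typedef struct {
            uint32_t index;        /* predictor/BTB index computed at fetch */
            bool     is_indirect;  /* needs a 64-bit target, not just 1 bit */
            bool     taken;        /* outcome of a conditional branch       */
            uint64_t target;       /* target of an indirect branch          */
        } pending_bp_update_t;

        /* Called only when the branch commits; a squashed branch's record is
           simply discarded, so misspeculation leaves no trace in the predictor. */
        void commit_bp_update(const pending_bp_update_t *u)
        {
            if (u->is_indirect) {
                btb_table[u->index % 1024] = u->target;
            } else {
                uint8_t *c = &bp_table[u->index % 4096];
                if (u->taken)  { if (*c < 3) (*c)++; }
                else           { if (*c > 0) (*c)--; }
            }
        }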

    And how many stalls will that queuing latency introduce?
    Remember that it can't handle a cache miss for any load that is
    currently in the shadow of an unresolved conditional branch as that
    would allow coherence traffic to escape to the rest of a system.

    Above you argued that NoSpeculate branches are cheap. I am not
    convinced of that, but if they are, making state-changing memory accesses NoSpeculate is certainly relatively cheap. That's because a
    state-changing memory access is relatively expensive anyway, so adding
    maybe 20 cycles until the memory access is no longer speculative does not
    make it much more expensive. Also, such memory accesses are quite
    rare, with, as you say, cache misses probably being the most common
    cause.

    And if two cores each have their own D$L1 but a shared D$L2,
    then they can't even speculate a cache miss from L1 to L2.

    As long as the cache miss does not change the state of the L2 cache
    line and the resource-contention side channel is eliminated, you
    certainly can. But, yes, bigger private caches are likely to help.
    Now the trend in recent years has been towards bigger private caches
    for the P-cores; e.g., Raptor Lake has 2MB of private L2 cache.

    Actually, if it turns out to be a big issue, you can perform
    a speculative load from, e.g., a line that is exclusive to a different
    core without changing the state. That load then speculates on the
    content of the line not changing, and this speculation has to be
    verified (and the line changed to shared) when the load is about to be committed.

    Or speculatively prefetch alternate path instructions because
    that could change the state of a cache line from exclusive to shared.

    Please elaborate. What "alternative path"?

    The hardware prefetcher must only be trained on architectural memory
    accesses (to avoid Spectre), and any software prefetch instructions
    have the same speculation restrictions as the actual loads. What a
    software prefetch instruction would achieve here is to become
    architectural many cycles before the actual load and switch the cache
    line to shared and actually put the cache line into the local cache at
    that time.

    And since I have no faith that this will actually fix the problem
    but rather just move it someplace else, I see this as asking me to
    spend a lot of time and money on a fix that isn't really a fix for
    something that isn't (so far) actually a real world problem.

    A software guy can point to phishing and say the same thing about
    fixing software vulnerabilities before the vulnerability is proven to
    be exploited by real-world attackers.

    The typical approach taken in the software world is to not take claims
    of security problems seriously without a demonstrated exploit, but to
    act when an exploit has been presented.

    In the case of Spectre, a lot of exploits have been presented (you
    wrote "many thousand"), and the hardware people should have started
    fixing the hardware in 2017, when they learned about Spectre, given
    the lead time of hardware designs. Instead, they sit on their hands
    and let the users tell each other that it's Somebody Else's Problem.

    And no, it's not too difficult. The pieces have been published years
    ago.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to MitchAlsup on Tue Jan 2 16:14:57 2024
    On Mon, 1 Jan 2024 19:39:07 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

    Quadibloc wrote:

    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    Given that what I have read claims that it is _not_ possible to
    completely protect against Spectre and its variants without a
    significant degradation of performance,

    I argue that one can design a processor that loses no performance
    and remains Spectré (and most other current attack strategies) immune.

    my favored solution is to divide the CPU into two
    parts; one made immune to Spectre, which runs Internet-facing code
    which might be menaced by it, and another in which mitigations are
    not applied.

    This solution, though, basically has not been considered, and that
    is for an obvious reason: it is insecure. Once malicious software
    has found a way through some other vulnerability to insinuate
    itself into the "trusted" code of the computer, then it won't be
    stopped from making use of Spectre to further its progress.

    So it's not enough to put the Internet in a sandbox, we need new
    and better ideas about how to put a secure wall around that sandbox
    - to securely limit how it interacts with the rest of the computer.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to MitchAlsup on Tue Jan 2 16:24:52 2024
    On Mon, 1 Jan 2024 19:39:07 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

    Quadibloc wrote:

    On Mon, 01 Jan 2024 08:05:52 +0000, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    Given that what I have read claims that it is _not_ possible to
    completely protect against Spectre and its variants without a
    significant degradation of performance,

    I argue that one can design a processor that loses no performance
    and remains Spectré (and most other current attack strategies) immune.


    The most important 'current attack strategies' are the same as they were 30
    years ago. Side channels, row hammers etc. are good to write papers
    about and to scare a few people. As far as real-world threats go, they
    are of very low importance.
    If an attacker found a way to run an arbitrary binary on my computer, at my
    normal non-root privilege, he has 1000 easier ways (than side channels)
    to achieve his goals.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Wed Jan 3 10:38:58 2024
    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the
    shared/exclusive/owned
    line state and transfer of whole lines into, out of, and between caches.
    However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same STM.}}

    It's difficult because there are so many options and possible optimizations.

    In the base design (no Directory Controller optimization) a guard range
    for bytes in a single line which does not have any shared bytes
    (that is, no read-read shared or adjacent false shared bytes)
    requires one request and one reply message to each peer core in a system.
    For C cores that's 2*(C-1) msgs.

    For unshared lines new guard requests in the same line require
    no new messages.

    If a line is shared between two different transactions, either both
    read share the same bytes, or read-write or write-write adjacent bytes,
    then each new guard range instruction requires a request and reply
    BUT only with the cores that are sharing lines.
    This would likely only happen if there were multiple objects that just
    happen to reside in the same cache line and are accessed by two or more transactions.

    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network.
    This eliminates all the queuing that cache coherence messages must transit.

    When an ATX message arrives at an ATM, processing it requires no external information as it is all in the local lookup tables indexed by a CAM on
    the line physical address. The CAM and tables likely require at least
    2R2W ports, one for new local commands and one for inbound ATX messages. Processing ATX messages and sending a reply should be at least 1 per clock
    (and more table ports get more messages per clock).

    At the instruction level a core can have multiple guard requests outstanding
    at once, and can optionally overlap this with probing the local cache for
    hits and fetching missed line data either read_share or read_excl.

    The cost that is most difficult to estimate is the collision rate.
    The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
    For N contenders each transaction can worst case fail N-1 times then succeed. So N contenders trying N times is worst case O(N^2) cost.
    But it also depends on how far each is into the transaction and how much investment it has in that transaction when it collides, loses, and retries.
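
    One way to put numbers on that bound (reading the FIFO ordering as: the
    k-th oldest contender loses at most to the k-1 older ones before it
    wins, so the system-wide total is 1 + 2 + ... + N = N*(N+1)/2 attempts):

        #include <stdio.h>

        /* Worst-case attempt count for N FIFO-ordered contenders: O(N^2)
           system-wide even though each transaction is bounded at N tries. */
        int main(void)
        {
            for (int n = 2; n <= 64; n *= 2) {
                long attempts = (long)n * (n + 1) / 2;
                printf("%2d contenders: at most %4ld total attempts\n", n, attempts);
            }
            return 0;
        }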

    So best case, a transaction has perfect overlap for all its guard range requests plus fetching all its cache lines and no shared lines.
    The latency could be about the time to fetch a single line.

    Worst case, the guard ranges are serialized, line fetches are serialized,
    every line is shared, every transaction collides and retries max times.
    Because of FIFO ordering that number is bounded, but it might be big.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Jan 3 17:13:59 2024
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differs from ESM,
    it is in how transactions are negotiated.

    I also have Cache Coherence (CC) protocol managing the
    shared/exclusive/owned
    line state and transfer of whole lines into, out of, and between caches. However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and knows nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges, it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same
    STM.}}

    It's difficult because there are so many options and possible optimizations.

    In the base design (no Directory Controller optimization) a guard range
    for bytes in a single line which does not have any shared bytes
    (that is, no read-read shared or adjacent false shared bytes)
    requires one request and one reply message to each peer core in a system.
    For C cores that's 2*(C-1) msgs.

    For unshared lines new guard requests in the same line require
    no new messages.

    If a line is shared between two different transactions, either both
    read share the same bytes, or read-write or write-write adjacent bytes,
    then each new guard range instruction requires a request and reply
    BUT only with the cores that are sharing lines.
    This would likely only happen if there were multiple objects that just
    happen to reside in the same cache line and are accessed by two or more transactions.

    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network. This eliminates all the queuing that cache coherence messages must transit.

    When an ATX message arrives at an ATM, processing it requires no external information as it is all in the local lookup tables indexed by a CAM on
    the line physical address. The CAM and tables likely require at least
    2R2W ports, one for new local commands and one for inbound ATX messages. Processing ATX messages and sending a reply should be at least 1 per clock (and more table ports get more messages per clock).

    At the instruction level a core can have multiple guard requests outstanding at once, and can optionally overlap this with probing the local cache for hits and fetching missed line data either read_share or read_excl.

    The cost that is most difficult to estimate is the collision rate.
    The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
    For N contenders each transaction can worst case fail N-1 times then succeed. So N contenders trying N times is worst case O(N^2) cost.
    But it also depends on how far each is into the transaction and how much investment it has in that transaction when it collides, loses, and retries.

    So best case, a transaction has perfect overlap for all its guard range requests plus fetching all its cache lines and no shared lines.
    The latency could be about the time to fetch a single line.

    Worst case, the guard ranges are serialized, line fetches are serialized, every line is shared, every transaction collides and retries max times. Because of FIFO ordering that number is bounded, but it might be big.


    Thanks for the lucid explanation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Jan 3 19:59:27 2024
    EricP wrote:
    MitchAlsup wrote:

    Can you speculate on the latency for a core to talk to an ATM and receive
    a response back ?? {{I am assuming all cores on a "chip" can use the same
    STM.}}

    The cost that is most difficult to estimate is the collision rate.
    The ATX protocol is a try-fail-retry based mechanism with FIFO ordering.
    For N contenders each transaction can worst case fail N-1 times then
    succeed.
    So N contenders trying N times is worst case O(N^2) cost.
    But it also depends on how far each is into the transaction and how much investment it has in that transaction when it collides, loses, and retries.

    Another consideration is the failure retry loop latency.
    If the loser in a collision immediately loops around and tries again, that
    might flood the coherence network with messages. If the loser backs off
    for a period of time, it can spend too much time in the back-off.

    The ATSTART instruction has an option to request a notification message
    after a collision that causes an abort.

    When there is a collision for access to bytes between two transactions
    the transaction number assigned at the start is used to decide who wins,
    the lower number is the older and has priority.

    The ATX guard message includes the 64-bit transaction number for deciding
    the winner, plus a 16-bit "try" number, the guard access read or write,
    a bit vector of which line bytes this applies to,
    and a bit requesting a notify message if access denied.

    If a notify is requested the winner in a collision remembers the
    transaction and try numbers of the loser and sends a Denied reply.
    The Denied causes the loser to abort and jump to its fail address where it executes an ATWAITNFY instruction to await a notification message with its transaction and try number (similar to the x86 MWAIT instruction).

    After the winner eventually commits, for each losing collider that
    requested notification it sends a message with the transaction and try
    number to the loser to wake it up as soon as possible. The ATWAITNFY instruction matches the transaction and try numbers and continues.

    It looks useful in theory but needs to be tested.
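
    In code the intended retry shape is roughly this (atx_start/atx_commit/
    atx_waitnfy are made-up stand-ins for ATSTART/COMMIT/ATWAITNFY, the
    abort-and-jump-to-the-fail-address is modeled as atx_commit() returning
    false, and the stubs exist only so the sketch compiles):

        /* Hypothetical stand-ins; a loser that requested notification does
           not spin or blindly back off, it parks in atx_waitnfy() until the
           winner commits and sends the wake-up carrying its try number. */
        #define ATX_NOTIFY_ON_DENY 1
        static void atx_start(int flags, int try_no) { (void)flags; (void)try_no; }
        static int  atx_commit(void)                 { return 1; }
        static void atx_waitnfy(int try_no)          { (void)try_no; }

        void run_transaction(void (*body)(void))
        {
            for (int try_no = 0; ; try_no++) {
                atx_start(ATX_NOTIFY_ON_DENY, try_no); /* request notify on deny */
                body();                                /* guards, loads, stores ... */
                if (atx_commit())
                    return;                            /* success */
                /* fail path: we lost a collision and asked for a notification */
                atx_waitnfy(try_no);                   /* sleep until the winner commits */
            }
        }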

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Fri Jan 5 15:23:31 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch. Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    That was the simplest form. A more sophisticated version could have
    a 2 or 3 bit tag like an ASID on each branch predictor entry.
    Tag 0 is for super mode, others are for the most recent 3 or 7 processes
    run on this core. If a lookup hits on an entry with a different tag
    then the entry is set to its uninitialized state.

    Though I'm not sure how this could work with the return stack predictor. Probably have to keep 4 or 8 copies of it.
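
    A toy software model of the tagged-entry idea (field names are invented,
    and a real design would CAM on the tag so a whole domain can be
    invalidated at once rather than lazily on lookup):

        #include <stdint.h>

        #define BP_ENTRIES 4096

        /* Each 2-bit counter carries a small domain tag (0 = super mode,
           1..7 = recent user processes).  A lookup that hits an entry with a
           foreign tag treats it as uninitialized and re-claims it, so one
           domain's training never steers another domain's predictions. */
        typedef struct {
            unsigned ctr   : 2;   /* 2-bit saturating counter */
            unsigned tag   : 3;   /* security-domain tag      */
            unsigned valid : 1;
        } bp_entry_t;

        static bp_entry_t bp[BP_ENTRIES];

        int bp_predict(uint32_t index, unsigned domain_tag)
        {
            bp_entry_t *e = &bp[index % BP_ENTRIES];
            if (!e->valid || e->tag != (domain_tag & 7)) {  /* empty or foreign: reset */
                e->ctr = 1;                                 /* weakly not-taken        */
                e->tag = domain_tag & 7;
                e->valid = 1;
            }
            return e->ctr >= 2;                             /* predict taken           */
        }

        void bp_update(uint32_t index, unsigned domain_tag, int taken)
        {
            bp_entry_t *e = &bp[index % BP_ENTRIES];
            if (e->valid && e->tag == (domain_tag & 7)) {
                if (taken  && e->ctr < 3) e->ctr++;
                if (!taken && e->ctr > 0) e->ctr--;
            }
        }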

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    Yes, only flush if the new thread is in a different process.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    IMO this "optimization" is not worth the security hole.

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism. But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    And that's especially the case because the attacker may be able to
    help luck in the attacker's direction by passing data to the victim
    process that results in training the branch predictor of the victim in
    a specific way. E.g., a PDF document processed by a browser will
    result in a lot of branches being taken in a certain way, which will
    train the branch predictor in a certain way.


    - I have an extensive set of conditional trap instructions intended for bounds checks, asserts, etc. In OoO a load or store following a
    bounds check might execute before the check exception was delivered.
    These trap instructions could have an optional NoSpeculate flag
    which again stalls the Dispatcher and essentially single steps just
    that instruction until the test condition resolves.
    Also very expensive for programs (programming languages) that use
    these instructions; speculative load hardening (SLH), which prevents
    Spectre v1 by turning the control dependence on the bounds check into
    a data dependence costs a factor of 2.3-2.5 (depending on the SLH
    variant) on the SPEC programs, and I expect your bounds-check trap
    instructions to be at least as expensive.
    I'm not sure - it depends on the frequency of occurrence and
    the latency between Dispatch and branch condition resolution.

    Based on nothing, I'm assuming both to be small :-)

    The main reason why OoO so vastly outperforms in-order for
    general-purpose code is that execution does not have to wait for the
    branches to be resolved; this is reflected in the size of the
    outstanding branches, which is, e.g., 128 for the Golden Cove <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/#gracemont-s-out-of-order-engine>
    (look for "Branch Order Buffer"). If you throw in one NoSpeculate
    branch, this reduces this number to 0 for this branch. And given the
    number of loads in a program, you probably have to make most branches "NoSpeculate" to be safe. So you fall back to close to in-order
    performance.

    BTW near as I can tell the "Branch Order Buffer" is just Intel's name
    for what others call the register file rename checkpoint.

    As I saw it, the NoSpeculate hint only stalls to wait for its branch,
    not all pending branches. The idea being that most branches are not for
    array bounds checks so don't need guarding, and for those that are bounds checks the array access LD's and ST's likely immediately follow.
    But it could also be used to eliminate timing variations in crypto loops.

    But yes, it does stall all following instructions.
    This was intended as a simple option that user or compilers can apply.
    To guard just a subset of the following instructions would require
    a mechanism like full predication which would be much more expensive.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Fri Jan 5 20:57:37 2024
    EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch. Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core would need to
    become twice as big to get the same accuracy for a program that spends almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    That was the simplest form. A more sophisticated version could have
    a 2 or 3 bit tag like an ASID on each branch predictor entry.

    Which halves the number of entries you can store or worse. Remember
    branch prediction states are un-tagged 2-bit saturating counters.

    Tag 0 is for super mode, others are for the most recent 3 or 7 processes
    run on this core. If a lookup hits on an entry with a different tag
    then the entry is set to its uninitialized state.

    Though I'm not sure how this could work with the return stack predictor. Probably have to keep 4 or 8 copies of it.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    Yes, only flush if the new thread is in a different process.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    IMO this "optimization" is not worth the security hole.

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display mechanism. But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the
    speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without
    training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    My 66000 PREDication does not use the branch prediction tables.

  • From EricP@21:1/5 to MitchAlsup on Sat Jan 6 13:16:03 2024
    MitchAlsup wrote:
    EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:

    - The branch predictor tables be separated by user and super mode,
    and that OS's be advised to purge the user mode tables on thread switch.

    Now that's an expensive approach in both silicon and performance: The
    branch predictor, one of the biggest parts of a core, would need to
    become twice as big to get the same accuracy for a program that spends
    almost all of its time in user mode, or almost all of its time in
    system mode. And a user-level program would still be slowed down a
    lot every time there is a thread switch.

    That was the simplest form. A more sophisticated version could have
    a 2 or 3 bit tag like an ASID on each branch predictor entry.

    Which halves the number of entries you can store or worse. Remember
    branch prediction states are un-tagged 2-bit saturating counters.

    I am under the impression that modern conditional branch predictors
    use many more bits so they can track loop patterns.
    Many BP's already use CAM address tags to improve accuracy.

    A Survey of Techniques for Dynamic Branch Prediction, 2018 https://arxiv.org/abs/1804.00261

    But even if it just costs 2 bits then so be it, that's the cost.
    This 2-bit ASID tag needs to be a CAM so you can invalidate
    all entries of a specific tag for recycle in one clock.

    The branch target buffers would also need CAM tags.
    Return stack predictor would need separate tables.
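
    For illustration only, here is a minimal C model of the scheme described
    above, with made-up sizes and names: a 2-bit counter table where each
    entry carries a small ASID-like tag, a lookup under a different tag
    reinitializes the entry, and a tag value can be bulk-recycled. In hardware
    the tag match and flash invalidate would be done by a CAM in one clock,
    not by the sweep loop shown here.

        #include <stdint.h>
        #include <stdbool.h>

        #define BP_ENTRIES 4096            /* illustrative table size */
        #define TAG_SUPER  0               /* tag 0 reserved for super mode */

        struct bp_entry {
            unsigned tag : 3;              /* super mode + up to 7 recent processes */
            unsigned ctr : 2;              /* 2-bit saturating counter */
        };

        static struct bp_entry bp[BP_ENTRIES];

        /* Predict taken/not-taken; a hit under a different tag is treated as
           uninitialized and reset to weakly not-taken for the current tag. */
        bool bp_predict(uint32_t index, unsigned cur_tag)
        {
            struct bp_entry *e = &bp[index % BP_ENTRIES];
            if (e->tag != cur_tag) {
                e->tag = cur_tag;
                e->ctr = 1;                /* weakly not-taken */
            }
            return e->ctr >= 2;
        }

        void bp_update(uint32_t index, unsigned cur_tag, bool taken)
        {
            struct bp_entry *e = &bp[index % BP_ENTRIES];
            if (e->tag != cur_tag)
                return;                    /* entry lost to another tag */
            if (taken  && e->ctr < 3) e->ctr++;
            if (!taken && e->ctr > 0) e->ctr--;
        }

        /* Recycle a tag for a new process: hardware would flash-invalidate
           all matching entries via the CAM in one clock. */
        void bp_recycle_tag(unsigned tag)
        {
            for (int i = 0; i < BP_ENTRIES; i++)
                if (bp[i].tag == tag)
                    bp[i] = (struct bp_entry){ .tag = tag, .ctr = 1 };
        }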

    Tag 0 is for super mode, others are for the most recent 3 or 7 processes
    run on this core. If a lookup hits on an entry with a different tag
    then the entry is set to its uninitialized state.

    Though I'm not sure how this could work with the return stack predictor.
    Probably have to keep 4 or 8 copies of it.

    Why would the branch predictions from a different thread/process
    be helpful to your thread?

    The typical scenario where a thread can benefit from not flushing the
    branch predictor is when there is a switch to a different thread for a
    short while and then a switch back.

    However, there are also scenarios where threads benefit from the
    branch predictions collected in a different thread:

    * if the thread is in the same process, and processes the same code or
    the same data.

    Yes, only flush if the new thread is in a different process.

    * if the thread is in a different process, and executes a common
    library (e.g., libc), or works on the same data (e.g., in a pipe).

    IMO this "optimization" is not worth the security hole.

    Retaining predictions across security domains *IS* the problem
    because it allows an attacker to influence/control a victim.
    The side channel leaks, while also important, are just a display
    mechanism.
    But with no control over a victim an attacker can make no use of
    side channels to display secrets.

    Spectre can be fixed by either preventing the side channel from the
    speculative to the committed state (the approach I suggest), or by
    preventing speculation (what the people who want to turn off
    speculation suggest).

    You suggest that erasing the branch predictor on thread switches is
    just as good as preventing speculation. But it isn't. Even without
    training, as long as there is speculation and the side channel from
    the speculative to the committed world, some data will be leaked. Ok,
    you may be tempted to rely on your luck that it's not sensitive data,
    but that does not appear to be a very trustworthy approach.

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    My 66000 PREDication does not use the branch prediction tables.

    There has been research on interaction between conditional branches and predicated code, mostly from around the Itanium time. Basically, when you
    move some execution from conditional to predicated it changes the stats
    for the branches that remain.

    I have also seen mention of "predication predictors"; I think the idea was to
    elide instructions whose predicates are predicted false from the stream.

  • From MitchAlsup@21:1/5 to EricP on Sat Jan 6 18:58:51 2024
    EricP wrote:

    MitchAlsup wrote:

    I would also rely on the NoSpeculate branch hint to stall branches
    that check array bounds.

    My 66000 PREDication does not use the branch prediction tables.

    There has been research on interaction between conditional branches and predicated code, mostly from around the Itanium time. Basically, when you move some execution from conditional to predicated it changes the stats
    for the branches that remain.

    One would expect that. In My 66000, predication is <now> done clause by clause.
    Under predication, one expects to have fetched at least to the last instruction
    of the else-clause by the time the branch condition is known. So fetch
    redirection is unwarranted (no need to branch, just skip the then-clause
    or skip the else-clause).
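
    As a rough software analogy (C, not My 66000 code), if-conversion keeps
    both clauses in the straight-line instruction stream and lets the
    condition select the result, so no fetch redirect is needed:

        /* Branchy form: fetch is redirected around whichever clause is skipped. */
        int branchy(int x, int a, int b)
        {
            int y;
            if (x > 0)
                y = a + b;      /* then-clause */
            else
                y = a - b;      /* else-clause */
            return y;
        }

        /* If-converted form: both clauses sit in the straight-line stream and
           the predicate merely selects (or, in hardware, nullifies) one of them. */
        int predicated(int x, int a, int b)
        {
            int t_then = a + b;
            int t_else = a - b;
            return (x > 0) ? t_then : t_else;   /* typically a conditional select */
        }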

    I have also seen mention of "predication predictors"; I think the idea was to
    elide instructions whose predicates are predicted false from the stream.

    Sooner or later this will help, but I don't see a need at 6-wide, yet.

  • From EricP@21:1/5 to EricP on Sun Jan 7 12:23:23 2024
    EricP wrote:
    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differ from ESM:
    in how transactions are negotiated.

    I also have a Cache Coherence (CC) protocol managing the shared/exclusive/owned
    line state and the transfer of whole lines into, out of, and between caches.
    However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and know nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges; it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.


    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network. This eliminates all the queuing that cache coherence messages must transit.

    I had something of an epiphany last night that I thought I'd pass on.

    The cache coherence protocol (CCP) supports communication between
    cache coherency managers (CCM) which they currently use to negotiate
    cache line ownership. The various L1, L2, L3 level managers pass CCP
    messages up and down the hierarchy between themselves over comms queues,
    and out over the inter-core network.

    I had been thinking that my Atomic Transaction Manager (ATM) would be
    located at the end of the Load Store Queue just before the cache and CCM.
    The ATM can intercept LSQ commands to the cache and modify them,
    to tuck aside a store in a transaction, or send commands to the
    LSQ itself, such as to command the LSQ dependency matrix to stall
    all memory ops to a particular cache line address.

    Since the ATX messages do not directly interact with the local cache
    they can bypass the level comms queues and flow directly between
    the ATM and the coherence network.

          Core-0                      Core-1
    LSQ<=>ATM<=>L1_CCM          L1_CCM<=>ATM<=>LSQ
           ^      |                |      ^
           |    L2_CCM          L2_CCM    |
           |      |                |      |
           |    L3_CCM          L3_CCM    |
           |      |                |      |
           v      v                v      v
           network<---------------->network


    Instead of thinking of the ATM as an independent unit attached to the LSQ,
    what if I see it as a sub-unit of the LSQ? That would make the ATX messages
    used for negotiating transactions just an example of a general concept of
    *messages sent between LSQ's to coordinate their operation*,
    just as CCP messages are sent between CCM's.

    In short, messaging directly between the LSQ's managers on different
    cores is potentially a whole new class of coherence and control.

    So the question is: besides my atomic transactions,
    what else might LSQ's want to say directly to each other?
    And remember the LSQ has other resources, the ordered queue of LD/ST ops,
    the address CAM's and op dependency matrix, pending store data, the TLB.

    For example, TLB shootdown is one that's already available on some cores
    but now could be seen as part of this general class of LSQ messaging.
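
    As a purely illustrative sketch (every name and field below is made up,
    not part of ATX as described), a C definition of such an LSQ-to-LSQ
    message family might look like this, with guard-range negotiation and
    TLB shootdown as two members of the same class:

        #include <stdint.h>

        /* Hypothetical LSQ-to-LSQ message classes carried on the coherence
           network, bypassing the L1/L2/L3 cache comms queues. */
        enum lsq_msg_kind {
            ATX_GUARD_READ_REQ,    /* request read permission on a guard byte range */
            ATX_GUARD_WRITE_REQ,   /* request write permission on a guard byte range */
            ATX_GUARD_GRANT,       /* peer ATM grants the requested permission */
            ATX_GUARD_DENY,        /* contention lost: requester must abort */
            ATX_GUARD_RELEASE,     /* transaction committed or aborted, drop guards */
            TLB_SHOOTDOWN,         /* invalidate a translation in the peer's TLB */
        };

        struct lsq_msg {
            enum lsq_msg_kind kind;
            uint16_t src_core;     /* sending LSQ/ATM */
            uint16_t dst_core;     /* receiving LSQ/ATM (or broadcast) */
            uint64_t addr;         /* start of guard byte range, or page address */
            uint32_t len;          /* length in bytes of the guarded range */
            uint32_t txn_id;       /* transaction identity for win/lose arbitration */
        };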

  • From MitchAlsup@21:1/5 to EricP on Sun Jan 7 19:52:42 2024
    EricP wrote:

    EricP wrote:
    MitchAlsup wrote:
    EricP wrote:

    And this is where I think my ATX atomic transactions differ from ESM:
    in how transactions are negotiated.

    I also have a Cache Coherence (CC) protocol managing the shared/exclusive/owned
    line state and the transfer of whole lines into, out of, and between caches.
    However I don't need a NAK in CC because line movement is never denied.

    My ATX coherence messages are a completely different protocol from CC.
    ATX messages deal with *permission* to read and write *individual bytes*
    in cache lines, and know nothing about the cache line state or where
    it is located. The Atomic Transaction Manager (ATM) uses ATX messages to
    talk to other peer ATM's about access to guard ranges; it intercepts
    stores from the LSQ to guarded byte ranges and tucks them aside,
    and triggers local aborts if a transaction permission is denied.


    The messages are sent from ATM to ATM over the coherence network.
    Because they do not interact with caches they do not need to travel
    down or up the cache hierarchy L1<=>L2<=>L3. Instead they can bypass
    all those cache comms queues and go directly between the ATM and network.
    This eliminates all the queuing that cache coherence messages must transit.

    I had something of an epiphany last night that I thought I'd pass on.

    The cache coherence protocol (CCP) supports communication between
    cache coherency managers (CCM) which they currently use to negotiate
    cache line ownership. The various L1, L2, L3 level managers pass CCP
    messages up and down the hierarchy between themselves over comms queues,
    and out over the inter-core network.

    I had been thinking that my Atomic Transaction Manager (ATM) would be
    located at the end of the Load Store Queue just before the cache and CCM.
    The ATM can intercept LSQ commands to the cache and modify them,
    to tuck aside a store in a transaction, or send commands to the
    LSQ itself, such as to command the LSQ dependency matrix to stall
    all memory ops to a particular cache line address.

    Since the ATX messages do not directly interact with the local cache
    they can bypass the level comms queues and flow directly between
    the ATM and the coherence network.

          Core-0                      Core-1
    LSQ<=>ATM<=>L1_CCM          L1_CCM<=>ATM<=>LSQ
           ^      |                |      ^
           |    L2_CCM          L2_CCM    |
           |      |                |      |
           |    L3_CCM          L3_CCM    |
           |      |                |      |
           v      v                v      v
           network<---------------->network


    Instead of thinking of the ATM as an independent unit attached to the LSQ,
    what if I see it as a sub-unit of the LSQ? That would make the ATX messages
    used for negotiating transactions just an example of a general concept of
    *messages sent between LSQ's to coordinate their operation*,
    just as CCP messages are sent between CCM's.

    The only thing I would add is that the delay from ATM to network may be
    multiple cycles, since L2 and L3 are each larger in diameter than the
    distance a signal can travel in wire per clock.

    In short, messaging directly between the LSQ's managers on different
    cores is potentially a whole new class of coherence and control.

    So the question is: besides my atomic transactions,
    what else might LSQ's want to say directly to each other?
    And remember the LSQ has other resources, the ordered queue of LD/ST ops,
    the address CAM's and op dependency matrix, pending store data, the TLB.

    You probably only want the translated PAs in the LSQ, not the TLB itself.

    For example, TLB shootdown is one that's already available on some cores
    but now could be seen as part of this general class of LSQ messaging.

    I found it better to just define the TLB as coherent--and get rid of
    TLB shootdowns entirely (no IPIs for PTE downgrades).
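
    For contrast, here is a hedged C sketch of the conventional IPI-based
    shootdown path that a coherent TLB would eliminate. The helper names
    (send_ipi, local_tlb_invalidate, shootdown_acked) are hypothetical
    stand-ins, not any particular OS's API.

        #include <stdint.h>
        #include <stdbool.h>

        #define MAX_CORES 64

        /* Hypothetical platform hooks -- stand-ins, not a real OS API. */
        extern void send_ipi(int core);                    /* interrupt a remote core */
        extern void local_tlb_invalidate(uint64_t vaddr);  /* drop one local translation */
        extern bool shootdown_acked(int core);             /* remote handler finished? */

        /* Conventional shootdown after downgrading a PTE: every core that might
           cache the old translation is interrupted, invalidates it, and the
           initiator spins until all of them acknowledge.  With a coherent TLB
           the PTE downgrade propagates through the coherence fabric and this
           whole routine disappears. */
        void tlb_shootdown(uint64_t vaddr, const bool active[MAX_CORES], int self)
        {
            local_tlb_invalidate(vaddr);

            for (int c = 0; c < MAX_CORES; c++)
                if (active[c] && c != self)
                    send_ipi(c);             /* remote handler invalidates vaddr */

            for (int c = 0; c < MAX_CORES; c++)
                while (active[c] && c != self && !shootdown_acked(c))
                    ;                        /* wait for acknowledgements */
        }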
