• Another security vulnerability

    From Stephen Fuld@21:1/5 to All on Sun Mar 24 10:40:18 2024
    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Sun Mar 24 18:20:06 2024
    Stephen Fuld wrote:

    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    They COULD start by not putting prefetched data into the cache
    until after the predicting instruction retires. {{I have a note
    from about 20 months ago where this feature was publicized and
    the note indicates a potential side-channel.}}

    An alternative is to notice that [*]cryption instructions are
    being processed and turn DMP off during those intervals of time.
    {Or both}.

    Principle:: an Architecturally visible unit of data can only become
    visible after the causing instruction retires. A high precision timer
    makes cache line [dis]placement visible; so either take away the HPT
    or don't alter cache visible state too early.
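    The cache-timing visibility described here can be illustrated with a
    toy prime+probe model (Python, illustrative only; the cache and the
    "slow miss" detection are simulated, and all names are made up):

```python
NSETS = 8  # toy direct-mapped cache with 8 sets

def line_set(addr):
    # the address alone picks the set, as in a direct-mapped cache
    return addr % NSETS

def victim(secret, cache):
    # a secret-dependent access alters cache state before anything retires
    cache[line_set(secret)] = "victim"

def attacker(secret):
    cache = {s: "attacker" for s in range(NSETS)}  # prime: fill every set
    victim(secret, cache)                          # the leaky access
    # probe: a real attacker times each line with a high-precision timer;
    # here the eviction is observed directly instead of via load latency
    return [s for s in range(NSETS) if cache[s] != "attacker"]

print(attacker(5))  # → [5]: the secret's cache set is recovered
```

    The model makes the principle concrete: take away the timer (the
    probe) or avoid the early state change (the victim's write into the
    shared cache) and the channel closes.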

    And we are off to the races, again.....

  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Sun Mar 24 21:47:17 2024
    On Sun, 24 Mar 2024 10:40:18 -0700, Stephen Fuld wrote:

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    Even if it’s fixed, how does that help existing users with broken
    machines?

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Mon Mar 25 06:37:31 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

    It's Groundhog Day, all over again!

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    From what is written in the article, nothing is currently known.

    For new silicon, people could finally implement Mitch's suggestion
    of not committing speculative state before the instruction retires.
    (It would be interesting to see how much area and power this
    would cost with the hundreds of instructions in flight with
    modern micro-architectures).

    For existing silicon - run crypto on efficiency cores, or just
    make sure not to run untrusted code on your machine :-(

  • From Anton Ertl@21:1/5 to Stephen Fuld on Mon Mar 25 07:25:34 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

    That's a pretty bad article, but at least one can read it without
    JavaScript, unlike the web page of the vulnerability
    <https://gofetch.fail/>.

    Classically, hardware prefetchers have based their predictions on the
    addresses of earlier accesses. If they base their predictions only on
    architectural accesses, and (by coding the cryptographic code
    appropriately) no information about the secrets reaches the
    addresses, the prefetcher cannot reveal these secrets through a side
    channel. Cryptographic code has been written that way for quite a
    while.
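    "Coding the cryptographic code appropriately" usually means making
    the sequence of accessed addresses independent of the secrets. A
    minimal sketch of the idea (Python, illustrative only; real
    constant-time code is written in C or assembly):

```python
def ct_select(table, secret_index):
    """Read every table entry in a fixed order, so the address stream is
    independent of secret_index; the wanted entry is extracted with
    branch-free arithmetic masking (valid for |i - secret_index| < 2**63)."""
    out = 0
    for i, v in enumerate(table):
        d = i - secret_index
        eq = ((d | -d) >> 63) + 1  # 1 when i == secret_index, else 0
        out |= -eq & v             # -1 & v == v; 0 & v == 0
    return out

print(ct_select([3, 1, 4, 1, 5], 2))  # → 4
```

    An address-based prefetcher watching this loop sees the same linear
    scan no matter which index is secret.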

    These prefetchers are not so great on pointer-chasing code (unless
    the data happens to be allocated at regular distances), so apparently
    Apple engineers (and, according to this article, also Intel
    engineers) added a prefetcher that prefetches based on the contents
    of the data that it fetches. To anyone who knows how a cache side
    channel works, it is crystal clear that this makes it possible to
    reveal the *contents* of accessed memory through a cache side
    channel. Even if only architectural accesses are used for that
    prediction, the possibility is still there, because cryptographic
    code has to access the secrets.
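    A toy model of such a data-memory-dependent prefetcher (DMP) shows
    why the contents leak; everything here is simplified and
    hypothetical:

```python
# The DMP scans loaded data and, if a value looks like a valid pointer,
# prefetches the line it points to -- so data *contents* change cache state.
VALID = set(range(0x1000, 0x2000))  # toy "mapped address" range
cache = set()

def load(addr, memory):
    cache.add(addr)        # normal demand fetch of the accessed line
    value = memory[addr]
    if value in VALID:     # DMP heuristic: the value resembles a pointer...
        cache.add(value)   # ...so its target is prefetched as well
    return value

memory = {0x1000: 0x1ABC}  # a secret that happens to look like a pointer
load(0x1000, memory)
print(0x1ABC in cache)     # → True: the secret's value is now cache state
```

    The program never architecturally dereferenced the secret, yet an
    attacker who can probe line 0x1ABC learns its value.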

    This should be clear to anyone who understands Spectre and actually
    anyone who understands classical (architectural) side-channel attacks;
    but it should have been on the minds of the hardware designers very
    much since Spectre has been discovered.

    The contribution of the GoFetch researchers is that they demonstrate
    that this is not just a theoretical possibility.

    If Intel added this vulnerability in Raptor Lake (as the article
    states), they have to take the full blame. On the positive side, the
    GoFetch researchers have not found a way to exploit Raptor Lake's
    data-dependent prefetcher. Yet. But I would not bet on there not
    being a way to exploit this.

    Apple's designers at least have the excuse that at the time when they
    laid the groundwork for the M1, Spectre was not known, and when it
    became known, it was too late to eliminate this prefetcher from the
    design (but not too late to disable it through a chicken bit).

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    What is the performance advantage? People who have tried to use
    software prefetching have often been disappointed by the results. I
    expect that a data-dependent prefetcher will usually not be
    beneficial, either. There will be a few cases where it may help, but
    the average performance advantage will be small.

    On the GoFetch web page the researchers suggest using the chicken bit
    in the cryptographic library. I would not be surprised if there was a
    combination of speculation and data-dependent prefetching, or of
    address-dependent and data-dependent prefetching, that allows all
    code (not just cryptographic code) to perform data-dependent
    prefetches based on the secret data that only crypto code accesses
    architecturally. But whether that's the case depends on the hardware
    design; plus, if speculative accesses from other code to this data
    are possible, the data can usually be revealed through a speculative
    load even in the absence of a data-dependent prefetcher (but there
    may be software mitigations for that scenario that the data-dependent
    prefetcher would circumvent).

    The web page also mentions "input blinding". I wonder how that can be
    made to work reliably. If the attacker has access to all the loaded
    data (through GoFetch) and knows how the blinded data is processed,
    the attacker can do everything that the crypto code can do,
    e.g. create a session key. Of course, if the attacker has to
    reconstruct the key from several pieces of data, the attack becomes
    more difficult, but relying on it being too difficult has not been a
    recipe for success in the past (e.g., before Bernstein's cache timing
    attack on AES, it was thought to be too difficult to exploit).
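    For reference, classic RSA base blinding looks roughly like the
    sketch below (Python, textbook toy parameters); the worry above is
    that an attacker who sees all loaded data may also see r and the
    unblinding step:

```python
import math
import random

# Textbook RSA with toy parameters -- real keys are thousands of bits
p, q = 61, 53
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (Python >= 3.8)

def sign_blinded(m):
    # base blinding: randomize the value that is actually exponentiated
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    blinded = (m * pow(r, e, n)) % n      # m' = m * r^e mod n
    s_blind = pow(blinded, d, n)          # m'^d = m^d * r (mod n)
    return (s_blind * pow(r, -1, n)) % n  # unblind: strip the factor r

m = 42
print(sign_blinded(m) == pow(m, d, n))  # → True: same signature, random trace
```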

    So what other solutions might there be? The results of the
    data-dependent prefetches could be loaded into a special cache so
    that they don't evict entries of other caches. If a load
    architecturally accesses this data, it is transferred to the regular
    cache. That special cache should be fully associative, to avoid
    revealing bits of the addresses of other accesses (i.e., of data)
    through the accessed set. That leaves the possibility of revealing
    something by evicting something from this special cache just based on
    capacity, but I don't see how that could be exploited.
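    The special-cache idea can be modeled as a staging buffer:
    prefetched lines sit outside the normal cache and are promoted only
    when an architectural access touches them. A toy sketch (hypothetical
    design, no timing modeled):

```python
from collections import OrderedDict

class StagedPrefetchCache:
    """Prefetches land in a small fully associative staging buffer and are
    promoted to the main cache only on an architectural access, so they
    never evict (and thus never reveal) ordinary cache contents."""
    def __init__(self, stage_size=4):
        self.main = set()
        self.stage = OrderedDict()  # fully associative, FIFO replacement
        self.stage_size = stage_size

    def prefetch(self, line):
        if line in self.main:
            return
        self.stage[line] = True
        if len(self.stage) > self.stage_size:
            self.stage.popitem(last=False)  # capacity eviction only

    def access(self, line):
        self.stage.pop(line, None)  # architectural touch: leave the buffer
        self.main.add(line)         # ...and enter the regular cache

c = StagedPrefetchCache()
c.prefetch(0x40)  # a DMP guess: invisible to the main cache
c.access(0x80)    # an ordinary load
print(0x40 in c.main, 0x80 in c.main)  # → False True
c.access(0x40)    # architectural use promotes the prefetched line
print(0x40 in c.main)  # → True
```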

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Thomas Koenig on Mon Mar 25 08:37:51 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    For existing silicon - run crypto on efficiency cores

    That does not help against the recent vulnerability, which affects
    Intel's efficiency cores.
    Also, if the prefetcher works with data in a shared cache (I don't
    know whether the data-dependent prefetchers do that), it may not
    matter on which core the code runs.

    - anton
  • From Michael S@21:1/5 to Thomas Koenig on Mon Mar 25 13:27:32 2024
    On Mon, 25 Mar 2024 06:37:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/


    It's Groundhog Day, all over again!

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    From what is written in the article, nothing is currently known.

    For new silicon, people could finally implement Mitch's suggestion
    of not committing speculative state before the instruction retires.
    (It would be interesting to see how much area and power this
    would cost with the hundreds of instructions in flight with
    modern micro-architectures).

    For existing silicon - run crypto on efficiency cores, or just
    make sure not to run untrusted code on your machine :-(

    I would trust your second solution better than the first one.

  • From John Savard@21:1/5 to sfuld@alumni.cmu.edu.invalid on Mon Mar 25 07:26:35 2024
    On Sun, 24 Mar 2024 10:40:18 -0700, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    While the article was pessimistic about fixing it by a microcode fix
    that could be downloaded, the description didn't suggest to me that it
    would be hard to correct it in future designs. Maybe my memory is
    faulty.

    John Savard

  • From Michael S@21:1/5 to John Savard on Mon Mar 25 15:53:06 2024
    On Mon, 25 Mar 2024 07:26:35 -0600
    John Savard <quadibloc@servername.invalid> wrote:

    On Sun, 24 Mar 2024 10:40:18 -0700, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/


    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    While the article was pessimistic about fixing it by a microcode fix
    that could be downloaded, the description didn't suggest to me that it
    would be hard to correct it in future designs. Maybe my memory is
    faulty.

    John Savard

    The description suggests that it wouldn't be hard to correct by
    introducing a new mode bit that can be turned on and off by crypto
    libraries.
    That is not going to help existing binaries.

    Now, personally I don't believe that for a single-user platform like
    the Mac the threat is even remotely real, or that any fix is needed.
    And I am not even an Apple fanboy, more like a mild hater.

  • From Stefan Monnier@21:1/5 to All on Mon Mar 25 12:26:30 2024
    So, is there a way to fix this while maintaining the feature's
    performance advantage?
    Even if it’s fixed, how does that help existing users with broken
    machines?

    Of course, the current computing world loves it even more if it pushes
    users to buy new machines.

    This said, until now manufacturers don't seem to consider side-channel
    attacks as something worth spending much effort to avoid, so I'd expect
    future machines to be just as vulnerable :-(


    Stefan

  • From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Mon Mar 25 09:50:28 2024
    On 3/24/2024 2:47 PM, Lawrence D'Oliveiro wrote:
    On Sun, 24 Mar 2024 10:40:18 -0700, Stephen Fuld wrote:

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    Even if it’s fixed, how does that help existing users with broken
    machines?

    The article gives several suggestions, but they all come at a
    performance cost, and require some, presumably small, changes to crypto
    code.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Mar 25 17:06:35 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

    It's Groundhog Day, all over again!

    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    From what is written in the article, nothing is currently known.

    For new silicon, people could finally implement Mitch's suggestion
    of not committing speculative state before the instruction retires.

    To be fair, that's been suggested for quite a few years by
    various semiconductor designers. The difficulties lie in
    efficient implementation thereof.

    (It would be interesting to see how much area and power this
    would cost with the hundreds of instructions in flight with
    modern micro-architectures).

    Yes.


    For existing silicon - run crypto on efficiency cores, or just
    make sure not to run untrusted code on your machine :-(

  • From Stefan Monnier@21:1/5 to All on Mon Mar 25 12:18:18 2024
    Principle:: an Architecturally visible unit of data can only become
    visible after the causing instruction retires. A high precision timer
    makes cache line [dis]placement visible; so either take away the HPT
    or don't alter cache visible state too early.

    And parallelism (e.g. multicores) can be used to emulate HPT, so "take
    away the HPT" is not really an option.
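    The emulation referred to here is just a sibling thread spinning on a
    shared counter; reading the counter before and after an operation
    yields a timestamp without any timer instruction. A rough sketch
    (Python, far coarser than the real attack's native code because of
    the GIL):

```python
import threading
import time

stop = False
ticks = 0

def counter():
    # the "second core": a thread that does nothing but count
    global ticks
    while not stop:
        ticks += 1

t = threading.Thread(target=counter)
t.start()
before = ticks
time.sleep(0.05)          # stand-in for the memory access being timed
elapsed = ticks - before  # counter delta serves as the elapsed time
stop = True
t.join()
print(elapsed > 0)        # the improvised clock advanced
```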


    Stefan

  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Mar 25 17:07:16 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    For existing silicon - run crypto on efficiency cores

    That does not help against the recent vulnerability, which affects
    Intel's efficiency cores.
    Also, if the prefetcher works with data in a shared cache (I don't
    know whether the data-dependent prefetchers do that), it may not
    matter on which core the code runs.

    Run it in non-cacheable memory. Slow but safe.

  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Mon Mar 25 21:22:35 2024
    On Mon, 25 Mar 2024 09:50:28 -0700, Stephen Fuld wrote:

    On 3/24/2024 2:47 PM, Lawrence D'Oliveiro wrote:

    Even if it’s fixed, how does that help existing users with broken
    machines?

    The article gives several suggestions, but they all come at a
    performance cost ...

    The basic problem is that building all this complex, bug-prone
    functionality into monolithic, nonupgradeable hardware is not really a
    good idea.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Mar 25 22:17:55 2024
    Lawrence D'Oliveiro wrote:

    On Mon, 25 Mar 2024 09:50:28 -0700, Stephen Fuld wrote:

    On 3/24/2024 2:47 PM, Lawrence D'Oliveiro wrote:

    Even if it’s fixed, how does that help existing users with broken
    machines?

    The article gives several suggestions, but they all come at a
    performance cost ...

    The basic problem is that building all this complex, bug-prone
    functionality into monolithic, nonupgradeable hardware is not really a
    good idea.

    Would you like to inform us of how it can be done otherwise ?

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Tue Mar 26 01:34:45 2024
    On Mon, 25 Mar 2024 21:22:35 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 25 Mar 2024 09:50:28 -0700, Stephen Fuld wrote:

    On 3/24/2024 2:47 PM, Lawrence D'Oliveiro wrote:

    Even if it’s fixed, how does that help existing users with broken
    machines?

    The article gives several suggestions, but they all come at a
    performance cost ...

    The basic problem is that building all this complex, bug-prone
    functionality into monolithic, nonupgradeable hardware is not really
    a good idea.

    Maybe not a good idea. But the best.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Mar 26 00:27:34 2024
    On Mon, 25 Mar 2024 22:17:55 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    The basic problem is that building all this complex, bug-prone
    functionality into monolithic, nonupgradeable hardware is not really a
    good idea.

    Would you like to inform us of how it can be done otherwise ?

    Upgradeable firmware/software, of course.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Mar 26 02:31:38 2024
    On Mon, 25 Mar 2024 15:53:06 +0200, Michael S wrote:

    Now, personally I don't believe that for single-user platform like Mac
    the threat is even remotely real and that any fix is needed.

    It could be exploited via, for example, some future web browser or other
    app vulnerability.

    Remember Schneier’s dictum: attacks only ever get worse, they never get better.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Tue Mar 26 02:32:46 2024
    On Mon, 25 Mar 2024 17:07:16 GMT, Scott Lurndal wrote:

    Run it in non-cacheable memory. Slow but safe.

    But 99% of the performance speedups of the last 20-30 years have involved caching of some kind.

  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 26 09:18:36 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Also, if the prefetcher works with data in a shared cache (I don't
    know whether the data-dependent prefetchers do that), it may not
    matter on which core the code runs.

    Run it in non-cacheable memory. Slow but safe.

    To eliminate this particular vulnerability, it's sufficient to disable
    the data-dependent prefetcher.

    I wonder where these ideas are coming from that call for the worst
    possible fix for a vulnerability?

    - anton
  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Mar 26 14:12:39 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Also, if the prefetcher works with data in a shared cache (I don't
    know whether the data-dependent prefetchers do that), it may not
    matter on which core the code runs.

    Run it in non-cacheable memory. Slow but safe.

    To eliminate this particular vulnerability, it's sufficient to disable
    the data-dependent prefetcher.

    That assumes that chicken bit(s) are available to do that.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Mar 26 14:12:09 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 25 Mar 2024 17:07:16 GMT, Scott Lurndal wrote:

    Run it in non-cacheable memory. Slow but safe.

    But 99% of the performance speedups of the last 20-30 years have
    involved caching of some kind.

    So what? Running the crypto algorithms (when not offloaded to
    on-chip accelerators) using non-cacheable memory as a workaround
    until the hardware issues are ameliorated doesn't imply that
    all other code needs to run non-cacheable.

  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 26 16:36:26 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 25 Mar 2024 17:07:16 GMT, Scott Lurndal wrote:

    Run it in non-cacheable memory. Slow but safe.
    ...
    Running the crypto algorithms (when not offloaded to
    on-chip accelerators) using non-cacheable memory as a workaround
    until the hardware issues are ameliorated doesn't imply that
    all other code needs to run non-cachable.

    Then your crypto code is slow and unsafe. The attacker will use the
    rest of the application to extract the crypto keys, whether through
    the cache side-channel of Spectre, or a prefetcher-based side channel.

    - anton
  • From Michael S@21:1/5 to Anton Ertl on Tue Mar 26 18:29:41 2024
    On Mon, 25 Mar 2024 07:25:34 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/


    That's a pretty bad article, but at least one can read it without
    JavaScript, unlike the web page of the vulnerability
    <https://gofetch.fail/>.


    In case you missed it, the web page contains a link to a PDF:
    https://gofetch.fail/files/gofetch.pdf

  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 26 16:40:38 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Also, if the prefetcher works with data in a shared cache (I don't
    know whether the data-dependent prefetchers do that), it may not
    matter on which core the code runs.

    Run it in non-cacheable memory. Slow but safe.

    To eliminate this particular vulnerability, it's sufficient to disable
    the data-dependent prefetcher.

    That assumes that chicken bit(s) are available to do that.

    The hardware designers have put in the chicken bit(s); it's highly
    unlikely that they have unconditionally enabled the data-dependent
    prefetcher on M1 and M2, and only added a chicken bit on M3. Now that
    the hardware indeed turns out to be broken, they just need to activate
    it/them.

    - anton
  • From Michael S@21:1/5 to Stefan Monnier on Tue Mar 26 18:57:30 2024
    On Mon, 25 Mar 2024 12:18:18 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    Principle:: an Architecturally visible unit of data can only become
    visible after the causing instruction retires. A high precision
    timer makes cache line [dis]placement visible; so either take away
    the HPT or don't alter cache visible state too early.

    And parallelism (e.g. multicores) can be used to emulate HPT, so "take
    away the HPT" is not really an option.


    Stefan

    That's exactly how they measured time in the PoC.

  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Mar 26 18:56:19 2024
    On Sun, 24 Mar 2024 18:20:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Stephen Fuld wrote:

    https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/


    So, is there a way to fix this while maintaining the feature's
    performance advantage?

    They COULD start by not putting prefetched data into the cache
    until after the predicting instruction retires. {{I have a note
    from about 20 months ago where this feature was publicized and
    the note indicates a potential side-channel.}}

    An alternative is to notice that [*]cryption instructions are
    being processed and turn DMP off during those intervals of time.
    {Or both}.


    Their PoC attacks public-key crypto algorithms - RSA-2048, DHKE-2048,
    and a couple of exotic new algorithms that nobody uses.
    I think neither RSA-2048 nor DHKE-2048 uses any special crypto
    instructions.
    On Intel/AMD it's likely that these crypto routines use the MULX and
    ADX instructions much more often than non-crypto code, but that's not
    guaranteed.
    On ARM64 you don't even have that much, because the equivalents of
    MULX and ADX are part of the base instruction set.

    Principle:: an Architecturally visible unit of data can only become
    visible after the causing instruction retires. A high precision timer
    makes cache line [dis]placement visible; so either take away the HPT
    or don't alter cache visible state too early.

    And we are off to the races, again.....

  • From Thomas Koenig@21:1/5 to Michael S on Wed Mar 27 08:20:49 2024
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains a link to a PDF:
    https://gofetch.fail/files/gofetch.pdf

    Looking at the paper, it seems that a separate "load value"
    instruction (where it is guaranteed that no pointer prefetching will
    be done) could fix this particular issue. Compilers know what type is
    being loaded from memory, and could issue the corresponding
    instruction. This would not impact performance.

    Only works for new versions of an architecture, and supporting
    compilers, but no code change would be required. And, of course,
    it would eat up opcode space.

  • From EricP@21:1/5 to Thomas Koenig on Wed Mar 27 09:44:34 2024
    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains a link to a PDF:
    https://gofetch.fail/files/gofetch.pdf

    Looking the paper, it seems that a separate "load value" instruction
    (where it is guaranteed that no pointer prefetching will be done)
    could fix this particular issue. Compilers know what type is being
    loaded from memory, and could issue the corresponding instruction.
    This would not impact performance.

    Only works for new versions of an architecture, and supporting
    compilers, but no code change would be required. And, of course,
    it would eat up opcode space.

    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't matter.

    I used this approach for the Atomic Fetch-and-OP instructions.
    They only need one or two data types and one address mode.

    I also considered the same single [reg] address mode for privileged
    Load & Store to Physical Address, though these would need to
    support 1,2,4, and 8 byte ints, and need some cache control bits.

  • From Thomas Koenig@21:1/5 to EricP on Wed Mar 27 15:05:36 2024
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains link to pdf:
    https://gofetch.fail/files/gofetch.pdf

    Looking the paper, it seems that a separate "load value" instruction
    (where it is guaranteed that no pointer prefetching will be done)
    could fix this particular issue. Compilers know what type is being
    loaded from memory, and could issue the corresponding instruction.
    This would not impact performance.

    Only works for new versions of an architecture, and supporting
    compilers, but no code change would be required. And, of course,
    it would eat up opcode space.

    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't matter.

    Hm, I'm not sure it would actually be used rarely, at least not
    the way I thought about it.

    I envisage a "ldp" (load pointer) instruction, which turns on
    prefetching, for everything that looks like

    foo_t *p = some_expr;

    which could also mean something like

    *p = ptrarray[i];

    with a scaled and indexed load (for example), where prefetching is
    turned on; and a "ldd" (load double data) instruction where,
    explicitly, for

    long int n = some_other_expr;

    prefetching is explicitly disabled. (Apart from the security
    implications, this could also save a tiny bit of power.)

  • From EricP@21:1/5 to Thomas Koenig on Wed Mar 27 12:08:05 2024
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains link to pdf:
    https://gofetch.fail/files/gofetch.pdf
    Looking the paper, it seems that a separate "load value" instruction
    (where it is guaranteed that no pointer prefetching will be done)
    could fix this particular issue. Compilers know what type is being
    loaded from memory, and could issue the corresponding instruction.
    This would not impact performance.

    Only works for new versions of an architecture, and supporting
    compilers, but no code change would be required. And, of course,
    it would eat up opcode space.
    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't matter.

    Hm, I'm not sure it would actually be used rarely, at least not
    the way I thought about it.

    I'm referring to your load with prefetch disable.
    For these particular loads, its users could likely tolerate the
    "overhead" of an extra LEA instruction to calculate the address,
    and don't need all 7 integer data types.

    I envisage a "ldp" (load pointer) instruction, which turns on
    prefetching, for everything that looks like

    foo_t *p = some_expr;

    which could also mean something like

    *p = ptrarray[i];

    So this would be
    LEA r0,[rBase+rIndex*8+offset]
    LDAPQ r0,[r0] // load quad with auto pointer prefetch

    though I'm not really sold on the need for this if you have an instruction (below) that explicitly disables pointer auto-prefetch.
    Plus all this does is eliminate an explicit PREFCHR Prefetch-for-Read.

    with a scaled and indexed load (for example), where prefetching
    is turned on, and a "ldd" (load double data) instruction for
    everything like

    long int n = some_other_expr;

    where prefetching is explicitly disabled. (Apart from the security
    implications, this could also save a tiny bit of power.)

    LEA r0,[rBase+offset]
    LDNPQ r0,[r0] // load quad no-auto-prefetch

    costs 1 opcode as it only supports int64 data type and
    doesn't need a corresponding STNPQ store.

  • From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Wed Mar 27 09:22:12 2024
    On 3/25/2024 5:27 PM, Lawrence D'Oliveiro wrote:
    On Mon, 25 Mar 2024 22:17:55 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    The basic problem is that building all this complex, bug-prone
    functionality into monolithic, nonupgradeable hardware is not really a
    good idea.

    Would you like to inform us of how it can be done otherwise ?

    Upgradeable firmware/software, of course.

    But microcode is generally slower than dedicated hardware, and most
    people seem to be unwilling to give up performance all the time to gain
    an advantage in a situation that occurs infrequently and mostly never.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Anton Ertl@21:1/5 to Thomas Koenig on Wed Mar 27 17:52:30 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains link to pdf:
    https://gofetch.fail/files/gofetch.pdf

    Looking at the paper, it seems that a separate "load value" instruction
    (where it is guaranteed that no pointer prefetching will be done)
    could fix this particular issue. Compilers know what type is being
    loaded from memory, and could issue the corresponding instruction.
    This would not impact performance.

    Only works for new versions of an architecture, and supporting
    compilers, but no code change would be required. And, of course,
    it would eat up opcode space.

    The other way 'round seems to be a better approach: mark those loads
    that load addresses, and then prefetch at most based on the data
    loaded by these instructions (of course, the data-dependent prefetcher
    may choose to ignore the data based on history). That means that
    existing programs are immune to GoFetch, but also don't benefit from
    the data-dependent prefetcher (which is a minor issue IMO).

    As for opcode space, we already have prefetch instructions, so one
    could implement this by letting every load that actually loads an
    address be followed by a prefetch instruction. But of course that
    would consume more code space and decoding bandwidth than just adding
    a load-address instruction.

    In any case, passing the prefetch hints to hardware that may ignore
    the hint based on history may help reduce the performance
    disadvantages that have been seen when using prefetch hint
    instructions.
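    [ed.: with current GCC/Clang, the "address load followed by a prefetch
    instruction" form can be written today with the __builtin_prefetch
    hint, which the hardware is likewise free to ignore:]

```c
#include <assert.h>
#include <stdint.h>

typedef struct node { int64_t payload; struct node *next; } node_t;

/* One explicit prefetch hint per address load -- the code-space-hungry
   variant described above.  The second and third arguments mark the
   hinted line as read-only with low temporal locality. */
int64_t walk_with_hints(const node_t *p) {
    int64_t sum = 0;
    while (p) {
        __builtin_prefetch(p->next, 0, 1);  /* hint the next node's line */
        sum += p->payload;
        p = p->next;
    }
    return sum;
}
```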

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to EricP on Wed Mar 27 18:14:11 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't matter.

    You lost me here. Do you mean that a load with address mode
    [register] is considered to be a non-address load and not followed by
    the data-dependent prefetcher? So how would an address load be
    encoded if the natural expression would be [register]?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Anton Ertl on Wed Mar 27 15:42:34 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't matter.

    You lost me here. Do you mean that a load with address mode
    [register] is considered to be a non-address load and not followed by
    the data-dependent prefetcher? So how would an address load be
    encoded if the natural expression would be [register]?

    - anton

    I'm pointing out that not all instructions need to be orthogonal.
    There can be savings in opcode space by tempering that based on
    expected frequency of occurrence.

    The normal LD and ST have all their address modes and data types
    because these functions occur frequently enough that we deem it
    worthwhile to support these all in one instruction,
    such as supporting both sign and zero extended loads
    or scaled index addressing.

    I note there is this class of relatively rarely used special purpose
    memory access instructions that don't need to have all singing and all
    dancing address modes and/or data types like the regular LD and ST.

    Since I need a LEA Load Effective Address instruction anyway
    which does rBase+rIndex*scale+offset calculation
    (plus I have others, like where rBase is RIP or an absolute address),
    then I can drop all but the [reg] address mode for these rare instructions
    and in many cases drop some sign or zero extend types for loads.

    For example, I use just two opcodes for Atomic Fetch Add int64 and int32
    AFADD8 rDst,rSrc,[rAddr]
    AFADD4 rDst,rSrc,[rAddr]

  • From Scott Lurndal@21:1/5 to Anton Ertl on Wed Mar 27 20:52:13 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains link to pdf:
    https://gofetch.fail/files/gofetch.pdf

    Looking at the paper, it seems that a separate "load value" instruction
    (where it is guaranteed that no pointer prefetching will be done)
    could fix this particular issue. Compilers know what type is being
    loaded from memory, and could issue the corresponding instruction.
    This would not impact performance.

    It is worth noting (from the paper's Introduction):

    In particular, Augury reported that the [Apple M-series ed.] DMP only activates
    in the presence of a rather idiosyncratic program memory
    access pattern (where the program streams through an array
    of pointers and architecturally dereferences those pointers).
    This access pattern is not typically found in security critical
    software such as side-channel hardened constant-time code--
    hence making that code impervious to leakage through the
    DMP.
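    [ed.: the access pattern the paper describes is an ordinary
    architectural pointer-array walk -- an illustration of the shape
    only, not an exploit:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stream through an array of pointers and architecturally dereference
   each one -- the "idiosyncratic" shape the Apple DMP activates on. */
int64_t sum_through_pointers(int64_t *const *table, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += *table[i];   /* the loaded value is itself used as an address */
    return sum;
}
```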

  • From Scott Lurndal@21:1/5 to Scott Lurndal on Wed Mar 27 20:53:25 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains link to pdf:
    https://gofetch.fail/files/gofetch.pdf

    Looking at the paper, it seems that a separate "load value" instruction
    (where it is guaranteed that no pointer prefetching will be done)
    could fix this particular issue. Compilers know what type is being
    loaded from memory, and could issue the corresponding instruction.
    This would not impact performance.

    It is worth noting (from the paper's Introduction):

    In particular, Augury reported that the [Apple M-series ed.] DMP only activates
    in the presence of a rather idiosyncratic program memory
    access pattern (where the program streams through an array
    of pointers and architecturally dereferences those pointers).
    This access pattern is not typically found in security critical
    software such as side-channel hardened constant-time code--
    hence making that code impervious to leakage through the
    DMP.

    Reminder to self, read rest of article before commenting.

    Never mind.

  • From Thomas Koenig@21:1/5 to EricP on Wed Mar 27 20:15:25 2024
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:

    In case you missed it, the web page contains link to pdf:
    https://gofetch.fail/files/gofetch.pdf
    Looking at the paper, it seems that a separate "load value" instruction
    (where it is guaranteed that no pointer prefetching will be done)
    could fix this particular issue. Compilers know what type is being
    loaded from memory, and could issue the corresponding instruction.
    This would not impact performance.

    Only works for new versions of an architecture, and supporting
    compilers, but no code change would be required. And, of course,
    it would eat up opcode space.
    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't matter.

    Hm, I'm not sure it would actually be used rarely, at least not
    the way I thought about it.

    I'm referring to your load with prefetch disable.
    For these particular loads its users could likely tolerate the
    "overhead" of an extra LEA instruction to calculate the address,
    and don't need all 7 integer data types.

    If it were LEA-only, it would need some kind of pragma in the code
    which said "use this more cumbersome and slower, but safer
    version".

    For that reason, I would probably prefer a separate version
    which is implicitly safe and does not have any other drawbacks
    for performance, with no code changes.

    If it's worth the opcode space...

  • From MitchAlsup1@21:1/5 to EricP on Wed Mar 27 21:27:16 2024
    EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't matter.

    You lost me here. Do you mean that a load with address mode
    [register] is considered to be a non-address load and not followed by
    the data-dependent prefetcher? So how would an address load be
    encoded if the natural expression would be [register]?

    - anton

    I'm pointing out that not all instructions need to be orthogonal.
    There can be savings in opcode space by tempering that based on
    expected frequency of occurrence.

    The normal LD and ST have all their address modes and data types
    because these functions occur frequently enough that we deem it
    worthwhile to support these all in one instruction,
    such as supporting both sign and zero extended loads
    or scaled index addressing.

    I note there is this class of relatively rarely used special purpose
    memory access instructions that don't need to have all singing and all dancing address modes and/or data types like the regular LD and ST.

    Since I need a LEA Load Effective Address instruction anyway
    which does rBase+rIndex*scale+offset calculation
    (plus I have others, like where rBase is RIP or an absolute address),
    then I can drop all but the [reg] address mode for these rare instructions and in many cases drop some sign or zero extend types for loads.

    It seems to me that once the core has identified an address, and that an
    offset from that address contains another address (foo->next, foo->prev),
    only those are prefetched. So this depends on placing next as the first
    member of a structure, and remains dependent on chasing next a lot more
    often than chasing prev.

    Otherwise, knowing a loaded value contains a pointer to a structure (or array) one cannot predict what to prefetch unless one can assume the offset into the struct (or array).

    Now Note:: If there were an instruction that loaded the value known to be
    a pointer and prefetched based on the received pointer, then the prefetch
    is now architectural not µArchitectural and you are allowed to damage the cache or TLB when/after the instruction retires.

    For example, I use just two opcodes for Atomic Fetch Add int64 and int32
    AFADD8 rDst,rSrc,[rAddr]
    AFADD4 rDst,rSrc,[rAddr]
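    [ed.: the quoted AFADD opcodes are fixed-width forms of what C11
    expresses generically with atomic_fetch_add, the width coming from
    the atomic type:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* AFADD8 rDst,rSrc,[rAddr] : 64-bit atomic fetch-and-add, returning
   the value previously at *addr. */
int64_t afadd8(_Atomic int64_t *addr, int64_t src) {
    return atomic_fetch_add(addr, src);
}

/* AFADD4 rDst,rSrc,[rAddr] : the 32-bit form. */
int32_t afadd4(_Atomic int32_t *addr, int32_t src) {
    return atomic_fetch_add(addr, src);
}
```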

  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Thu Mar 28 00:46:13 2024
    On Wed, 27 Mar 2024 09:22:12 -0700, Stephen Fuld wrote:

    On 3/25/2024 5:27 PM, Lawrence D'Oliveiro wrote:

    On Mon, 25 Mar 2024 22:17:55 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    The basic problem is that building all this complex, bug-prone
    functionality into monolithic, nonupgradeable hardware is not really
    a good idea.

    Would you like to inform us of how it can be done otherwise ?

    Upgradeable firmware/software, of course.

    But microcode is generally slower than dedicated hardware, and most
    people seem to be unwilling to give up performance all the time to gain
    an advantage in a situation that occurs infrequently and mostly never.

    Bruce Schneier has a saying: “attacks never get worse, they can only get better”.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Mar 28 01:22:06 2024
    Lawrence D'Oliveiro wrote:

    On Wed, 27 Mar 2024 09:22:12 -0700, Stephen Fuld wrote:

    On 3/25/2024 5:27 PM, Lawrence D'Oliveiro wrote:

    On Mon, 25 Mar 2024 22:17:55 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    The basic problem is that building all this complex, bug-prone
    functionality into monolithic, nonupgradeable hardware is not really
    a good idea.

    Would you like to inform us of how it can be done otherwise ?

    Upgradeable firmware/software, of course.

    But microcode is generally slower than dedicated hardware, and most
    people seem to be unwilling to give up performance all the time to gain
    an advantage in a situation that occurs infrequently and mostly never.

    S.E.L. 32/65, 32/67, 32/87 were all microcoded. 95% of the instructions*
    ran down the pipeline without using the microcode (which was only there
    to pick up the pieces after the HW logic sequencers got too complicated).

    (*) closer to 97% of the dynamic instruction stream.

    Microcode IS generally slower, but not always. PDP-11 or VAX microcode
    is too much; IBM's and S.E.L.'s is not.

    Bruce Schneier has a saying: “attacks never get worse, they can only get better”.

    Which is why you have to design for attackers (to be thwarted).

    Observing the last 4-odd years, it appears to me that CPU designers will
    not be in a position to "give a rat's ass" until there are
    performance-competitive, cost-competitive alternatives. Given Apple,
    AMD, ARM, and Intel not giving a rat's ass, it's going to have to come
    from somewhere else.

  • From EricP@21:1/5 to All on Thu Mar 28 12:52:07 2024
    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    It doesn't need to eat opcode space if you only support one data type,
    64-bit ints, and one address mode, [register].
    Other address modes can be calculated using LEA.
    Since these are rare instructions to solve a particular problem,
    they won't be used that often, so a few extra instructions shouldn't
    matter.

    You lost me here. Do you mean that a load with address mode
    [register] is considered to be a non-address load and not followed by
    the data-dependent prefetcher? So how would an address load be
    encoded if the natural expression would be [register]?

    - anton

    I'm pointing out that not all instructions need to be orthogonal.
    There can be savings in opcode space by tempering that based on
    expected frequency of occurrence.

    The normal LD and ST have all their address modes and data types
    because these functions occur frequently enough that we deem it
    worthwhile to support these all in one instruction,
    such as supporting both sign and zero extended loads
    or scaled index addressing.

    I note there is this class of relatively rarely used special purpose
    memory access instructions that don't need to have all singing and all
    dancing address modes and/or data types like the regular LD and ST.

    Since I need a LEA Load Effective Address instruction anyway
    which does rBase+rIndex*scale+offset calculation
    (plus I have others, like where rBase is RIP or an absolute address),
    then I can drop all but the [reg] address mode for these rare
    instructions
    and in many cases drop some sign or zero extend types for loads.

    It seems to me that once the core has identified an address, and that an
    offset from that address contains another address (foo->next, foo->prev),
    only those are prefetched. So this depends on placing next as the first
    member of a structure, and remains dependent on chasing next a lot more
    often than chasing prev.

    Otherwise, knowing a loaded value contains a pointer to a structure (or
    array), one cannot predict what to prefetch unless one can assume the
    offset into the struct (or array).

    Right, this is the problem that the "data memory-dependent" prefetchers
    like the one described in that Intel "Programmable and Integrated Unified
    Memory Architecture (PIUMA)" paper referenced by Paul Clayton are trying
    to solve.

    The pointer field to chase can be
    (a) at a +- offset from the current pointer's virtual address
    (b) at a different offset for each iteration
    (c) conditional on some other field at some other offset

    and most important:

    (d) any new pointers are virtual addresses that have to start back at
    the Load Store Queue for VA translation and forwarding testing
    after applying (a),(b) and (c) above.

    Since each chased pointer starts back at LSQ, the cost is no different
    than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.

    So I find the simplistic, blithe data-dependent auto prefetching
    described as questionable.
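    [ed.: cases (a)-(c) are easy to exhibit in source form; in the sketch
    below the field to chase is itself data-dependent, so a prefetcher
    with no knowledge of the type layout cannot know which offset to
    follow:]

```c
#include <assert.h>
#include <stdint.h>

typedef struct rec {
    int64_t key;
    int tag;                  /* case (c): a field that selects the live pointer */
    struct rec *left, *right; /* cases (a)/(b): candidates at different offsets  */
} rec_t;

/* The offset of the next pointer to chase is not a fixed function of the
   current address: it depends on the tag loaded from the record itself. */
int64_t walk(const rec_t *r) {
    int64_t sum = 0;
    while (r) {
        sum += r->key;
        r = r->tag ? r->left : r->right;
    }
    return sum;
}
```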

    Now Note:: If there were an instruction that loaded the value known to be
    a pointer and prefetched based on the received pointer, then the prefetch
    is now architectural not µArchitectural and you are allowed to damage the cache or TLB when/after the instruction retires.

    In the PIUMA case those pointers were to sparse data sets
    so part of the problem was rolling over the cache, as well as
    (and the PIUMA paper didn't mention this) the TLB.

    After reading the PIUMA paper I had an idea for a small modification
    to the PTE cache control bits to handle sparse data. The PTE's 3 CC bits
    can specify the upper page table levels are cached in the TLB but
    lower levels are not because they would always roll over the TLB.
    However the non-TLB cached PTE's may optionally still be cached
    in L1 or L2, or not at all.

    This allows one to hold the top page table levels in the TLB,
    the upper middle levels in L1, lower middle levels in L2,
    and leaf PTE's and sparse code/data not cached at all.
    BUT, as PIUMA proposes, we also allow the memory subsystem to read and
    write individual aligned 8-byte values from DRAM, rather than whole cache
    lines, so we only move the actual 8-byte values we need.

    Also note that page table walks are also graph structure walks
    but chasing physical addresses at some simple calculated offsets.
    These physical addresses might be cached in L1 or L2 so we can't
    just chase these pointers in the memory controller but,
    if one wants to do this, have to do so in the cache controller.

  • From MitchAlsup1@21:1/5 to EricP on Thu Mar 28 19:59:45 2024
    EricP wrote:

    MitchAlsup1 wrote:

    It seems to me that once the core has identified an address, and that an
    offset from that address contains another address (foo->next, foo->prev),
    only those are prefetched. So this depends on placing next as the first
    member of a structure, and remains dependent on chasing next a lot more
    often than chasing prev.

    Otherwise, knowing a loaded value contains a pointer to a structure (or
    array), one cannot predict what to prefetch unless one can assume the
    offset into the struct (or array).

    Right, this is the problem that the "data memory-dependent" prefetchers
    like the one described in that Intel "Programmable and Integrated Unified
    Memory Architecture (PIUMA)" paper referenced by Paul Clayton are trying
    to solve.

    The pointer field to chase can be
    (a) at a +- offset from the current pointer's virtual address
    (b) at a different offset for each iteration
    (c) conditional on some other field at some other offset

    and most important:

    (d) any new pointers are virtual addresses that have to start back at
    the Load Store Queue for VA translation and forwarding testing
    after applying (a),(b) and (c) above.

    This is the tidbit that prevents doing prefetches at/in the DRAM controller.
    The address so fetched needs translation! And this requires dragging
    stuff over to the DRC that is not normally done.

    Since each chased pointer starts back at LSQ, the cost is no different
    than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.

    Latency cost is identical, instruction issue/retire costs are lower.

    So I find the simplistic, blithe data-dependent auto prefetching
    described as questionable.

    K9 built a SW model of such a prefetcher. For the first 1B cycles of a
    SPEC benchmark from ~2004 it performed quite well. {{We later figured out
    that this was initialization that built a GB data structure.}} Later on
    in the benchmark the pointers got scrambled and the performance of the
    prefetcher fell on its face.

    Moral, you need close to 1T cycle of simulation to qualify a prefetcher.

    Now Note:: If there were an instruction that loaded the value known to be
    a pointer and prefetched based on the received pointer, then the prefetch
    is now architectural not µArchitectural and you are allowed to damage the
    cache or TLB when/after the instruction retires.

    In the PIUMA case those pointers were to sparse data sets
    so part of the problem was rolling over the cache, as well as
    (and the PIUMA paper didn't mention this) the TLB.

    After reading the PIUMA paper I had an idea for a small modification
    to the PTE cache control bits to handle sparse data. The PTE's 3 CC bits
    can specify the upper page table levels are cached in the TLB but
    lower levels are not because they would always roll over the TLB.
    However the non-TLB cached PTE's may optionally still be cached
    in L1 or L2, or not at all.

    This allows one to hold the top page table levels in the TLB,
    the upper middle levels in L1, lower middle levels in L2,
    and leaf PTE's and sparse code/data not cached at all.

    Given the 2-level TLBs currently in vogue, the first level might
    have 32-64 PTEs, while the second might have 2048 PTEs. With this
    number of PTEs available, does your scheme still give benefit ??

    BUT, as PIUMA proposes, we also allow the memory subsystem to read and
    write individual aligned 8-byte values from DRAM, rather than whole cache
    lines, so we only move the actual 8-byte values we need.

    Busses on cores are reaching the stage where an entire cache line
    is transferred in 1-cycle. With such busses, why define anything
    smaller than a cache line ?? {other than uncacheable accesses}

    Also note that page table walks are also graph structure walks
    but chasing physical addresses at some simple calculated offsets.
    These physical addresses might be cached in L1 or L2 so we can't
    just chase these pointers in the memory controller but,
    if one wants to do this, have to do so in the cache controller.

    Yes, this is why the K9 prefetcher was in the L2 where it had access
    to the L2 TLB.

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Fri Mar 29 14:15:23 2024
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/28/24 3:59 PM, MitchAlsup1 wrote:
    EricP wrote:
    [snip]
    (d) any new pointers are virtual addresses that have to start back at
         the Load Store Queue for VA translation and forwarding testing
         after applying (a),(b) and (c) above.

    This is the tidbit that prevents doing prefetches at/in the DRAM controller.
    The address so fetched needs translation! And this requires dragging
    stuff over to the DRC that is not normally done.

    With multiple memory channels having independent memory
    controllers (a reasonable design I suspect), a memory controller
    may have to send the prefetch request to another memory controller
    anyway.

    Which is usually handled by the LLC when the address space is
    striped across multiple memory controllers.



    Busses on cores are reaching the stage where an entire cache line
    is transferred in 1-cycle. With such busses, why define anything
    smaller than a cache line ?? {other than uncacheable accesses}

    The Intel research chip was special-purpose targeting
    cache-unfriendly code. Reading 64 bytes when 99% of the time 56
    bytes would be unused is rather wasteful (and having more memory
    channels helps under high thread count).

    Given the lack of both spatial and temporal locality in that
    workload, one wonders if the data should be cached at all.


    However, even for a "general purpose" processor, "word"-granular
    atomic operations could justify not having all data transfers be
    cache line size. (Such are rare compared with cache line loads
    from memory or other caches, but a design might have narrower
    connections for coherence, interrupts, etc. that could be used for
    small data communication.)

    So long as the data transfer is cachable, the atomics can be handled
    at the LLC, rather than the memory controller.

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Fri Mar 29 22:34:44 2024
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/29/24 10:15 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/28/24 3:59 PM, MitchAlsup1 wrote:
    [snip]

    However, even for a "general purpose" processor, "word"-granular
    atomic operations could justify not having all data transfers be
    cache line size. (Such are rare compared with cache line loads
    from memory or other caches, but a design might have narrower
    connections for coherence, interrupts, etc. that could be used for
    small data communication.)

    So long as the data transfer is cachable, the atomics can be handled
    at the LLC, rather than the memory controller.

    Yes, but if the width of the on-chip network — which is what Mitch
    was referring to in transferring a cache line in one cycle — is
    c.72 bytes (64 bytes for the data and 8 bytes for control
    information) it seems that short messages would either have to be
    grouped (increasing latency) or waste a significant fraction of
    the potential bandwidth for that transfer. Compressed cache lines
    would also not save bandwidth. These may not be significant
    considerations, but this is an answer to "why define anything
    smaller than a cache line?", i.e., seemingly reasonable
    motivations may exist.


    It's not uncommon for the bus/switch/mesh -structure- to be 512-bits wide, which indeed will support a full cache line transfer in a single transaction; it also supports high-volume DMA operations (either memory to memory or
    device to memory).

    Most of the interconnect (bus, switched or point-to-point) implementations
    have an overlaying protocol (including the cache coherency
    protocol) and are effectively message based, with agents posting requests
    that don't need a reply and expecting a reply for the rest.

    That doesn't require that every transaction over that bus
    utilize the full width of the bus.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Mar 30 01:06:23 2024
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/29/24 10:15 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/28/24 3:59 PM, MitchAlsup1 wrote:
    [snip]

    However, even for a "general purpose" processor, "word"-granular
    atomic operations could justify not having all data transfers be
    cache line size. (Such are rare compared with cache line loads
    from memory or other caches, but a design might have narrower
    connections for coherence, interrupts, etc. that could be used for
    small data communication.)

    So long as the data transfer is cachable, the atomics can be handled
    at the LLC, rather than the memory controller.

    Yes, but if the width of the on-chip network — which is what Mitch
    was referring to in transferring a cache line in one cycle — is
    c.72 bytes (64 bytes for the data and 8 bytes for control
    information) it seems that short messages would either have to be
    grouped (increasing latency) or waste a significant fraction of
    the potential bandwidth for that transfer. Compressed cache lines
    would also not save bandwidth. These may not be significant
    considerations, but this is an answer to "why define anything
    smaller than a cache line?", i.e., seemingly reasonable
    motivations may exist.


    It's not uncommon for the bus/switch/mesh -structure- to be 512-bits wide, which indeed will support a full cache line transfer in a single transaction;

    It is not the transaction, it is a single beat of the clock. One can have a
    narrower bus width and simply divide the cache line size by the bus width
    to get the number of required beats.

    it also supports high-volume DMA operations (either memory to memory or device to memory).

    Most of the interconnect (bus, switched or point-to-point) implementations have an

    or more than one

    overlaying protocol (including the cache coherency
    protocol) and are effectively message based, with agents posting requests that don't need a reply and expecting a reply for the rest.

    Many older busses read PTP and PTEs from memory sizeof( PTE ) at a time,
    some of them requesting write permission so that used and modified bits
    can be written back immediately. {{Which skirts the distinction between cacheable and uncacheable in several ways.}}

    That doesn't require every transaction over that bus to
    utilize the full width of the bus.

    In my wide bus situation, the line width is used to gang up multiple
    responses (from different end-points) into a single beat==message.
    For example the chip-to-chip transport can carry multiple independent
    SNOOP responses in a single beat (saving cycles and lowering latency).

  • From Stefan Monnier@21:1/5 to All on Wed Apr 3 14:10:41 2024
    Since each chased pointer starts back at LSQ, the cost is no different
    than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.

    I thought the important difference is that the decision to prefetch or
    not can be done dynamically based on past history.


    Stefan

  • From Thomas Koenig@21:1/5 to Stefan Monnier on Wed Apr 3 20:05:02 2024
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Since each chased pointer starts back at LSQ, the cost is no different
    than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.

    I thought the important difference is that the decision to prefetch or
    not can be done dynamically based on past history.

    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches), but ought to be quite good
    at predicting prefetches. If a pointer is loaded, chances are
    very high that it will be dereferenced.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Apr 3 21:37:13 2024
    Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Since each chased pointer starts back at LSQ, the cost is no different
    than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.

    I thought the important difference is that the decision to prefetch or
    not can be done dynamically based on past history.

    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches),

    Which are always predicted to have no error.

    but ought to be quite good
    at predicting prefetches.

    What makes you think programmers understand prefetches any better than exceptions ??

    If a pointer is loaded, chances are
    very high that it will be dereferenced.

    What if the value loaded is NULL?

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Apr 4 17:51:19 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Since each chased pointer starts back at LSQ, the cost is no different than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.

    I thought the important difference is that the decision to prefetch or
    not can be done dynamically based on past history.

    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches),

    Which are always predicted to have no error.

    On the second or third time, certainly. Hmmm... given hot/cold
    splitting which is fairly standard by now, do branch predictors
    take this into account?


    but ought to be quite good
    at predicting prefetches.

    What makes you think programmers understand prefetches any better than exceptions ??

    Pointers are used in many common data structures; linked list,
    trees, ... A programmer who does not know about dereferencing
    pointers should be kept away from computer keyboards, preferably
    at a distance of at least 3 m.


    If a pointer is loaded, chances are
    very high that it will be dereferenced.

    What if the value loaded is NULL?

    Then it should be trivially predicted that it should not be prefetched.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Thu Apr 4 20:09:39 2024
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Since each chased pointer starts back at LSQ, the cost is no different than an explicit Prefetch instruction, except without (a),(b) and (c) having been applied first.

    I thought the important difference is that the decision to prefetch or not can be done dynamically based on past history.

    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches),

    Which are always predicted to have no error.

    There I mean that the programmer wrote the code::

    if( no error so far )
    {
    then continue
    }
    else
    {
    deal with the error
    }

    Many times, the "deal with the error" code is never even fetched.

    On the second or third time, certainly. Hmmm... given hot/cold
    splitting which is fairly standard by now, do branch predictors
    take this into account?

    First we are talking about predicting branches at compile time and
    the way the programmer writes the source code, not about the dynamic predictions of HW.

    Given that it is compile time, one either predicts it is taken
    (loops) or not taken (errors and initialization) and arrange
    the code such that fall through is the predicted pattern (except
    for loops).

    Then at run time, all these branches are predicted with the standard
    predictors present in the core.
    Initialization stuff is mispredicted once or twice
    error code is only mispredicted when an error occurs
    loops are mispredicted once or twice per entrance.

    Also note:: With an ISA like My 66000, one can perform branching using predication and neither predict the branch nor modify where FETCH is
    fetching. Ideally, predication should deal with hard to predict branches
    and all flow control where the then and else clauses are short. When
    these are removed from the predictor, prediction should improve--maybe
    not in the number of predictions that are correct, but in the total time
    wasted on branching (including both repair and misfetching overheads).

    but ought to be quite good
    at predicting prefetches.

    What makes you think programmers understand prefetches any better than
    exceptions ??

    Pointers are used in many common data structures; linked list,
    trees, ... A programmer who does not know about dereferencing
    pointers should be kept away from computer keyboards, preferably
    at a distance of at least 3 m.

    3Km ??


    If a pointer is loaded, chances are
    very high that it will be dereferenced.

    What if the value loaded is NULL?

    Then it should be trivially predicted that it should not be prefetched.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Apr 4 20:57:45 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Since each chased pointer starts back at LSQ, the cost is no different than an explicit Prefetch instruction, except without (a),(b) and (c) having been applied first.

    I thought the important difference is that the decision to prefetch or not can be done dynamically based on past history.

    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches),

    Which are always predicted to have no error.

    There I mean that the programmer wrote the code::

    if( no error so far )
    {
    then continue
    }
    else
    {
    deal with the error
    }

    Many times, the "deal with the error" code is never even fetched.

    On the second or third time, certainly. Hmmm... given hot/cold
    splitting which is fairly standard by now, do branch predictors
    take this into account?

    First we are talking about predicting branches at compile time and
    the way the programmer writes the source code, not about the dynamic predictions of HW.

    gcc provides a way to "annotate" a condition with the expected
    common result:

    #define likely(x) __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)


    if (likely(bus_enable.s.enabled)) {
        do something
    } else {
        do something else
    }

    This will affect the layout of the code (e.g. deferring generation
    of the else clause with the result that it ends up in a different
    cache line or page).

    It's used in the linux kernel, and in certain cpu bound applications.


    Thank you for pointing this out.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Apr 4 20:35:12 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Since each chased pointer starts back at LSQ, the cost is no different than an explicit Prefetch instruction, except without (a),(b) and (c) having been applied first.

    I thought the important difference is that the decision to prefetch or not can be done dynamically based on past history.

    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches),

    Which are always predicted to have no error.

    There I mean that the programmer wrote the code::

    if( no error so far )
    {
    then continue
    }
    else
    {
    deal with the error
    }

    Many times, the "deal with the error" code is never even fetched.

    On the second or third time, certainly. Hmmm... given hot/cold
    splitting which is fairly standard by now, do branch predictors
    take this into account?

    First we are talking about predicting branches at compile time and
    the way the programmer writes the source code, not about the dynamic predictions of HW.

    gcc provides a way to "annotate" a condition with the expected
    common result:

    #define likely(x) __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)


    if (likely(bus_enable.s.enabled)) {
        do something
    } else {
        do something else
    }

    This will affect the layout of the code (e.g. deferring generation
    of the else clause with the result that it ends up in a different
    cache line or page).

    It's used in the linux kernel, and in certain cpu bound applications.

  • From Stefan Monnier@21:1/5 to All on Fri Apr 5 12:54:50 2024
    Since each chased pointer starts back at LSQ, the cost is no different
    than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.
    I thought the important difference is that the decision to prefetch or
    not can be done dynamically based on past history.
    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches), but ought to be quite good
    at predicting prefetches. If a pointer is loaded, chances are
    very high that it will be dereferenced.

    I don't think it's that simple: prefetches only bring the data into L1
    cache, so they're only useful if:

    - The data is not already in L1.
    - The data will be used soon (i.e. before it gets thrown away from the cache).
    - The corresponding load doesn't occur right away.

    In all other cases, the prefetch will be just wasted work.

    It's easy for programmers to "predict" those (dependent) loads which will occur right away, but those don't really benefit from a prefetch.
    E.g. if the dependent load is done 2 cycles later, performing a prefetch
    lets you start the memory access 2 cycles early, but since that access
    is not in L1 it'll take more than 10 cycles, so shaving
    2 cycles off isn't of great benefit.

    Given that we're talking about performing a prefetch on the result of
    a previous load, and loads tend to already have a fairly high latency
    (3-5 cycles), "2 cycles later" really means "5-7 cycles after the
    beginning of the load of that pointer". That can easily translate to 20 instructions later.

    My gut feeling is that it's difficult for programmers to predict what
    will happen more than 20 instructions further without looking at
    detailed profiling.


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Apr 5 21:28:59 2024
    Stefan Monnier wrote:

    Since each chased pointer starts back at LSQ, the cost is no different than an explicit Prefetch instruction, except without (a),(b) and (c)
    having been applied first.
    I thought the important difference is that the decision to prefetch or
    not can be done dynamically based on past history.
    Programmers and compilers are notoriously bad at predicting
    branches (except for error branches), but ought to be quite good
    at predicting prefetches. If a pointer is loaded, chances are
    very high that it will be dereferenced.

    I don't think it's that simple: prefetches only bring the data into L1
    cache, so they're only useful if:

    - The data is not already in L1.
    - The data will be used soon (i.e. before it gets thrown away from the cache).
    - The corresponding load doesn't occur right away.

    In all other cases, the prefetch will be just wasted work.

    It's easy for programmers to "predict" those (dependent) loads which will occur
    right away, but those don't really benefit from a prefetch.
    E.g. if the dependent load is done 2 cycles later, performing a prefetch
    lets you start the memory access 2 cycles early, but since that access
    is not in L1 it'll take more than 10 cycles, so shaving
    2 cycles off isn't of great benefit.

    Given that we're talking about performing a prefetch on the result of
    a previous load, and loads tend to already have a fairly high latency
    (3-5 cycles), "2 cycles later" really means "5-7 cycles after the
    beginning of the load of that pointer". That can easily translate to 20 instructions later.

    My gut feeling is that it's difficult for programmers to predict what
    will happen more than 20 instructions further without looking at
    detailed profiling.

    Difficult becomes impossible when the code has to operate "well" over
    a range of implementations. {{With some suitable definition of "well"}}

    Consider deciding how far into the future (counting instructions) a
    prefetch has to be placed so that the data arrives before use. Then
    consider the smallest execution window is 16 instructions and the
    largest execution window is 300 instructions; and you want the same
    code to be semi-optimal on both.

    Stefan

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Apr 7 23:50:25 2024
    Paul A. Clayton wrote:

    On 4/4/24 4:09 PM, MitchAlsup1 wrote:
    Thomas Koenig wrote:
    [snip]

    Also note:: With an ISA like My 66000, one can perform branching
    using predication and neither predict the branch nor modify where
    FETCH is fetching. Ideally, predication should deal with hard to
    predict branches and all flow control where the then and else
    clauses are short. When these are removed from the predictor,
    prediction should improve--maybe not in the number of predictions
    that are correct, but in the total time wasted on branching
    (including both repair and misfetching overheads).

    Rarely-executed blocks should presumably use branches even when
    short to remove the rarely-executed code from the normal
    instruction stream. I would guess that exceptional actions are
    typically longer/more complicated.

    If you will arrive at the join point by simply fetching there
    is no reason to use a branch.

    (Consistent timing would also be important for some real-time
    tasks and for avoiding timing side channels.)

    Predication is closer to fixed timing than any branching.

    The best performing choice would also seem to be potentially microarchitecture-dependent. Obviously the accuracy of branch
    prediction and the cost of aliasing would matter (and
    perversely mispredicting a branch can _potentially_ improve
    performance, though not on My 66000, I think, because more
    persistent microarchitectural state is not updated until
    instruction commitment).

    While not committed, it is still available.

    If the predicate value is delayed and predicated operations
    wait in the scheduler for this operand and the operands of one
    path are available before the predicate value, branch prediction
    might allow deeper speculation.

    Just like data forwarding, branch prediction forwarding is
    easily achieved in the pipeline.

    For highly dynamically
    predictable but short branches, deeper speculation might help
    more than fetch irregularity hurts.

    What can possibly be more regular than constant time ??

    (The predicate could be
    predicted — and this prediction is needed later in the pipeline —

    One HAS TO assume that GBOoO machines will predict predicates.

    but not distinguishing between prediction-allowed predication
    and normal predication might prevent prediction from being
    implemented to avoid data-dependent timing of predication.)

    (The cost of speculation can also be variable. With underutilized
    resources (thermal, memory bandwidth, etc.) speculation would
    generally be less expensive than with high demand on resources.)

  • From MitchAlsup1@21:1/5 to All on Tue Jun 4 22:57:59 2024
    I am resurrecting this thread to talk about a different cache that
    may or may not be vulnerable to Spectré like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Any ideas ??

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jun 4 23:55:53 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache that
    may or may not be vulnerable to Spectré like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Don't forget to factor in a hit or miss on the on-drive controller cache.

    Linux uses free memory as an ad-hoc file cache (caching file blocks
    from all disks); I suspect that location in the file cache
    for any particular 4k group of sectors (or 4k sector on more modern
    disks) will not be predictable for any given workload.

    Certain accesses (O_DIRECT, Unix raw devices) bypass the operating
    system cache (and for raw devices the target of the DMA is the userspace buffer).

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Jun 5 08:38:59 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache that
    may or may not be vulnerable to Spectré like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Any ideas ??

    If you want to claim that there is a vulnerability, it's your job to demonstrate it.

    That being said, filling the OS disk cache requires I/O commands,
    typically effected through stores that make it to the I/O device. In
    cases where the I/O is effected through a load, the I/O device is in
    an uncacheable part of the address space (otherwise even
    non-speculative accesses would not work correctly, and speculative
    accesses would have caused havoc long ago), and the load is delayed
    until it is no longer speculative.

    Ok, you might say, what about just the preparation of the buffer in
    ordinary write-back-cached memory? Yes, that can happen
    speculatively, so speculatively executed code might somehow see a
    buffer that should later be used for some I/O. But where is the
    security vulnerability in this scenario? Once the speculation turns
    out to be wrong, all the changes in the store buffer will be canceled.
    The only trace that remains will (on Spectre-vulnerable hardware) be
    in the caches and other microarchitectural features, just as with
    existing Spectre attacks. There is nothing special about disk buffers
    here.

    And on Spectre-invulnerable hardware (which still does not exist 7
    years after the vulnerability has been reported to Intel and AMD:-(,
    and the recent announcements of Zen5, Lunar Lake and the new ARM cores
    are not promising) no trace of the speculation will be left in microarchitectural state when it turns out to be wrong.

    BTW, it's Spectre without accent, see <https://spectreattack.com/>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to All on Wed Jun 5 11:43:18 2024
    MitchAlsup1 wrote:
    I am resurrecting this thread to talk about a different cache that may
    or may not be vulnerable to Spectré like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Any ideas ??

    It won't be vulnerable to a direct speculation attack because
    the cpu does not trigger page faults on mispredicted paths.
    So you can't use the presence in a file cache to probe code paths
    or data values to leak secrets.

    Also the 4kB resolution would be problematic to correlate back to
    particular branches taken and infer secret values.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jun 10 22:06:01 2024
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache that
    may or may not be vulnerable to Spectré like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Any ideas ??

    If you want to claim that there is a vulnerability, it's your job to demonstrate it.

    That being said, filling the OS disk cache requires I/O commands,
    typically effected through stores that make it to the I/O device. In
    cases where the I/O is effected through a load, the I/O device is in
    an uncacheable part of the address space (otherwise even
    non-speculative accesses would not work correctly, and speculative
    accesses would have caused havoc long ago), and the load is delayed
    until it is no longer speculative.

    It seems to me that there are 4 service intervals:
    a) in application buffering--this suffers no supervisor call latency
    b) hit in the disk cache--this suffers no I/O latency
    c) hit in the SATA drive cache--this suffers only I/O latency
    d) Drive positions head and waits for the sector to arrive

    I believe that the service intervals are fairly disjoint so they can
    be identified with a high precision timer. My guesses::
    a) 50ns
    b) 300ns
    c) 3µs
    d) 10ms

    {{I remember that just pushing I/O commands out PCIe takes µs amounts
    of time, while head and rotational delays of disks are ms.}}

    Ok, you might say, what about just the preparation of the buffer in
    ordinary write-back-cached memory? Yes, that can happen
    speculatively, so speculatively executed code might somehow see a
    buffer that should later be used for some I/O. But where is the
    security vulnerability in this scenario? Once the speculation turns
    out to be wrong, all the changes in the store buffer will be canceled.

    Spectré works by filling its data cache with data it controls, and then measuring which cache lines got displaced by other activities; and then inferring the data in those lines.

    What I am proposing here is that the attacker fills the disk cache with
    his file data, then measures what parts of the disk cache got
    overwritten. That much is straightforward. The problem is in trying to
    infer the bit patterns of that overwriting data.

    Spectré makes 2 back to back accesses and sensitizes the various caches
    and predictors by doing completely legal code, and then using the same
    code to access something that should not be accessible--that is the
    microarchitectural opening in the security. With LD latency increasing
    from 2 cycles (MIPS) to 5 cycles (latest Intel), the desire to process
    back to back dependent loads means the second LD starts when the data
    is accessed, but all of the checks (PPN = PTE.PPN & access checks)
    have not been completed. This second address is presented to the cache
    for an access and will be cancelled later, but if it takes a miss, the
    miss buffer will service the miss and install (alter data in) the cache,
    displacing data that should have remained in the cache. By determining
    which cache lines got displaced, one infers the data accessed that
    should not have been.

    So the question is how to setup data in the disk cache such that that
    kind of inference can be made.


    The only trace that remains will (on Spectre-vulnerable hardware) be
    in the caches and other microarchitectural features, just as with
    existing Spectre attacks. There is nothing special about disk buffers
    here.

    And on Spectre-invulnerable hardware (which still does not exist 7
    years after the vulnerability has been reported to Intel and AMD:-(,
    and the recent announcements of Zen5, Lunar Lake and the new ARM cores
    are not promising) no trace of the speculation will be left in microarchitectural state when it turns out to be wrong.

    BTW, it's Spectre without accent, see <https://spectreattack.com/>

    - anton

  • From MitchAlsup1@21:1/5 to EricP on Mon Jun 10 22:09:26 2024
    EricP wrote:

    MitchAlsup1 wrote:
    I am resurrecting this thread to talk about a different cache that may
    or may not be vulnerable to Spectré like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Any ideas ??

    It won't be vulnerable to a direct speculation attack because
    the cpu does not trigger page faults on mispredicted paths.

    Effectively, the CPU puts the PAGEFAULT into the execution pipeline
    and only takes the exception if it reaches the retire point without
    getting flushed by a mispredict repair.

    So you can't use the presence in a file cache to probe code paths
    or data values to leak secrets.

    Also the 4kB resolution would be problematic to correlate back to
    particular branches taken and infer secret values.

    That just slows down the rate (BW) of the inference, and does nothing
    about closing any existing hole.

  • From Stephen Fuld@21:1/5 to All on Mon Jun 10 22:36:27 2024
    MitchAlsup1 wrote:

    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache
    that may or may not be vulnerable to Spectré like attacks.

    Consider an attack strategy that measures whether a disk
    sector/block is in (or not in) the OS disk cache. {Very similar
    to attacks that figure out if a cache line is in the Data Cache
    (or not).}

    Any ideas ??

    If you want to claim that there is a vulnerability, it's your job to demonstrate it.

    That being said, filling the OS disk cache requires I/O commands,
    typically effected through stores that make it to the I/O device.
    In cases where the I/O is effected through a load, the I/O device
    is in an uncacheable part of the address space (otherwise even non-speculative accesses would not work correctly, and speculative
    accesses would have caused havoc long ago), and the load is delayed
    until it is no longer speculative.

    It seems to me that there are 4 service intervals:
    a) in application buffering--this suffers no supervisor call latency
    b) hit in the disk cache--this suffers no I/O latency
    c) hit in the SATA drive cache--this suffers only I/O latency
    d) Drive positions head and waits for the sector to arrive


    That's all true except if the "disk" is actually an SSD.



    I believe that the service intervals are fairly disjoint so there can
    be identified with a high precision timer. My guesses::
    a) 50ns
    b) 300ns
    c) 3µs
    d) 10ms

    An SSD reduces d significantly. Probably still disjoint, but not
    nearly as much.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Stephen Fuld on Tue Jun 11 00:04:44 2024
    "Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
    MitchAlsup1 wrote:


    I believe that the service intervals are fairly disjoint so they can
    be identified with a high precision timer. My guesses::
    a) 50ns
    b) 300ns
    c) 3µs
    d) 10ms

    An SSD reduces d significantly. Probably still disjoint, but not
    nearly as much.

    You can get modern DIMMs that include on-DIMM flash for persistent
    (a la magnetic core) memory.

    https://lenovopress.lenovo.com/tips1141-exflash-dimms

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jun 11 00:02:36 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache that
    may or may not be vulnerable to Spectre-like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Any ideas ??

    If you want to claim that there is a vulnerability, it's your job to
    demonstrate it.

    That being said, filling the OS disk cache requires I/O commands,
    typically effected through stores that make it to the I/O device. In
    cases where the I/O is effected through a load, the I/O device is in

    It is good, then, that no modern disk I/O is 'effected through
    a load'. It's DMA all the way down, and the inbound DMA transfer
    is scheduled by the SATA/NVMe/SAS controller. The timing of the
    start of the DMA will vary based on the host controller, the drive cache
    status and drive head scheduler, etc.

    IDE is dead, Jim.

    an uncacheable part of the address space (otherwise even
    non-speculative accesses would not work correctly, and speculative
    accesses would have caused havoc long ago), and the load is delayed
    until it is no longer speculative.

    It seems to me that there are 4 service intervals:
    a) in application buffering--this suffers no supervisor call latency
    b) hit in the disk cache--this suffers no I/O latency
    c) hit in the SATA drive cache--this suffers only I/O latency
    d) Drive positions head and waits for the sector to arrive

    I believe that the service intervals are fairly disjoint so they can
    be identified with a high precision timer. My guesses::
    a) 50ns
    b) 300ns
    c) 3µs
    d) 10ms

    {{I remember that just pushing I/O commands out PCIe takes µs amounts
    of time, while head and rotational delays of disks are ms.}}

    Modern disk drives have significant dram cache on-drive, so you
    cannot rely on seek time and rotational latency. Although if one
    knows the drive cache algorithm, it may be possible to game the
    transfers such that one can guarantee a miss (and disable any
    on-drive speculative reads into the drive cache).

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Jun 11 02:10:26 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache that
    may or may not be vulnerable to Spectre-like attacks.

    Consider an attack strategy that measures whether a disk sector/block
    is in (or not in) the OS disk cache. {Very similar to attacks that
    figure out if a cache line is in the Data Cache (or not).}

    Any ideas ??

    If you want to claim that there is a vulnerability, it's your job to
    demonstrate it.

    That being said, filling the OS disk cache requires I/O commands,
    typically effected through stores that make it to the I/O device. In
    cases where the I/O is effected through a load, the I/O device is in

    It is good, then, that no modern disk I/O is 'effected through
    a load'. It's DMA all the way down, and the inbound DMA transfer
    is scheduled by the SATA/NVMe/SAS controller. The timing of the
    start of the DMA will vary based on the host controller, the drive
    cache status and drive head scheduler, etc.

    I am operating under that model. It still takes the OS significant time*
    to go from read(file, buffer, length) to the point SR-IOV receives
    the command(s) at the drive itself. And then there is the interrupt
    routing delay, and ISR + softIRQ delays.

    (*) remember we are using a timer that increments at core frequency
    essentially counting instructions.

    IDE is dead, Jim.

    And for a long time.

    an uncacheable part of the address space (otherwise even
    non-speculative accesses would not work correctly, and speculative
    accesses would have caused havoc long ago), and the load is delayed
    until it is no longer speculative.

    It seems to me that there are 4 service intervals:
    a) in application buffering--this suffers no supervisor call latency
    b) hit in the disk cache--this suffers no I/O latency
    c) hit in the SATA drive cache--this suffers only I/O latency
    d) Drive positions head and waits for the sector to arrive

    I believe that the service intervals are fairly disjoint so they can
    be identified with a high precision timer. My guesses::
    a) 50ns
    b) 300ns
    c) 3µs
    d) 10ms

    {{I remember that just pushing I/O commands out PCIe takes µs amounts
    of time, while head and rotational delays of disks are ms.}}

    Modern disk drives have significant dram cache on-drive, so you
    cannot rely on seek time and rotational latency.

    Yes, that is the 3µs number. The OS has to send the I/O to the
    device, and the device has to send its response back, but there
    is no head positioning latency or rotational delay.

    Although if one
    knows the drive cache algorithm, it may be possible to game the
    transfers such that one can guarantee a miss (and disable any
    on-drive speculative reads into the drive cache).

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Jun 11 02:11:42 2024
    Stephen Fuld wrote:

    MitchAlsup1 wrote:

    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache
    that may or may not be vulnerable to Spectre-like attacks.

    Consider an attack strategy that measures whether a disk
    sector/block is in (or not in) the OS disk cache. {Very similar
    to attacks that figure out if a cache line is in the Data Cache
    (or not).}

    Any ideas ??

    If you want to claim that there is a vulnerability, it's your job to
    demonstrate it.

    That being said, filling the OS disk cache requires I/O commands,
    typically effected through stores that make it to the I/O device.
    In cases where the I/O is effected through a load, the I/O device
    is in an uncacheable part of the address space (otherwise even
    non-speculative accesses would not work correctly, and speculative
    accesses would have caused havoc long ago), and the load is delayed
    until it is no longer speculative.

    It seems to me that there are 4 service intervals:
    a) in application buffering--this suffers no supervisor call latency
    b) hit in the disk cache--this suffers no I/O latency
    c) hit in the SATA drive cache--this suffers only I/O latency
    d) Drive positions head and waits for the sector to arrive


    That's all true except if the "disk" is actually an SSD.



    I believe that the service intervals are fairly disjoint so they can
    be identified with a high precision timer. My guesses::
    a) 50ns
    b) 300ns
    c) 3µs
    d) 10ms

    An SSD reduces d significantly. Probably still disjoint, but not
    nearly as much.

    An SSD should reduce d to c.

    Do SSDs have their own cache ??

  • From Stephen Fuld@21:1/5 to All on Tue Jun 11 04:30:39 2024
    MitchAlsup1 wrote:

    Stephen Fuld wrote:

    MitchAlsup1 wrote:

    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    I am resurrecting this thread to talk about a different cache
    that may or may not be vulnerable to Spectre-like attacks.
    Consider an attack strategy that measures whether a disk
    sector/block is in (or not in) the OS disk cache. {Very similar
    to attacks that figure out if a cache line is in the Data Cache
    (or not).}
    Any ideas ??

    If you want to claim that there is a vulnerability, it's your job
    to demonstrate it.

    That being said, filling the OS disk cache requires I/O commands,
    typically effected through stores that make it to the I/O device.
    In cases where the I/O is effected through a load, the I/O device
    is in an uncacheable part of the address space (otherwise even
    non-speculative accesses would not work correctly, and speculative
    accesses would have caused havoc long ago), and the load is
    delayed until it is no longer speculative.

    It seems to me that there are 4 service intervals:
    a) in application buffering--this suffers no supervisor call latency
    b) hit in the disk cache--this suffers no I/O latency
    c) hit in the SATA drive cache--this suffers only I/O latency
    d) Drive positions head and waits for the sector to arrive


    That's all true except if the "disk" is actually an SSD.



    I believe that the service intervals are fairly disjoint so they
    can be identified with a high precision timer. My guesses::
    a) 50ns
    b) 300ns
    c) 3µs
    d) 10ms

    An SSD reduces d significantly. Probably still disjoint, but not
    nearly as much.

    An SSD should reduce d to c.


    Not quite. There is still a controller within the SSD that executes
    some code to perform its functionality (e.g. mapping, wear leveling,
    bad block bypassing, etc.). So d will be greater than c, but not by as
    much as it would be if it were a spinning disk.




    Do SSDs have their own cache ??


    Usually, yes.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Jun 11 11:39:50 2024
    Stephen Fuld wrote:
    MitchAlsup1 wrote:

    Do SSDs have their own cache ??

    Usually, yes.

    It has to have a substantial amount of RAM in order to coalesce writes,
    after applying all those remapping/wear leveling layers. This write-back
    cache has a limited amount of dirty buffers, preferably low enough that
    they can all be flushed to persistent storage in case of power loss.

    Every single thumb drive produced (so not just SSDs) contains a little
    32-bit CPU which does all that behind-the-curtain processing.

    The main task, however, is that when first turned on, the CPU will run
    a substantial amount of burn-in testing, and then decide how many flash
    pages are actually usable. This way (according to "Bunnie" Huang) every
    single flash chip manufactured is sold, be it at full or some much
    reduced capacity.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stefan Monnier@21:1/5 to Terje Mathisen on Tue Jun 11 16:20:59 2024
    It has to have a substantial amount of RAM in order to coalesce writes,
    after applying all those remapping/wear leveling layers. This write-back
    cache has a limited amount of dirty buffers, preferably low enough that
    they can all be flushed to persistent storage in case of power loss.

    IIUC some don't have much RAM to speak of (they probably have some
    on-CPU cache-size RAM, of course) and borrow some DRAM from the host
    system instead (accessed over PCIe). The technique is called HMB
    (Host Memory Buffer), and it's apparently quite popular.
    According to https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0229645
    the HMB is used mostly for the remapping table, while the write buffer
    used for coalescing is presumably some SLC flash.

    The main task however is that when first turned on, the CPU will run
    a substantial amount of burn-in testing, and then decide how many flash
    pages are actually usable.

    IIUC, another time-consuming task at startup can be to find the location
    of the logical->physical mapping table (since it can't be written at
    a fixed place in the flash, for wear-leveling reasons).


    Stefan
