• Memory dependency microbenchmark

    From Anton Ertl@21:1/5 to All on Fri Nov 3 09:15:58 2023
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on <http://www.complang.tuwien.ac.at/anton/memdep/>.

    You can find the text portion of this page below; there are a few
    links on the page that are not produced below.

    If you have information on any of the things I don't know (or don't
    remember), please let me know about it.

    - anton

    Memory Dependency Microbenchmark

    This microbenchmark tests some aspects of how hardware deals with
    dependences through memory. It performs the following loop:

    for (i=0; i<250000000; i++) {
      *b = *a+1;
      *d = *c+1;
      *f = *e+1;
      *h = *g+1;
    }

    The eight parameters of the binary let you choose to which memory
    location the pointers a b c d e f g h refer, in particular, whether
    some of them refer to the same memory location or not. Each parameter
    corresponds to one pointer, and two pointers refer to the same memory
    location iff the corresponding parameters are the same.

    This microbenchmark uses pointers that have been in their registers
    for a long time, so it does not test speculative alias detection by
    the hardware.
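
    To make the mapping from parameters to pointers concrete, here is a
    minimal sketch of a driver in the spirit of the description above (this
    is not the actual memdep source; the 1000-element long array and the
    8-byte granularity of the parameters follow from the discussion further
    down in this thread, the rest is an assumption for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    static long arr[1000];   /* 8000 bytes; the parameters index it in 8-byte units */

    int main(int argc, char **argv)
    {
      long i, *p[8];
      if (argc != 9) {
        fprintf(stderr, "usage: %s a b c d e f g h\n", argv[0]);
        exit(1);
      }
      for (i = 0; i < 8; i++)
        p[i] = arr + atol(argv[i+1]);      /* equal parameters -> same location */
      long *a=p[0], *b=p[1], *c=p[2], *d=p[3], *e=p[4], *f=p[5], *g=p[6], *h=p[7];
      for (i = 0; i < 250000000; i++) {
        *b = *a+1;
        *d = *c+1;
        *f = *e+1;
        *h = *g+1;
      }
      printf("%ld %ld %ld %ld\n", *b, *d, *f, *h);   /* keep the stores observable */
      return 0;
    }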

    I have measured using the following parameter combinations (in the
    order a b c d e f g h):

    * A 0 1 2 3 4 5 6 7: All computations (statements in the loop body)
    are completely independent. Ideally they can be performed as fast
    as the resources allow.
    * B 0 1 1 2 2 3 3 4: A sequence of 4 dependent computations, but
    the next iteration does not depend on the results of the previous
    one, so again, the work can be performed as fast as the resources
    allow. However, in order to achieve this performance, the loads
    of the next iteration have to be started while several
    architecturally earlier stores still have to be executed, and,
    comparing the results of B to those of A, all measured cores have
    more difficulty with that, resulting in slower performance.
    * C 0 0 2 2 4 4 6 6: 1-computation recurrences (i.e., the
    computation in the next iteration depends on a computation in the
    current iteration), four of those. So at most 4 of these
    computations (plus the loop overhead) can be performed in
    parallel.
    * D 0 1 1 0 2 3 3 2: 2-computation recurrences (i.e., two dependent
    computations in an iteration, and the first of those in the
    current iteration depends on the second one in the previous
    iteration), 2 of those. So at most two of these computations
    (plus the loop overhead) can be performed in parallel.
    * E 0 1 2 3 1 0 3 2: The same data flow as D, but the computations
    are arranged differently: Here we first have two independent
    computations, and then two computations that depend on the
    earlier computations.
    * F 0 1 1 2 2 0 3 3: A 3-computation recurrence and a 1-computation
    recurrence. In the best case you see the latency of three
    computations per iteration.
    * G 0 1 1 2 3 3 2 0: The same data flow as F, but the computations
    are arranged differently.
    * H 0 0 0 0 0 0 0 0: One 4-computation recurrence. These
    computations can only be performed sequentially; only the loop
    overhead can be performed in parallel with them.

    The results for different CPU cores are shown in the following table.
    The numbers are cycles per computation (statement in the loop body). To
    get the cycles per iteration, multiply by 4.

    A B C D E F G H microarchitecture CPU
    1.17 2.11 1.46 3.28 3.67 4.50 4.50 6.93 K8 Athlon 64 X2 4600+
    1.00 1.26 2.00 4.00 4.00 6.01 6.01 8.00 Zen Ryzen 5 1600X
    1.00 1.19 2.00 4.00 4.00 6.00 6.01 8.00 Zen2 Ryzen 9 3900X
    0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3 Ryzen 7 5800X
    1.00 1.52 1.99 3.70 3.66 4.65 4.62 7.42 Sandy Bridge Xeon E3-1220
    1.00 1.26 1.61 3.06 3.00 4.50 4.50 6.85 Haswell Core i7-4790K
    1.00 1.12 1.43 2.90 2.84 4.03 4.05 5.53 Skylake Core i5-6600K
    0.78 0.92 1.01 1.81 1.13 2.00 2.00 2.25 Rocket Lake Xeon W-1370P
    0.81 0.92 0.83 0.86 0.81 0.82 0.81 1.01 Tiger Lake Core i5-1135G7
    3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A53 Amlogic S905
    3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A55 RK3588
    1.25 2.26 2.06 3.95 4.00 6.18 6.18 7.67 Cortex-A72 RK3399
    1.50 1.77 1.63 3.00 3.03 4.10 4.07 5.51 Cortex-A76 RK3588
    1.03 1.66 1.50 3.00 3.00 4.50 4.50 7.84 IceStorm Apple M1 (variations from 6.0-8.75 for H)
    3.01 4.52 3.01 4.01 3.01 4.01 4.01 5.02 U74 JH7100 (RISC-V)

    There can be different sophistication levels of CPU cores, both in
    terms of dealing with aliases, and in terms of forwarding from stores
    to loads. And of course there is the difference between in-order
    execution and out-of-order execution.

    Aliases

    The simplest approach would be to assume that all memory accesses may
    refer to the same memory location. In that case we would expect that
    the core performs all parameter combinations as badly as H. None of
    the measured cores exhibits this behaviour, so obviously they are
    more sophisticated than that.

    The next approach is to allow architecturally later loads to be
    performed at the same time as, or, with OoO, even earlier than, stores
    to a different address. All tested cores seem to do this.

    Finally, there is the issue of what to do when the load is to the
    same address as an architecturally preceding store. This brings us to
    the next section:

    Store to load forwarding

    The simplest approach is to wait until the data arrives in the cache
    and then load it from there. I dimly remember seeing 15 cycles per
    iteration for a loop that incremented a memory location in every
    iteration, but none of the cores measured above takes that long
    (although, for Zen and Zen2 one might wonder).

    The next approach is to let the load read from the store buffer if
    the data is there (and first wait until it is there). In this case
    the whole sequence of store-load has a total latency that is not much
    higher than the load latency. It seems that most cores measured here
    do that. We see this best in the H results; e.g., a Skylake has 4
    cycles of load latency and 1 cycle of add latency, and we see 5.53
    cycles of latency for store-load-add, meaning that the store
    contributes an average latency of 0.53 cycles. There are some
    complications in case the load only partially overlaps the store (I
    should put an appropriate chipsncheese link here).

    Finally, the core might detect that the data that is loaded is coming
    from a store that has not been retired yet, so the physical register
    used by the store can be directly renamed into the register of the
    load result, and the load does not need to access the store buffer or
    cache at all (what is the name of this feature?). As a result, in the
    ideal case we see only the 1-cycle latency of the add in case H. In
    the measured cores, Zen3 and Tiger Lake exhibit this behaviour fully;
    Rocket Lake probably also does that, but either does not succeed in
    all cases (the different results between D and E speak for this
    theory), or there is an additional source of latency.

    I expect that Firestorm (the performance core of the Apple M1) also
    has this feature, but unfortunately the cycles performance counter
    does not work for Firestorm on Linux 6.4.0-asahi-00571-gda70cd78bc50.

    In-order vs. out-of-order execution

    With in-order execution (on Cortex A53 and A55, and on the U74), the
    loads cannot be executed before architecturally earlier stores, even
    if both refer to different memory locations. So even A is relatively
    slow on in-order cores. In-order execution also means that B is
    almost as slow as H, while with out-of-order execution it can
    theoretically be executed as fast as A (in practice it is slightly
    slower, but still much faster than H).

    With OoO, we see much better performance in cases where there are
    independent computation chains. Given the size of the buffers in the
    various OoO microarchitectures (hundreds of instructions in the
    reorder buffer, dozens in schedulers), it is surprising that B is
    slower than A given the small size of each iteration (~15
    instructions); and even D is slower than E on some
    microarchitectures, most notably Rocket Lake.

    Measuring your own hardware

    You can download the contents of this directory and run the benchmark
    with

    wget http://www.complang.tuwien.ac.at/anton/memdep/memdep.zip
    unzip memdep.zip
    cd memdep
    make

    If you want to do your own parameter combinations, you can run the
    binary with

    ./memdep-`uname -m` a b c d e f g h

    where a b c d e f g h are integers and correspond to the pointers in
    the loop. If you want to get results like in the table above, run it
    like this:

    perf stat --log-fd 3 -x, -e $(CYCLES) ./memdep-$(ARCH) a b c d e f g h 3>&1 | awk -F, '{printf("%5.2f\n",$$1/1000000000)}'
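
    Here $(CYCLES) and $(ARCH) are make variables, and $$1 is make's
    escaping of awk's $1. Expanded by hand for an x86-64 machine with the
    generic cycles event, a run of combination A might look as follows;
    dividing by 1000000000 yields cycles per computation because 250000000
    iterations times 4 statements are 10^9 computations:

    perf stat --log-fd 3 -x, -e cycles ./memdep-x86_64 0 1 2 3 4 5 6 7 3>&1 | awk -F, '{printf("%5.2f\n",$1/1000000000)}'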

    Future work

    Make a good microbenchmark that produces the addresses so late that
    either the core has to wait for the addresses or has to speculate
    whether the load accesses the same address as the store or not. Do it
    for predictable aliasing and for unpredictable aliasing. I remember a
    good article about this predictor (one that actually looked at Intel's
    patent), but don't remember the URL; Intel calls this technique, as
    implemented in Ice Lake and later P-cores, the Fast Store Forwarding
    Predictor (FSFP) (but my memory says that the article I read looked at
    older microarchitectures that have a similar feature). AMD
    describes such a hardware feature under the name predictive store
    forwarding (PSF), which they added in Zen3.
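
    A rough sketch of what the inner loop of such a benchmark could look
    like (an illustration of the idea, not an actual plan): the addresses
    are loaded from an index array immediately before being used, so the
    core has to wait for the store address or speculate about aliasing;
    filling the index array with always-equal or with randomly-equal pairs
    would give predictable or unpredictable aliasing:

    /* Hypothetical sketch: the store address becomes known only just
       before the store, so the core must either wait for it or predict
       whether the following load aliases the stored location. */
    enum { N = 1000000 };
    static long arr[1000];
    static long idx[2*N];           /* pairs of indexes into arr, range 0..999 */

    long late_address_loop(void)
    {
      long i, sum = 0;
      for (i = 0; i < N; i++) {
        long *dst = arr + idx[2*i];      /* store address produced late */
        long *src = arr + idx[2*i+1];    /* may or may not alias dst */
        *dst = i;
        sum += *src;                     /* must be ordered w.r.t. the store */
      }
      return sum;
    }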

    Related

    One thing I remember is that I have done a microbenchmark that was
    intended to measure predictive store forwarding, but it (also?)
    measured the forward-at-register-level technique described above.

    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Anton Ertl on Fri Nov 3 13:13:41 2023
    Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    I notice that all your 8 references are within the same cache line.
    Store to load forwarding might behave differently across lines
    and 4kB page boundaries.
    Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096

    extern void memdep(long *a, long *b, long *c, long *d,
                       long *e, long *f, long *g, long *h);

    for (i=0; i<1000; i++)
      a[i] = 0;
    memdep(a+offsets[0], a+offsets[1], a+offsets[2], a+offsets[3],
           a+offsets[4], a+offsets[5], a+offsets[6], a+offsets[7]);

  • From Chris M. Thomasson@21:1/5 to Anton Ertl on Fri Nov 3 15:29:32 2023
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Sat Nov 4 17:40:57 2023
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on
    <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.

    I don't know any architecture that requires memory barriers for
    single-threaded programs that access just memory, not even Alpha.

    You may be thinking of the memory consistency model of Alpha, which is
    even weaker than everything else I know of. This is not surprising,
    given that a prominent advocacy paper for weak consistency [adve&gharachorloo95] came out of DEC.
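
    The multi-threaded case Chris presumably has in mind is dereferencing a
    pointer published by another thread: Alpha is the architecture where
    even this data-dependent load needs a read barrier, while other
    weakly-ordered architectures order it through the data dependency. A
    minimal C11 sketch of that pattern (my illustration, not from the page
    above):

    #include <stdatomic.h>

    struct node { int payload; };

    static struct node n;
    static _Atomic(struct node *) published;

    /* producer: initialize the node, then publish a pointer to it */
    void publish(void)
    {
      n.payload = 42;
      atomic_store_explicit(&published, &n, memory_order_release);
    }

    /* consumer: the dereference depends on the loaded pointer; on Alpha even
       this data-dependent load needs ordering (memory_order_consume, or a
       read barrier), while other architectures order it by the dependency */
    int consume(void)
    {
      struct node *p = atomic_load_explicit(&published, memory_order_consume);
      return p ? p->payload : -1;
    }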

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to EricP on Sat Nov 4 17:01:32 2023
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I notice that all your 8 references are within the same cache line.
    Store to load forwarding might behave differently across lines
    and 4kB page boundaries.
    Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096

    The numbers are in 8-byte granularity, so to get different cache lines
    n*8 is good enough, and for different pages n*512. However, the
    array is only 8000 bytes long, so you can use only numbers in the
    range 0...999. Anyway, I made another make target "ericp", so when
    you "make ericp", you get the following parameter combinations:

    X 0 8 16 24 32 40 48 56

    Different cache lines, but always the same "bank" in a cache line;
    this should hurt K8, maybe also others.

    Y 0 9 18 27 36 45 54 63

    Different cache lines, and different banks for different accesses.

    Z 0 513 994 11 524 989 22 535

    All different cache lines and different banks; the second access is on
    a different page than the first, and the third is likely on a
    different page. Then start with the first page again. These are all independent accesses.

    Results:

    X Y Z A B C D E F G H
    1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3
    1.00 1.00 1.00 0.78 0.91 1.02 1.81 1.13 2.00 2.00 2.25 Rocketlake
    2.00 1.17 1.17 1.17 2.03 1.25 3.30 3.69 4.50 4.50 6.91 K8

    We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
    the dependence-wise equivalent A. My guess is that these CPUs can
    only store to one cache line per cycle, and can optimize the stores in
    some cases for A. If that is the case, that is a resource thing, not
    a failure to recognize independence.

    We see a slowdown for X on K8, which is somewhat expected; however,
    thinking about it, I wonder: It seems as if the K8 can do only one load
    or store per cycle if they are to the same bank, but several to
    different banks; it has been too long since I read how it worked. Y
    and Z are the same speed as A, showing that the distance between
    addresses does not influence the no-alias detection (at least for
    these distances).

    Is this what you had in mind?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From BGB@21:1/5 to Anton Ertl on Sat Nov 4 14:48:04 2023
    On 11/4/2023 12:40 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on
    <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.

    I don't know any architecture that requires memory barriers for single-threaded programs that access just memory, not even Alpha.

    You may be thinking of the memory consistency model of Alpha, which is
    even weaker than everything else I know of. This is not surprising,
    given that a prominent advocacy paper for weak consistency [adve&gharachorloo95] came out of DEC.


    Hmm, one could maybe save some logic and also make timing a little
    easier if they disallowed RAW and WAW (in the same cache line), and made
    them undefined behavior...

    However, this would suck for software (and would effectively either
    mandate strict stack alignment to keep prologs/epilogs working, or
    require saving registers in a convoluted order). And, in my case, I can
    tell that code steps on these scenarios very often (particularly in
    prolog sequences).


    This is why I bother with a "somewhat expensive" feature of forwarding
    stored cache lines back to loads in cases where one is accessing a
    "just accessed" cache line.

    Disabling this forwarding (and forcing a stall instead), tends to have a
    more noticeable adverse effect on performance (but is often needed to
    try to pass timing at higher clock speeds).

    Granted, I guess it is possible I could fiddle with the compiler to try
    to improve this situation, say:
    Only use MOV.X with a 16B alignment;
    Stagger even/odd stores in pairs when possible.

    Say, as opposed to:
    MOV.X R12, (SP, 160)
    MOV.X R10, (SP, 144)
    MOV.X R8, (SP, 128)
    MOV.X R30, (SP, 112)
    MOV.X R28, (SP, 96)
    MOV.X R26, (SP, 80)
    MOV.X R24, (SP, 64)
    Say:
    MOV.X R10, (SP, 144)
    MOV.X R12, (SP, 160)
    MOV.X R30, (SP, 112)
    MOV.X R8, (SP, 128)
    MOV.X R26, (SP, 80)
    MOV.X R28, (SP, 96)
    MOV.X R24, (SP, 64)

    Say, to try to avoid two adjacent stores within the same 32-byte
    paired-line (for 64-bit load, one would try to avoid two adjacent stores
    within the same 32 bytes).


    But, I sort of ended my 75MHz experiment for now, and fell back to
    50MHz, where I can more easily afford to have this forwarding (in which
    case the above becomes mostly N/A).


    As for cache-coherence between cores, this basically still does not
    exist yet.

    It also seems like it would require a bit more ringbus traffic to pull
    off, say:
    If a core wants to write to a line, it needs to flag its intention to do so;
    If a line was previously fetched for read, but is now being written, the
    line would need to be flushed and re-fetched with the new intention;
    If a line is being fetched for write, we somehow need to signal it to be flushed from whatever L1 cache was last holding it;
    ...



    But, the alternative (what I currently have), effectively disallows
    traditional forms of (writable) memory sharing between threads (and
    "volatile" memory access seems to be pretty slow at present). So, for
    multithreaded code, one would need to have the threads work mostly
    independently (without any shared mutable structures), and then trigger
    an L1 flush when each is done with its work (and the receiving thread
    also needs to trigger an L1 flush).

    Granted, similar ended up happening with the rasterizer module with
    TKRA-GL, where cache-flushing is needed whenever handing off the
    framebuffer between the main CPU and the rasterizer module (otherwise, graphical garbage and incompletely drawn geometry may result); and one
    also needs to flush the L1 cache and trigger the rasterizer module to
    flush its texture cache whenever uploading a new texture.

    And, as-is, texture uploading has ended up being horridly slow. So, even
    if I can (sort of) afford to run GLQuake with lightmaps now, dynamic
    lightmaps need to be disabled as it is too slow.

    Though, an intermediate possibility could be to store lightmaps with
    dynamic lights in (only) one of 2 states, with two different textures
    (dynamic light ON/Max, dynamic light OFF/Min). However, this would
    effectively disallow effects like "slow pulsate" (which could only
    alternate between fully-on and fully-off), as well as updates from
    dynamic light sources like rockets (where, as-is, firing a rocket with
    dynamic lightmaps eats the CPU something hard).

    Not entirely sure how something like a Pentium-1 managed all this either.

    Granted, faster in this case would still be to do everything with vertex
    lighting (as is currently still the default in my GLQuake port), albeit
    not as good, and a bit of a hack as Quake wasn't really designed for
    vertex lighting.



    [...]

  • From EricP@21:1/5 to Anton Ertl on Sun Nov 5 10:09:34 2023
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I notice that all your 8 references are within the same cache line.
    Store to load forwarding might behave differently across lines
    and 4kB page boundaries.
    Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096

    The numbers are in 8-byte granularity, so to get different cache lines
    n*8 is good enough, and for different pages n*512. However, the
    array is only 8000 bytes long, so you can use only numbers in the
    range 0...999. Anyway, I made another make target "ericp", so when
    you "make ericp", you get the following parameter combinations:

    X 0 8 16 24 32 40 48 56

    Different cache lines, but always the same "bank" in a cache line;
    this should hurt K8, maybe also others.

    Y 0 9 18 27 36 45 54 63

    Different cache lines, and different banks for different accesses.

    Z 0 513 994 11 524 989 22 535

    All different cache lines and different banks; the second access is on
    a different page than the first, and the third is likely on a
    different page. Then start with the first page again. These are all independent accesses.

    Results:

    X Y Z A B C D E F G H
    1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3
    1.00 1.00 1.00 0.78 0.91 1.02 1.81 1.13 2.00 2.00 2.25 Rocketlake
    2.00 1.17 1.17 1.17 2.03 1.25 3.30 3.69 4.50 4.50 6.91 K8

    We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
    the dependence-wise equivalent A. My guess is that these CPUs can
    only store to one cache line per cycle, and can optimize the stores in
    some cases for A. If that is the case, that is a resource thing, not
    a failure to recognize independence.

    It might be that A is occasionally combining the multiple stores
    to the same line in the store buffer whereas X Y Z do not.
    So maybe A needs 25% fewer cache accesses.

    We see a slowdown for X on K8, which is somewhat expected; however,
    thinking about it, I wonder: It seems as if the K8 can do only one load
    or store per cycle if they are to the same bank, but several to
    different banks; it has been too long since I read how it worked. Y
    and Z are the same speed as A, showing that the distance between
    addresses does not influence the no-alias detection (at least for
    these distances).

    Is this what you had in mind?

    - anton

    Yes. My thought was that by targeting the same cache line you
    might be triggering alternate mechanisms that cause serialization.

    First was that x86-TSO coherence allows a younger load to bypass (execute
    before) an older store to a non-overlapping address; otherwise it is
    serial. The detection of "same address" could be as high resolution as an
    8-byte operand or as low as a cache line. So by targeting separate cache
    lines it could allow more load-store bypassing and concurrency.

    Also, as you noted, by targeting the same cache line it would serialize
    on the same cache bank port, if it has multiple banks.

    And I just suggested different pages to disable any "same page"
    virtual address translation optimizations (if there are any).

  • From Chris M. Thomasson@21:1/5 to EricP on Sun Nov 5 12:46:36 2023
    On 11/5/2023 7:09 AM, EricP wrote:
    [...]

    Yes. My thought was that by targeting the same cache line you
    might be triggering alternate mechanisms that cause serialization.

    First was that x86-TSO coherence allows a younger load to bypass (execute before) an older store to a non-overlapping address, otherwise it is
    serial.
    The detection of "same address" could be as high resolution as 8-byte
    operand or as low as a cache line. So by targeting separate cache lines
    it could allow more load-store bypassing and concurrency.

    Also, as you noted, by targeting the same cache line it would serialize
    on the same cache bank port, if it has multiple banks.

    And I just suggested different pages to disable any "same page"
    virtual address translation optimizations (if there are any).

    Iirc, one can release a spinlock using an atomic store on x86, no
    LOCK'ED RMW. Btw, have you ever tried to implement hazard pointers on an
    x86? It requires an explicit memory barrier.
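
    A minimal C11 sketch of both points (an illustration, assuming x86-TSO:
    the unlock is a plain release store, while hazard-pointer publication
    needs an explicit store-load barrier, an MFENCE or a LOCKed instruction
    on x86):

    #include <stdatomic.h>

    static atomic_int lock;                 /* 0 = free, 1 = held */
    static _Atomic(void *) hazard_slot;     /* this thread's hazard pointer */

    /* releasing a spinlock: a plain release store, no LOCKed RMW needed on x86 */
    void unlock(void)
    {
      atomic_store_explicit(&lock, 0, memory_order_release);
    }

    /* acquiring a hazard pointer: publish the pointer, then an explicit
       store-load barrier (MFENCE or a locked instruction on x86), then
       re-check that the protected pointer has not changed */
    void *protect(_Atomic(void *) *shared)
    {
      void *p;
      do {
        p = atomic_load_explicit(shared, memory_order_relaxed);
        atomic_store_explicit(&hazard_slot, p, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
      } while (p != atomic_load_explicit(shared, memory_order_relaxed));
      return p;
    }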

  • From EricP@21:1/5 to Chris M. Thomasson on Wed Nov 8 09:26:23 2023
    Chris M. Thomasson wrote:
    On 11/5/2023 7:09 AM, EricP wrote:

    Yes. My thought was that by targeting the same cache line you
    might be triggering alternate mechanisms that cause serialization.

    First was that x86-TSO coherence allows a younger load to bypass (execute
    before) an older store to a non-overlapping address, otherwise it is
    serial.
    The detection of "same address" could be as high resolution as 8-byte
    operand or as low as a cache line. So by targeting separate cache lines
    it could allow more load-store bypassing and concurrency.

    Also, as you noted, by targeting the same cache line it would serialize
    on the same cache bank port, if it has multiple banks.

    And I just suggested different pages to disable any "same page"
    virtual address translation optimizations (if there are any).

    Iirc, one can release a spinlock using an atomic store on x86, no
    LOCK'ED RMW.

    Sure, because it's not a RMW. It's just a store which under x86-TSO
    becomes visible after prior stores to the protected section.

    This rule seems to impose some particular design complexities on the
    order in which a Load-Store Queue and cache can perform store operations.

    Say we have two cache lines A and B.

    If there is a store to cache line A, ST [A+0],
    then to a different line, ST [B+0], then another, ST [A+1],
    and if the first ST [A+0] hits cache but the second ST [B+0] misses,
    then under TSO the third ST [A+1] must appear to stall so that it
    does not become visible until after the ST [B+0] has been performed,
    even though line A is in the cache.

    ST [A+0],r1 <- this hits cache
    ST [B+0],r2 <- this misses cache
    ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish

    On core C1, if ST [A+1] was allowed to be performed before [B+0] then an invalidate msg might get in and transfer ownership of line A to a different core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].

    In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
    or it allows it to proceed to the cache but pins line A until the
    update to B is done. If it uses the second pinning approach then it
    must also deal with all the potential deadlock/livelock possibilities.

    And the cache access is pipelined so all of this is asynchronous.
    When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
    So even in the simple "stall until older stores done" approach it needs
    even more logic to detect this and NAK the following stores back to LSQ,
    and later wake them up and replay them when the ST [B+0] is done.

    Btw, have you ever tried to implement hazard pointers on an
    x86? It requires an explicit memory barrier.

    That lock-free stuff makes my brain hurt.

  • From MitchAlsup@21:1/5 to EricP on Fri Nov 10 01:42:15 2023
    EricP wrote:

    Chris M. Thomasson wrote:
    On 11/5/2023 7:09 AM, EricP wrote:

    Yes. My thought was that by targeting the same cache line you
    might be triggering alternate mechanisms that cause serialization.

    First was that x86-TSO coherence allows a younger load to bypass (execute
    before) an older store to a non-overlapping address, otherwise it is
    serial.
    The detection of "same address" could be as high resolution as 8-byte
    operand or as low as a cache line. So by targeting separate cache lines
    it could allow more load-store bypassing and concurrency.

    Also, as you noted, by targeting the same cache line it would serialize
    on the same cache bank port, if it has multiple banks.

    And I just suggested different pages to disable any "same page"
    virtual address translation optimizations (if there are any).

    Iirc, one can release a spinlock using an atomic store on x86, no
    LOCK'ED RMW.

    Sure, because it's not a RMW. It's just a store which under x86-TSO
    becomes visible after prior stores to the protected section.

    This rule seems to impose some particular design complexities on the
    order in which a Load-Store Queue and cache can perform store operations.

    Say we have two cache lines A and B.

    If there is a store to cache line A, ST [A+0],
    then to a different line, ST [B+0], then another, ST [A+1],
    and if the first ST [A+0] hits cache but the second ST [B+0] misses,
    then under TSO the third ST [A+1] must appear to stall so that it
    does not become visible until after the ST [B+0] has been performed,
    even though line A is in the cache.

    ST [A+0],r1 <- this hits cache
    ST [B+0],r2 <- this misses cache
    ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish

    On core C1, if ST [A+1] was allowed to be performed before [B+0] then an invalidate msg might get in and transfer ownership of line A to a different core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].

    In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
    <
    If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
    awaits "ordering" while the external world is left believing it has not
    started yet {This is 1991 technology}. CC can effectively eliminate ST
    ordering stalls seen from the CPU while preserving all of the TSO-ness
    the external observers need.
    <
    or it allows it to proceed to the cache but pins line A until the
    update to B is done. If it uses the second pinning approach then it
    must also deal with all the potential deadlock/livelock possibilities.
    <
    In a conditional Cache, every instruction has (at least a portion of)
    its associated Data Cache line. So every ST has a place to deposit its
    data; and that place can be subject to backup and cancellation (based on
    external stuff happening).
    <
    After the ST reaches the complete state (ready to retire), CC data is
    migrated to DC data as porting and banking permit.
    <
    And the cache access is pipelined so all of this is asynchronous.
    When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
    So even in the simple "stall until older stores done" approach it needs
    even more logic to detect this and NAK the following stores back to LSQ,
    and later wake them up and replay them when the ST [B+0] is done.

    Btw, have you ever tried to implement hazard pointers on an
    x86? It requires an explicit memory barrier.

    That lock-free stuff makes my brain hurt.

  • From MitchAlsup@21:1/5 to BGB on Fri Nov 10 01:36:31 2023
    BGB wrote:

    On 11/4/2023 12:40 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on
    <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.

    I don't know any architecture that requires memory barriers for
    single-threaded programs that access just memory, not even Alpha.

    You may be thinking of the memory consistency model of Alpha, which is
    even weaker than everything else I know of. This is not surprising,
    given that a prominent advocacy paper for weak consistency
    [adve&gharachorloo95] came out of DEC.


    Hmm, one could maybe save some logic and also make timing a little
    easier if they disallowed RAW and WAW (in the same cache line), and made
    them undefined behavior...
    <
    Intractable in practice.
    <
    What you CAN do is to dynamically solve this problem in HW, by constructing
    a memory dependency matrix and not allowing a memory reference to complete
    unless it is safe to do so.
    <
    However, this would suck for software (and would effectively either
    mandate strict stack alignment to keep prologs/epilogs working, or
    require saving registers in a convoluted order). And, in my case, I can
    tell that code steps on these scenarios very often (particularly in
    prolog sequences).
    <
    It only sucks when there is a REAL dependency and saves your neck every time
    a dependency is not found (mid-to-high 90%-ile).
    <
    This is why I bother with a "somewhat expensive" feature of forwarding
    stored cache lines back to load in cases where one is accessing a "just accessed" cache line.
    <
    I use a MDM (above) attached to a temporal cache.
    <
    When an instruction is inserted in the MDM/TC==CC, it is made dependent on
    all instructions that are already in the CC that have not completed, and is
    installed in the Unknown Address state.
    <
    When an address is AGENed, the address is sent to the data cache and a portion is sent to the CC and that portion CAMs against all the other addresses. The result of the comparison is either "Is the same line" or "Can't be the same line"
    and the state changes to Known Address.
    <
    While the compares are taking place, the MDM is relaxed. Any entry that
    "can't be the same as" is removed as a dependency. When all dependencies
    have been removed, the instruction can complete.
    <
    If either CC or DC returns with data, the state advances to Known Data. Stores with Known data are allowed to complete into CC. Modified CC data migrates back to DC when the ST instruction becomes retireable.
    <
    At this point the ST has completed as seen by the CPU, but has not completed
    as seen by "interested Observers", and is subject to prediction repair, AGEN replay, and those rare situations where one changes the memory ordering model {ATOMICs, MMI/O, Config space}. This is the key point--the CPU can think "its done" while the external world can think "It has not happened yet".
    <
    Thus, memory references that alias on a cache line basis are performed in order,
    while those that are not run independently.
    <
    The conditional (temporal) cache holds the address and <at least a portion>
    of the associated cache line. Any reference can hit on that data in CC even
    if it is port or bank blocked in the DC. CC hits a bit over 50% of the time,
    effectively reducing cache port/bank conflicts for lines that have been
    touched within the window of the temporal cache.
    <
    Disabling this forwarding (and forcing a stall instead), tends to have a
    more noticeable adverse effect on performance (but is often needed to
    try to pass timing at higher clock speeds).

    Granted, I guess it is possible I could fiddle with the compiler to try
    to improve this situation, say:
    Only use MOV.X with a 16B alignment;
    Stagger even/odd stores in pairs when possible.

    Say, as opposed to:
    MOV.X R12, (SP, 160)
    MOV.X R10, (SP, 144)
    MOV.X R8, (SP, 128)
    MOV.X R30, (SP, 112)
    MOV.X R28, (SP, 96)
    MOV.X R26, (SP, 80)
    MOV.X R24, (SP, 64)
    Say:
    MOV.X R10, (SP, 144)
    MOV.X R12, (SP, 160)
    MOV.X R30, (SP, 112)
    MOV.X R8, (SP, 128)
    MOV.X R26, (SP, 80)
    MOV.X R28, (SP, 96)
    MOV.X R24, (SP, 64)
    <
    I will use this as an example of why you want save/restore instructions
    in ISA::
    a) so the compiler does not need to deal with ordering problems
    b) so fewer instructions are produced.

    Say, to try to avoid two adjacent stores within the same 32-byte
    paired-line (for 64-bit load, one would try to avoid two adjacent stores within the same 32 bytes).
    <
    Solved by CC in my case.
    <


  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Nov 10 01:43:04 2023
    Anton Ertl wrote:

    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures.
    <
    Absolutely Brilliant.
    <
    Well Done, and Thanks.

  • From BGB@21:1/5 to MitchAlsup on Thu Nov 9 22:10:54 2023
    On 11/9/2023 7:36 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/4/2023 12:40 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures.  You can find it
    along with a description and results on
    <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.

    I don't know any architecture that requires memory barriers for
    single-threaded programs that access just memory, not even Alpha.

    You may be thinking of the memory consistency model of Alpha, which is
    even weaker than everything else I know of.  This is not surprising,
    given that a prominent advocacy paper for weak consistency
    [adve&gharachorloo95] came out of DEC.


    Hmm, one could maybe save some logic and also make timing a little
    easier if they disallowed RAW and WAW (in the same cache line), and
    made them undefined behavior...
    <
    Intractable in practice.
    <
    What you CAN do is to dynamically solve this problem in HW. By constructing
    a memory dependency matrix and not allowing a memory reference to complete unless it is safe to do so.
    <

    Probably true.

    My options were one of:
    Forward results from an in-progress Store back to a following Load/Store
    (more expensive, but faster);
    Stall the pipeline until the prior Store can complete (cheaper but slower).


    Granted, the "not bother" option could be cheaper still, but carries an unreasonable level of risk (and in practice would likely mean that any
    store through a free pointer followed by another memory access would end
    up needing to use NOP padding or similar).

    Or, handling it in the CPU, by generating an interlock stall whenever a
    Store is followed by another memory operation. But, this would probably
    be the worst possible option in terms of performance...


    However, this would suck for software (and would effectively either
    mandate strict stack alignment to keep prologs/epilogs working, or
    require saving registers in a convoluted order). And, in my case, I
    can tell that code steps on these scenarios very often (particularly
    in prolog sequences).
    <
    It only suck when there is a REAL dependency and saves you neck every time
    a dependency is not found (mid-to-high 90%-ile).
    <

    OK.


    This is why I bother with a "somewhat expensive" feature of forwarding
    stored cache lines back to load in cases where one is accessing a
    "just accessed" cache line.
    <
    I use a MDM (above) attached to a temporal cache. <
    When an instruction is inserted in the MDM/TC==CC, it is made dependent
    on all
    instructions that are already in the CC that have not completed. and installed
    in the Unknown Address state.
    <
    When an address is AGENed, the address is sent to the data cache and a portion
    is sent to the CC and that portion CAMs against all the other addresses.
    The
    result of the comparison is either "Is the same line" or "Can't be the
    same line"
    and the state changes to Known Address.
    <
    While the compares are taking place, the MDM is relaxed. And entry that "can't
    be the same as" is removed as a dependency. When all dependencies have been removed, the instruction can complete.
    <
    If either CC or DC returns with data, the state advances to Known Data.
    Stores with Known Data are allowed to complete into CC. Modified CC data
    migrates back to DC when the ST instruction becomes retireable.
    <
    At this point the ST has completed as seen by the CPU, but has not completed
    as seen by "interested Observers", and is subject to prediction repair, AGEN
    replay, and those rare situations where one changes the memory ordering
    model {ATOMICs, MMI/O, Config space}. This is the key point--the CPU can
    think "it's done" while the external world can think "it has not happened
    yet".
    <
    Thus, memory references that alias on a cache line basis are performed in
    order, while those that do not alias run independently.
    <
    The conditional (temporal) cache holds the address and <at least a portion>
    of the associated cache line. Any reference can hit on that data in CC even
    if it is port or bank blocked in the DC. CC hits a bit over 50% of the time,
    effectively reducing cache port/bank conflicts for lines touched within the
    lifetime of the temporal cache.
    <
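
    As a rough C sketch of the state progression just described (Unknown
    Address -> Known Address -> Known Data, with the dependency-matrix row
    relaxed as addresses resolve); all type, field, and function names here
    are invented for illustration, not the actual Mc 88120 / My 66000 design:

      /* Rough, illustrative model of the MDM/CC states described above;
       * names and sizes are made up, not the actual hardware design. */
      #include <stdbool.h>
      #include <stdint.h>

      enum cc_state { UNKNOWN_ADDRESS, KNOWN_ADDRESS, KNOWN_DATA, COMPLETE };

      #define CC_ENTRIES 32

      struct cc_entry {
          enum cc_state state;
          uint64_t      line;              /* cache-line part of the address   */
          bool          valid;
          bool          dep[CC_ENTRIES];   /* one row of the dependency matrix */
      };

      static struct cc_entry cc[CC_ENTRIES];

      /* Insert: made dependent on every entry that has not yet completed. */
      static void cc_insert(int idx)
      {
          cc[idx].state = UNKNOWN_ADDRESS;
          cc[idx].valid = true;
          for (int i = 0; i < CC_ENTRIES; i++)
              cc[idx].dep[i] = (i != idx) && cc[i].valid &&
                               cc[i].state != COMPLETE;
      }

      /* AGEN: the address CAMs against the others; entries that "can't be
       * the same line" are dropped from the dependency row. */
      static void cc_agen(int idx, uint64_t line)
      {
          cc[idx].line  = line;
          cc[idx].state = KNOWN_ADDRESS;
          for (int i = 0; i < CC_ENTRIES; i++)
              if (cc[idx].dep[i] && cc[i].state != UNKNOWN_ADDRESS &&
                  cc[i].line != line)
                  cc[idx].dep[i] = false;          /* relax the matrix */
      }

      /* A reference may complete once all of its dependencies are gone. */
      static bool cc_can_complete(int idx)
      {
          for (int i = 0; i < CC_ENTRIES; i++)
              if (cc[idx].dep[i])
                  return false;
          return true;
      }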

    In my pipeline, it is simpler...

    Just, the L1 cache sees that the next stage holds the result of a store
    to the same cache line, and either forwards the result or generates a
    stall until the store completes.

    Where, forwarding is faster, but stalling is cheaper.



    Disabling this forwarding (and forcing a stall instead), tends to have
    a more noticeable adverse effect on performance (but is often needed
    to try to pass timing at higher clock speeds).
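
    For what it's worth, the forward-vs-stall choice above boils down to
    something like the following sketch (structure and names are invented for
    illustration; ENABLE_FORWARDING stands in for the cheaper-vs-faster cache
    configuration knob):

      /* Illustrative only: forward from a still-pending store to the same
       * cache line, or stall the load until that store has completed. */
      #include <stdbool.h>
      #include <stdint.h>

      #define ENABLE_FORWARDING 1   /* 0 = cheaper config: stall instead */

      struct pending_store {
          bool     valid;
          uint64_t line;
          uint8_t  data[16];
      };

      /* Returns true if the load can complete this cycle. */
      static bool l1_load(uint64_t line, const struct pending_store *st,
                          const uint8_t mem_line[16], uint8_t out[16])
      {
          if (st->valid && st->line == line) {
      #if ENABLE_FORWARDING
              for (int i = 0; i < 16; i++)
                  out[i] = st->data[i];        /* forward the stored line */
              return true;
      #else
              return false;                    /* stall until store completes */
      #endif
          }
          for (int i = 0; i < 16; i++)
              out[i] = mem_line[i];            /* ordinary cache hit */
          return true;
      }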

    Granted, I guess it is possible I could fiddle with the compiler to
    try to improve this situation, say:
       Only use MOV.X with a 16B alignment;
       Stagger even/odd stores in pairs when possible.

    Say, as opposed to:
       MOV.X R12, (SP, 160)
       MOV.X R10, (SP, 144)
       MOV.X R8,  (SP, 128)
       MOV.X R30, (SP, 112)
       MOV.X R28, (SP, 96)
       MOV.X R26, (SP, 80)
       MOV.X R24, (SP, 64)
    Say:
       MOV.X R10, (SP, 144)
       MOV.X R12, (SP, 160)
       MOV.X R30, (SP, 112)
       MOV.X R8,  (SP, 128)
       MOV.X R26, (SP, 80)
       MOV.X R28, (SP, 96)
       MOV.X R24, (SP, 64)
    <
    I will use this as an example as to Why you want save/restore instructions
    in ISA::
    a) so the compiler does not need to deal with ordering problems
    b) so fewer instructions are produced.


    Still haven't gotten around to this, but it could potentially help
    performance with the cheaper cache options (namely, configurations
    without the internal forwarding).

    But, yeah, the compiler does still need to deal with this sometimes.


    Say, to try to avoid two adjacent stores within the same 32-byte
    paired-line (for 64-bit load, one would try to avoid two adjacent
    stores within the same 32 bytes).
    <
    Solved by CC in my case.

    OK.

    <


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to EricP on Fri Nov 10 12:57:36 2023
    On 11/8/2023 6:26 AM, EricP wrote:
    Chris M. Thomasson wrote:
    On 11/5/2023 7:09 AM, EricP wrote:

    Yes. My thought was that by targeting the same cache line you
    might be triggering alternate mechanisms that cause serialization.

    First was that x86-TSO coherence allows a younger load to bypass
    (execute
    before) an older store to a non-overlapping address, otherwise it is
    serial.
    The detection of "same address" could be as high resolution as 8-byte
    operand or as low as a cache line. So by targeting separate cache lines
    it could allow more load-store bypassing and concurrency.

    Also, as you noted, by targeting the same cache line it would serialize
    on the same cache bank port, if it has multiple banks.

    And I just suggested different pages to disable any "same page"
    virtual address translation optimizations (if there are any).

    Iirc, one can release a spinlock using an atomic store on x86, no
    LOCK'ED RMW.

    Sure, because it's not a RMW. It's just a store which under x86-TSO
    becomes visible after prior stores to the protected section.

    This rule seems to impose some particular design complexities on the
    order in which a Load-Store Queue and cache can perform store operations.

    Say we have two cache lines A and B.

    If there is a store to cache line ST [A+0],
    then to a different line ST [B+0], then another ST [A+1],
    and if the first ST [A+0] hits cache but the second ST [B+0] misses,
    then under TSO the third ST [A+1] must appear to stall so that it
    does not become visible until after the ST [B+0] has been performed,
    even though line A is in the cache.

      ST [A+0],r1   <- this hits cache
      ST [B+0],r2   <- this misses cache
      ST [A+1],r3   <- this waits for B to arrive and store to [B+0] to finish

    On core C1, if ST [A+1] was allowed to be performed before [B+0] then an invalidate msg might get in and transfer ownership of line A to a different core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].

    In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
    or it allows it to proceed to the cache but pins line A until the
    update to B is done. If it uses the second pinning approach then it
    must also deal with all the potential deadlock/livelock possibilities.

    And the cache access is pipelined so all of this is asynchronous.
    When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
    So even in the simple "stall until older stores done" approach it needs
    even more logic to detect this and NAK the following stores back to LSQ,
    and later wake them up and replay them when the ST [B+0] is done.
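
    In software terms, the guarantee being preserved here is the classic
    message-passing pattern: if the stores became visible out of order, a
    reader that sees the second store could miss the first. A small C11
    sketch (thread and variable names are invented):

      /* Message-passing litmus test: if the reader sees flag == 1 it must
       * also see data == 42. TSO gives this for plain stores; portable C11
       * spells it release/acquire. Illustrative names only. */
      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static _Atomic int data = 0;
      static _Atomic int flag = 0;

      static void *writer(void *arg)
      {
          (void)arg;
          atomic_store_explicit(&data, 42, memory_order_relaxed);
          atomic_store_explicit(&flag, 1, memory_order_release);
          return NULL;
      }

      static void *reader(void *arg)
      {
          (void)arg;
          while (!atomic_load_explicit(&flag, memory_order_acquire))
              ;                                     /* spin */
          printf("%d\n", atomic_load_explicit(&data, memory_order_relaxed));
          return NULL;                              /* always prints 42 */
      }

      int main(void)
      {
          pthread_t w, r;
          pthread_create(&r, NULL, reader, NULL);
          pthread_create(&w, NULL, writer, NULL);
          pthread_join(w, NULL);
          pthread_join(r, NULL);
          return 0;
      }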

    Btw, have you ever tried to implement hazard pointers on an x86? It
    requires an explicit memory barrier.

    That lock-free stuff makes my brain hurt.

    Iirc, hazard pointers require a store followed by a load to another
    location to be honored. This requires a membar on x86.
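
    For reference, the store-then-load pattern in question looks roughly like
    this in C11 (a minimal sketch, not any particular hazard-pointer library;
    names are invented). The seq_cst fence is the membar that becomes an
    MFENCE or locked RMW on x86:

      /* Hazard-pointer publish step: the hazard store must be ordered
       * before the re-load of the shared pointer (a #StoreLoad ordering),
       * otherwise a reclaimer can miss the announcement and free the node. */
      #include <stdatomic.h>
      #include <stddef.h>

      struct node { int value; };

      static _Atomic(struct node *) shared;      /* pointer being protected   */
      static _Atomic(struct node *) my_hazard;   /* this thread's hazard slot */

      static struct node *acquire_hazard(void)
      {
          struct node *p, *q;
          do {
              p = atomic_load_explicit(&shared, memory_order_relaxed);
              atomic_store_explicit(&my_hazard, p, memory_order_relaxed);
              /* The membar discussed above: a full fence between the
               * hazard store and the verifying load. */
              atomic_thread_fence(memory_order_seq_cst);
              q = atomic_load_explicit(&shared, memory_order_relaxed);
          } while (p != q);              /* retry if the pointer moved */
          return p;
      }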

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Nov 10 23:35:55 2023
    Chris M. Thomasson wrote:

    On 11/8/2023 6:26 AM, EricP wrote:


    Btw, have you ever tried to implement hazard pointers on an x86? It
    requires an explicit memory barrier.

    That lock-free stuff makes my brain hurt.

    Iirc, hazard pointers require a store followed by a load to another
    location to be honored. This requires a membar on x86.
    <
    It is stuff like this that made My 66000 architecture define changes
    to memory order based on several pieces of state (thus no membars)
    <
    Accesses to ROM are unordered
    Accesses to config space is strongly ordered
    Accesses to MMI/O space is sequentially consistent
    Participating accesses (ATOMIC) are sequentially consistent
    everything else is causal.
    And the HW tracks this on a per memory reference basis--in effect
    all orderings are in effect all the time.
    <
    Performance guys get what they want,
    Lamport guys (atomic) get what they want
    Device drivers get what they want on the accesses that need it
    ..

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Nov 10 23:31:20 2023
    BGB wrote:

    On 11/9/2023 7:36 PM, MitchAlsup wrote:

    Granted, I guess it is possible I could fiddle with the compiler to
    try to improve this situation, say:
       Only use MOV.X with a 16B alignment;
       Stagger even/odd stores in pairs when possible.

    Say, as opposed to:
       MOV.X R12, (SP, 160)
       MOV.X R10, (SP, 144)
       MOV.X R8,  (SP, 128)
       MOV.X R30, (SP, 112)
       MOV.X R28, (SP, 96)
       MOV.X R26, (SP, 80)
       MOV.X R24, (SP, 64)
    Say:
       MOV.X R10, (SP, 144)
       MOV.X R12, (SP, 160)
       MOV.X R30, (SP, 112)
       MOV.X R8,  (SP, 128)
       MOV.X R26, (SP, 80)
       MOV.X R28, (SP, 96)
       MOV.X R24, (SP, 64)
    <
    I will use this as an example as to Why you want save/restore instructions
    in ISA::
    a) so the compiler does not need to deal with ordering problems
    b) so fewer instructions are produced.


    Still haven't gotten around to this, but it could potentially help performance with the cheaper cache options (namely, configurations
    without the internal forwarding).
    <
    Imagine a scenario where you are returning from one function call only
    to call another function (*). Since EXIT performs all the state reload
    (and the RET), and the return IP is read first so the front end can
    fetch-decode to feed the machine.....
    (*) possibly moving registers around to form the argument list.
    <
    NOW imagine that the front end encounters a CALL and the instruction
    at the target of the CALL is ENTER.
    a) the restore process can stop-or-complete
    b) the save process can mostly be skipped since memory already contains
    the correct bit patterns (My 66000 ABI)
    <
    Saving cycles a compiler cannot.....
    <
    But, yeah, the compiler does still need to deal with this sometimes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Fri Nov 10 20:07:21 2023
    On 11/10/2023 3:35 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/8/2023 6:26 AM, EricP wrote:


    Btw, have you ever tried to implement hazard pointers on an x86? It
    requires an explicit memory barrier.

    That lock-free stuff makes my brain hurt.

    Iirc, hazard pointers require a store followed by a load to another
    location to be honored. This requires a membar on x86.
    <
    It is stuff like this that made My 66000 architecture define changes to memory order based on several pieces of state (thus no membars)
    <
    Accesses to ROM are unordered
    Accesses to config space is strongly ordered
    Accesses to MMI/O space is sequentially consistent
    Participating accesses (ATOMIC) are sequentially consistent
    everything else is causal.
    And the HW tracks this on a per memory reference basis--in effect
    all orderings are in effect all the time.
    <
    Performance guys get what they want,
    Lamport guys (atomic) get what they want
    Device drivers get what they want on the accesses that need it
    ..

    Nice! Well, there is a way to avoid the explicit membar on x86. It
    involves a marriage of RCU and Hazard Pointers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Sat Nov 11 16:43:54 2023
    MitchAlsup wrote:
    EricP wrote:

    Chris M. Thomasson wrote:
    On 11/5/2023 7:09 AM, EricP wrote:

    Yes. My thought was that by targeting the same cache line you
    might be triggering alternate mechanisms that cause serialization.

    First was that x86-TSO coherence allows a younger load to bypass
    (execute
    before) an older store to a non-overlapping address, otherwise it is
    serial.
    The detection of "same address" could be as high resolution as 8-byte
    operand or as low as a cache line. So by targeting separate cache lines
    it could allow more load-store bypassing and concurrency.

    Also, as you noted, by targeting the same cache line it would serialize
    on the same cache bank port, if it has multiple banks.

    And I just suggested different pages to disable any "same page"
    virtual address translation optimizations (if there are any).

    Iirc, one can release a spinlock using an atomic store on x86, no
    LOCK'ED RMW.

    Sure, because its not a RMW. Its just a store which under x86-TSO
    becomes visible after prior stores to the protected section.

    This rule seems to imposes some particular design complexities on the
    order a Load-Store Queue and cache can perform store operations.

    Say we have two cache lines A and B.

    If there is a store to cache line ST [A+0]
    then to a different line ST [B+0], then a another ST [A+1],
    and if the first ST [A+0] hits cache but second ST [B+0] misses,
    then under TSO the third ST [A+1] must appear to stall so that it
    does not become visible until after the ST [B+0] has been performed,
    even though line A is in the cache.

    ST [A+0],r1 <- this hits cache
    ST [B+0],r2 <- this misses cache
    ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to
    finish

    On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
    invalidate msg might get in and transfer ownership of line A to a
    different
    core C2, allowing the new value of [A+1] to be visible at C2 before
    [B+0].

    In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
    <
    If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
    awaits "ordering" while the external world is left believing it has not
    started yet {This is 1991 technology}. CC can effectively eliminate ST
    ordering stalls seen by the CPU while preserving all of the TSO-ness
    the external observers need.

    MDM = Memory Dependency Matrix
    TC = Temporal Cache
    CC = Conditional Cache

    The Conditional Cache sounds similar to the Store Buffers that other
    designs refer to but with multiple versions, as you outline below.
    This seems to duplicate many existing functions of the Load Store Queue.

    I'm thinking it would be simpler to keep everything in one circular
    load/store queue and hold store data there after retire until it
    can be handed off to the cache. See below.

    <
    or it allows it to proceed to the cache but pins line A until the
    update to B is done. If it uses the second pinning approach then it
    must also deal with all the potential deadlock/livelock possibilities.
    <
    In a conditional Cache, every instruction has (at least a portion) of
    its Data Cache associated line. So every ST has a place to deposit its
    data; and that place can be subject to backup and cancellation (based on
    external stuff happening).
    <
    After the ST reaches the complete state (ready to retire), CC data is
    migrated to DC data as porting and banking permit.
    <

    Yes, but that seems a rather expensive approach.
    The problem I have with the CC is that it requires some pretty complex
    logic to track multiple versions of cache lines, journal before-image copies, track their inter-dependencies, and decide when to commit updates.
    And much of this duplicates functionality that the LSQ already has to
    support store-to-load forwarding and load-store bypass address matching.
    All those buffers need free lists and allocators, another dependency matrix,
    and CAMs to match addresses to the assigned buffers.

    Its not clear to me that all this is significantly better than
    a simpler approach.

    Instead I was thinking of having a unified LSQ as a single circular buffer
    with all the load and store entries in (circular) program order.
    Store data is held in the LSQ after the store instruction retires
    until it is accepted by the cache. While it is in the LSQ it can still be forwarded to younger loads to the same address.

    LSQ stores are sent to the pipelined cache which indicates back to LSQ
    a hit/miss after the pipeline latency.
    If it hits then the entry is removed from LSQ.
    If it misses then the store data is held in LSQ and cache blocks further
    stores until the miss resolves. However LSQ continues to send future
    store addresses to trigger line prefetches until all miss buffers are busy.
    If a load misses then all further loads must stop because
    load-load bypassing is not allowed under TSO.

    When the missed line arrives, cache sends a wakeup signal to LSQ
    which restarts sending entries from where it left off.
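
    A rough software model of that unified queue might look like the sketch
    below (the entry layout, states, and forwarding scan all use invented
    names, just to make the structure concrete):

      /* Illustrative model of a unified, circular LSQ: entries sit in
       * program order; store data stays here after retire until the cache
       * accepts it, and can still be forwarded to younger loads. */
      #include <stdbool.h>
      #include <stdint.h>

      enum lsq_kind  { LSQ_LOAD, LSQ_STORE };
      enum lsq_state { LSQ_ISSUED, LSQ_RETIRED, LSQ_DONE };

      #define LSQ_SIZE 64

      struct lsq_entry {
          enum lsq_kind  kind;
          enum lsq_state state;
          uint64_t       addr;
          uint64_t       data;        /* store data held until cache accepts */
          bool           valid;
      };

      struct lsq {
          struct lsq_entry e[LSQ_SIZE];
          unsigned head, tail;        /* circular, program order             */
          bool     store_blocked;     /* a store missed; hold younger stores */
          bool     load_blocked;      /* a load missed; no load-load bypass  */
      };

      /* Store-to-load forwarding: walk back from the youngest entry looking
       * for an older, not-yet-drained store to the same address. */
      static bool lsq_forward(const struct lsq *q, uint64_t addr, uint64_t *out)
      {
          for (unsigned n = 1; n <= LSQ_SIZE; n++) {
              unsigned i = (q->tail + LSQ_SIZE - n) % LSQ_SIZE;
              const struct lsq_entry *s = &q->e[i];
              if (!s->valid)
                  break;
              if (s->kind == LSQ_STORE && s->addr == addr && s->state != LSQ_DONE) {
                  *out = s->data;
                  return true;
              }
              if (i == q->head)
                  break;
          }
          return false;
      }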

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sat Nov 11 23:32:30 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    Chris M. Thomasson wrote:
    On 11/5/2023 7:09 AM, EricP wrote:

    Yes. My thought was that by targeting the same cache line you
    might be triggering alternate mechanisms that cause serialization.

    First was that x86-TSO coherence allows a younger load to bypass
    (execute
    before) an older store to a non-overlapping address, otherwise it is serial.
    The detection of "same address" could be as high resolution as 8-byte
    operand or as low as a cache line. So by targeting separate cache lines
    it could allow more load-store bypassing and concurrency.

    Also, as you noted, by targeting the same cache line it would serialize
    on the same cache bank port, if it has multiple banks.

    And I just suggested different pages to disable any "same page"
    virtual address translation optimizations (if there are any).

    Iirc, one can release a spinlock using an atomic store on x86, no
    LOCK'ED RMW.

    Sure, because its not a RMW. Its just a store which under x86-TSO
    becomes visible after prior stores to the protected section.

    This rule seems to imposes some particular design complexities on the
    order a Load-Store Queue and cache can perform store operations.

    Say we have two cache lines A and B.

    If there is a store to cache line ST [A+0]
    then to a different line ST [B+0], then a another ST [A+1],
    and if the first ST [A+0] hits cache but second ST [B+0] misses,
    then under TSO the third ST [A+1] must appear to stall so that it
    does not become visible until after the ST [B+0] has been performed,
    even though line A is in the cache.

    ST [A+0],r1 <- this hits cache
    ST [B+0],r2 <- this misses cache
    ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to
    finish

    On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
    invalidate msg might get in and transfer ownership of line A to a different
    core C2, allowing the new value of [A+1] to be visible at C2 before
    [B+0].

    In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
    <
    If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
    awaits "ordering" while the external world is left believing it has not
    started yet {This is 1991 technology}. CC can effectively eliminate ST
    ordering stalls seen from the SPU while preserving all of the TSO-ness
    the external observers need.

    MDM = Memory Dependency Matrix
    TC = Temporal Cache
    CC = Conditional Cache

    The Conditional Cache sounds similar to the Store Buffers that other
    designs refer to but with multiple versions, as you outline below.
    This seems to duplicate many existing functions of the Load Store Queue.

    I'm thinking it would be simpler to keep everything in one circular load/store queue and hold store data there after retire until it
    can be handed off to the cache. See below.
    <
    a) it is circular and tied intimately to the execution window.
    b) it integrates the LDs into the queue {just in case a store has
    to be replayed all dependent LDs get replayed too and for solving
    memory order problems dynamically.}

    <
    or it allows it to proceed to the cache but pins line A until the
    update to B is done. If it uses the second pinning approach then it
    must also deal with all the potential deadlock/livelock possibilities.
    <
    In a conditional Cache, every instructions has (at least a portion) of
    its Data Cache associated line. So every ST has a place to deposit its
    data; and that place can be subject to backup and cancellation (based on
    external stuff happening}.
    <
    After the ST reached the complete state (ready to retire) CC data is
    migrated to DC data as porting and banking permiti.
    <

    Yes, but that seems a rather expensive approach.
    The problem I have with the CC is that it requires some pretty complex
    logic to track multiple versions of cache lines, journal before-image copies, track their inter-dependencies, and decide when to commit updates.
    <
    We did not track all that stuff (Mc 88120, circa 1991). What we did was to
    copy data from the cache/memory into everybody needing that data; then
    stores could overwrite data into the cache and into younger accesses, so
    most of the STs did not actually write into DC because we knew a younger
    store would do that for us.
    <
    And much of this duplicates functionality that the LSQ already has to
    support store-to-load forwarding and load-store bypass address matching.
    <
    The CC is the LD/ST queue, but by integrating certain features, most of the control logic vanishes, and the component "pretty much" manages itself. It
    does speak back to the Execution window logic to control the advancement of
    the consistent point and the retire point of the window.
    <
    All those buffer need free lists and allocators, another dependency matrix, CAM's to match addresses to the assigned buffers.
    <
    Nope:: dedicated storage. If the storage is not available, insert stalls.
    <
    Its not clear to me that all this is significantly better than
    a simpler approach.
    <
    Think of it like a reservation station for a horizon of memory reference
    instructions. Instead of CAMing on renamed registers, you CAM on a portion
    of VA from the CPU side, and on PA for the snooping side. The MDM handles
    the memory renaming in a way that can be undone {Branch prediction repair
    and Snoop discovering memory order has been violated.}

    Instead I was thinking of having a unified LSQ as a single circular buffer with all the load and store entries in (circular) program order.
    <
    Yep, that is what a CC does. However, once 2 addresses discover "we cannot refer to the same cache line container" they become independent.
    <
    Store data is held in the LSQ after the store instruction retires
    <
    The ST must deliver its data to DC before the ST can retire. Until the ST delivers its data to DC, the ST is subject to being replayed upon detection
    of memory order violations.
    <
    until it is accepted by the cache. While it is in the LSQ it can still be forwarded to younger loads to the same address.

    LSQ stores are sent to the pipelined cache which indicates back to LSQ
    a hit/miss after the pipeline latency.
    <
    And hit data is written into all CC entries waiting for it.
    <
    If it hits then the entry is removed from LSQ.
    <
    Entries are inserted into CC at issue and removed from CC at retire. They remain replayable until consistent.
    <
    If it misses then the store data is held in LSQ and cache blocks further stores until the miss resolves. However LSQ continues to send future
    <
    Mc 88120 kept byte writes in the CC.data so STs could write data into CC
    even before data arrives from the memory hierarchy.
    <
    store address to trigger line prefetches until all miss buffers are busy.
    <
    The CC is the Miss Buffer !! and entries that missed but got translated
    are selected for external access in much the same way that instructions
    are selected from reservation stations.
    <
    If a load misses then all further loads must stop because
    load-load bypassing is not allowed until TSO.
    <
    But the MDM effectively allows multiple LDs to miss and be serviced
    independently {except config, MMI/O, ATOMIC} and repair back to TSO
    only if an external event requires actual TSO behavior.
    <
    When the missed line arrives, cache sends a wakeup signal to LSQ
    which restarts sending entries from where it left off.
    <
    When miss data returns with the PA, a portion of the PA is used and everybody
    who is waiting on the returning data snarfs it up in the same write (into CC)
    cycle.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sun Nov 12 19:13:07 2023
    Chris M. Thomasson wrote:

    On 11/10/2023 3:35 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/8/2023 6:26 AM, EricP wrote:


    Btw, have you ever tried to implement hazard pointers on an x86? It
    requires an explicit memory barrier.

    That lock-free stuff makes my brain hurt.

    Iirc, hazard pointers require a store followed by a load to another
    location to be honored. This requires a membar on x86.
    <
    It is stuff like this that made My 66000 architecture define changes to
    memory order based on several pieces of state (thus no membars)
    <
    Accesses to ROM are unordered
    Accesses to config space is strongly ordered
    Accesses to MMI/O space is sequentially consistent
    Participating accesses (ATOMIC) are sequentially consistent
    everything else is causal.
    And the HW tracks this on a per memory reference basis--in effect
    all orderings are in effect all the time.
    <
    Performance guys get what they want,
    Lamport guys (atomic) get what they want
    Device drivers get what they want on the accesses that need it
    ..

    Nice! Well, there is a way to avoid the explicit membar on x86. It
    involves a marriage of RCU and Hazard Pointers.
    <
    I watched membar evolve during my time at AMD and decided to make a
    machine that never needs to use them--but still gets the right thing
    done.
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Sun Nov 12 14:01:46 2023
    On 11/12/2023 11:13 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/10/2023 3:35 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/8/2023 6:26 AM, EricP wrote:


    Btw, have you ever tried to implement hazard pointers on an x86?
    It requires an explicit memory barrier.

    That lock-free stuff makes my brain hurt.

    Iirc, hazard pointers require a store followed by a load to another
    location to be honored. This requires a membar on x86.
    <
    It is stuff like this that made My 66000 architecture define changes
    to memory order based on several pieces of state (thus no membars)
    <
    Accesses to ROM are unordered
    Accesses to config space is strongly ordered
    Accesses to MMI/O space is sequentially consistent
    Participating accesses (ATOMIC) are sequentially consistent
    everything else is causal.
    And the HW tracks this on a per memory reference basis--in effect
    all orderings are in effect all the time.
    <
    Performance guys get what they want,
    Lamport guys (atomic) get what they want
    Device drivers get what they want on the accesses that need it
    ..

    Nice! Well, there is a way to avoid the explicit membar on x86. It
    involves a marriage of RCU and Hazard Pointers.
    <
    I watched membar evolve during my time at AMD and decided to make a
    machine that never needs to use them--but still gets the right thinig
    done.
    <

    TSO? Oh, wait. I need to go think back about your system. It's been some
    years since we conversed about it. Btw, what happened to the Mill?
    Anyway, I digress.

    Fwiw, imho, it helps to think about proper alignment and boundaries, try
    to work with a given architecture, not against it.

    A highly relaxed memory model can be beneficial for certain workloads.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Kent Dickey on Sun Nov 12 14:39:11 2023
    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true. In general, it
    is assumed to be true without proof.

    I believe that statement to be false. Can you describe some of these workloads?

    One example... Basically, think along the lines of RCU. Usually,
    read-mostly, write-rarely workloads can _greatly_ benefit from not using
    any memory barriers. And this is on rather relaxed systems. Think SPARC
    in RMO mode. DEC Alpha is a "special case" wrt RCU.
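
    As a concrete (and deliberately minimal) illustration of that read side,
    assuming an RCU-style published pointer; the reader is just a dependent
    load, with no barrier instruction needed on anything but Alpha. All names
    below are invented:

      /* RCU-flavoured read side: one pointer load, dependency ordered.
       * Grace-period/reclamation machinery is not shown. */
      #include <stdatomic.h>
      #include <stddef.h>

      struct cfg { int threshold; };

      static _Atomic(struct cfg *) current_cfg;

      /* Reader: no barrier instruction on x86/ARM/POWER for this. */
      static int read_threshold(void)
      {
          struct cfg *c = atomic_load_explicit(&current_cfg,
                                               memory_order_consume);
          return c ? c->threshold : 0;
      }

      /* Writer: publish a fully initialized object with a release store;
       * the old object may only be freed after a grace period. */
      static void publish_cfg(struct cfg *fresh)
      {
          atomic_store_explicit(&current_cfg, fresh, memory_order_release);
      }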

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Kent Dickey on Sun Nov 12 14:43:18 2023
    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true. In general, it
    is assumed to be true without proof.

    I believe that statement to be false. Can you describe some of these workloads?

    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would ruin performance right off the bat... Think about it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to Chris M. Thomasson on Sun Nov 12 22:33:11 2023
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true. In general, it
    is assumed to be true without proof.

    I believe that statement to be false. Can you describe some of these workloads?

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Mon Nov 13 00:20:51 2023
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true. In general, it
    is assumed to be true without proof.

    I believe that statement to be false. Can you describe some of these
    workloads?

    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather than the
    right answer later. There are very few algorithms with this property.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Kent Dickey on Mon Nov 13 00:19:21 2023
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true. In general, it
    is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a performance advantage when the code is single threaded.
    <
    I believe that statement to be false. Can you describe some of these workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are accessing data.
    <
    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Sun Nov 12 16:28:20 2023
    On 11/12/2023 4:20 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true.  In
    general, it
    is assumed to be true without proof.

    I believe that statement to be false.  Can you describe some of these
    workloads?

    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would
    ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather than
    the right answer later. There are very few algorithms with this property.

    That does not make any sense to me. Think of a basic mutex. It basically requires an acquire membar for the lock and a release membar for the
    unlock. On SPARC that would be:

    acquire = MEMBAR #LoadStore | #LoadLoad

    release = MEMBAR #LoadStore | #StoreStore

    Okay, fine. However, if I made them sequentially consistent, it would
    require a damn #StoreLoad barrier for both acquire and release. This is
    not good and should be avoided when possible.
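
    Spelled out in portable C11 (a minimal spinlock sketch, not any particular
    implementation), the pairing above looks like this; neither side needs the
    #StoreLoad that seq_cst would drag in:

      /* Minimal spinlock: acquire on lock, release on unlock.
       *   acquire ~ MEMBAR #LoadStore | #LoadLoad
       *   release ~ MEMBAR #LoadStore | #StoreStore            */
      #include <stdatomic.h>

      typedef struct { atomic_flag f; } spinlock_t;

      static void sl_lock(spinlock_t *l)
      {
          while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
              ;   /* spin */
      }

      static void sl_unlock(spinlock_t *l)
      {
          atomic_flag_clear_explicit(&l->f, memory_order_release);
      }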

    Also, RCU prides itself on not having to use any memory barriers for
    its read side. If RCU was forced to use seq cst, basically a LOCKED RMW
    or MFENCE on Intel, it would completely ruin its performance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to Chris M. Thomasson on Mon Nov 13 05:44:43 2023
    In article <uirqj3$9q9q$1@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 4:20 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.
    I know a lot of people believe that statement to be true.  In
    general, it
    is assumed to be true without proof.

    I believe that statement to be false.  Can you describe some of these
    workloads?

    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would
    ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather than
    the right answer later. There are very few algorithms with this property.

    That does not make any sense to me. Think of a basic mutex. It basically
    requires an acquire membar for the lock and a release membar for the
    unlock. On SPARC that would be:

    acquire = MEMBAR #LoadStore | #LoadLoad

    release = MEMBAR #LoadStore | #StoreStore

    Okay, fine. However, if I made them sequentially consistent, it would
    require a damn #StoreLoad barrier for both acquire and release. This is
    not good and should be avoided when possible.

    Also, RCU prides itself with not having to use any memory barriers for
    its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
    or MFENCE on Intel, it would completely ruin its performance.

    You are using the terms in the exact opposite meaning as I would understand computer architecture.

    We'll just assume 3 choices for CPU ordering:

    - Sequential consistency (SC). Hardware does everything, there are no
    barriers needed, ld/st instructions appear to happen in some
    global order.
    - Total Store Ordering (TSO) (x86). Stores appear to be done in program
    order, but a CPU can peek and see its own local store results
    before other CPUs can. Loads appear to be done in some total
    program order (not counting hitting its own local stores).
    TSO is like SC except there's effectively a store queue, and
    stores can finish when placed in the queue, and the queue drains
    in FIFO order. Needs no barriers, except for special cases like
    Lamport's algorithm (it's easy to avoid barriers).
    - Relaxed. Loads and stores are not ordered, users have to put in memory
    barriers and hope they did it right.

    So, a highly relaxed memory model is the ONLY model which needs barriers.
    If you want to get rid of barriers, use a better memory model.
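
    The "special cases like Lamport's algorithm" above are the store-buffering
    pattern: each thread stores to its own flag and then loads the other's.
    Even TSO lets both loads return 0 unless a full fence sits between the
    store and the load. A small C11 sketch (names are illustrative):

      /* Store-buffering litmus test. Without the seq_cst fences, r1 == 0
       * and r2 == 0 together is a legal outcome even on TSO hardware;
       * with them, at least one thread must see the other's store. */
      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static _Atomic int x = 0, y = 0;
      static int r1, r2;

      static void *t1(void *arg)
      {
          (void)arg;
          atomic_store_explicit(&x, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);   /* the #StoreLoad */
          r1 = atomic_load_explicit(&y, memory_order_relaxed);
          return NULL;
      }

      static void *t2(void *arg)
      {
          (void)arg;
          atomic_store_explicit(&y, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);
          r2 = atomic_load_explicit(&x, memory_order_relaxed);
          return NULL;
      }

      int main(void)
      {
          pthread_t a, b;
          pthread_create(&a, NULL, t1, NULL);
          pthread_create(&b, NULL, t2, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          printf("r1=%d r2=%d\n", r1, r2);   /* never both zero with fences */
          return 0;
      }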

    A relaxed ordering SYSTEM says rather than spending a few thousand
    gates getting ordering right by hardware in the CPU, instead we're going
    to require software to put in some very difficult to understand barriers.
    And we're going to have a 45-page academic paper using all of the Greek
    alphabet to describe when you need to put in barriers. Literally no one
    understands all the rules, so the best bet is to put in too many barriers
    and wait for someone to nitpick your code and fix it for you.

    [I have a theorem: there is no correct non-trivial multithreaded program
    on an architecture which requires barriers for correctness.].

    A very simple thought exercise shows that even if Sequential Consistency
    and/or TSO were slower (and I maintain they are not), a Relaxed Ordering
    system WILL be slower than TSO or
    Sequential for workloads which often use barriers (instructions tagged
    with acquire/release are barriers). In a Relaxed ordering system, the
    barriers will not be as efficient as the automatic barriers of TSO/SC (otherwise, why not just do that?), so if barriers are executed often, performance will be lower than hardware TSO/SC, even if there are no contentions or any work for the barriers to do. In fact, performance
    could be significantly lower.

    People know this, it's why they keep trying to get rid of barriers in
    their code. So get rid of them all and demand TSO ordering.

    Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers. It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to MitchAlsup on Mon Nov 13 05:54:31 2023
    In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
    MitchAlsup <mitchalsup@aol.com> wrote:
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true. In general, it
    is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a performance advantage
    when the code is single threaded.

    I believe a Relaxed Memory model provides a small performance improvement
    ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
    there's nothing to order).

    Relaxed Memory ordering provides approximately zero performance improvement
    to an OoO CPU, and in fact, might actually lower performance (depends on
    how barriers are done--if done poorly, it could be a big negative).

    Yes, the system designers of the world have said: let's slow down our
    fastest, most expensive, most profitable CPUs, so we can speed up our
    cheapest, lowest-profit CPUs a few percent, and push a ton of work onto
    software developers.

    It's crazy.

    I believe that statement to be false. Can you describe some of these
    workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are accessing
    data.

    Probably need to clarify with "accessing modified data".

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Kent Dickey on Mon Nov 13 11:49:31 2023
    Kent Dickey <kegs@provalid.com> wrote:
    We'll just assume 3 choices for CPU ordering:

    - Sequential consistency (SC). Hardware does everything, there are no
    barriers needed, ld/st instructions appear to happen in some
    global order.
    - Total Store Ordering (TSO) (x86). Stores appear to be done in program
    order, but a CPU can peek and see its own local store results
    before other CPUs can. Loads appear to be done in some total
    program order (not counting hitting its own local stores).
    TSO is like SC except there's effectively a store queue, and
    stores can finish when placed in the queue, and the queue drains
    in FIFO order. Needs no barriers, except for special cases like
    Lamport's algorithm (it's easy to avoid barriers).
    - Relaxed. Loads and stores are not ordered, users have to put in memory
    barriers and hope they did it right.

    So, a highly relaxed memory model is the ONLY model which needs barriers.

    But this isn't true. Real processors aren't anywhere near as wildly
    chaotic as this.

    The common form of relaxed memory we see today is causally consistent
    and multi-copy atomic (and cache coherent). So, all other threads see
    stores to a single location in the same order, and you don't get the extraordinary going-backwards-in-time behaviour of DEC Alpha.

    A relaxed ordering SYSTEM says rather than spending a few thousand
    gates getting ordering right by hardware in the CPU, instead we're
    going to require software to put in some very difficult to
    understand barriers.

    I don't think that's really true. The reorderings we see in currently-
    produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware
    reordering in a multi-threaded C program it may well be (probably is!)
    a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.
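
    A small example of that point (deliberately broken code, invented names):
    with plain non-atomic variables the compiler alone may reorder, merge, or
    hoist these accesses, so the "reordering" is a C-level data race even on
    a TSO machine:

      /* Racy on purpose: both variables should be _Atomic (or the accesses
       * otherwise synchronized). As written, the compiler may reorder the
       * two stores and may hoist the flag load out of the loop. */
      static int payload;
      static int ready;

      void producer(void)
      {
          payload = 42;
          ready   = 1;          /* may be reordered or merged by the compiler */
      }

      int consumer(void)
      {
          while (!ready)        /* load may legally be hoisted: infinite loop */
              ;
          return payload;       /* may observe a stale value */
      }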

    A very simple thought exercise shows even if Sequential Consistency
    and/or TSO were slower (and I maintain they are not), but even if
    you believe that, a Relaxed Ordering system WILL be slower than TSO
    or Sequential for workloads which often use barriers (instructions
    tagged with acquire/release are barriers). In a Relaxed ordering
    system, the barriers will not be as efficient as the automatic
    barriers of TSO/SC (otherwise, why not just do that?),

    Whyever not? They do the same thing.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Kent Dickey on Mon Nov 13 13:22:44 2023
    Kent Dickey wrote:
    In article <uirqj3$9q9q$1@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 4:20 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.
    I know a lot of people believe that statement to be true. In
    general, it
    is assumed to be true without proof.

    I believe that statement to be false. Can you describe some of these
    workloads?
    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would
    ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather than
    the right answer later. There are very few algorithms with this property.
    That does not make any sense to me. Think of a basic mutex. It basically
    requires an acquire membar for the lock and a release membar for the
    unlock. On SPARC that would be:

    acquire = MEMBAR #LoadStore | #LoadLoad

    release = MEMBAR #LoadStore | #StoreStore

    Okay, fine. However, if I made them sequentially consistent, it would
    require a damn #StoreLoad barrier for both acquire and release. This is
    not good and should be avoided when possible.

    Also, RCU prides itself with not having to use any memory barriers for
    its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
    or MFENCE on Intel, it would completely ruin its performance.

    You are using the terms in the exact opposite meaning as I would understand computer architecture.

    We'll just assume 3 choices for CPU ordering:

    - Sequential consistency (SC). Hardware does everything, there are no
    barriers needed, ld/st instructions appear to happen in some
    global order.
    - Total Store Ordering (TSO) (x86). Stores appear to be done in program
    order, but a CPU can peek and see its own local store results
    before other CPUs can. Loads appear to be done in some total
    program order (not counting hitting its own local stores).
    TSO is like SC except there's effectively a store queue, and
    stores can finish when placed in the queue, and the queue drains
    in FIFO order. Needs no barriers, except for special cases like
    Lamport's algorithm (it's easy to avoid barriers).
    - Relaxed. Loads and stores are not ordered, users have to put in memory
    barriers and hope they did it right.

    So, a highly relaxed memory model is the ONLY model which needs barriers.
    If you want to get rid of barriers, use a better memory model.

    A relaxed ordering SYSTEM says rather than spending a few thousand
    gates getting ordering right by hardware in the CPU, instead we're going
    to require software to put in some very difficult to understand barriers.
    And we're going to have a 45 page academic paper using all the greek
    alphabet to describe when you need to put in barriers. Literally no one understands all the rules, so the best bet is put in too many barriers
    and wait for someone to nitpick your code and fix it for you.

    [I have a theorem: there is no correct non-trivial multithreaded program
    on an architecture which requires barriers for correctness.].

    A very simple thought exercise shows even if Sequential Consistency
    and/or TSO were slower (and I maintain they are not), but even if you
    believe that, a Relaxed Ordering system WILL be slower than TSO or
    Sequential for workloads which often use barriers (instructions tagged
    with acquire/release are barriers). In a Relaxed ordering system, the barriers will not be as efficient as the automatic barriers of TSO/SC (otherwise, why not just do that?), so if barriers are executed often, performance will be lower than hardware TSO/SC, even if there are no contentions or any work for the barriers to do. In fact, performance
    could be significantly lower.

    People know this, it's why they keep trying to get rid of barriers in
    their code. So get rid of all them and demand TSO ordering.

    Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers. It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Non-concurrent code does not see the relaxed ordering, and should benefit
    from extra concurrency in the Load Store Queue and cache that relaxed rules allow, because the local core always sees its own memory as consistent.
    For example, relaxed ordering allows multiple LD and ST to be in
    multiple pipelines to multiple cache banks at once without regard
    as to the exact order the operations are applied.

    This is fine for non concurrently accessed data structures,
    either non-shared data areas or shared but guarded by mutexes.

    But relaxed is hard for people to reason about for concurrently accessed
    lock free data structures. Now these don't just appear out of thin air so
    it is reasonable for a program to emit TSO_START and TSO_END instructions.
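
    Purely as a sketch of how that might look to software (tso_start() and
    tso_end() below are invented stand-ins for the proposed TSO_START/TSO_END
    instructions, not anything that exists today):

      /* Hypothetical: wrap only the lock-free region in TSO mode and let
       * ordinary code run under the relaxed rules. Stand-in no-ops here. */
      #include <stdatomic.h>
      #include <stddef.h>

      static inline void tso_start(void) { /* would be the TSO_START insn */ }
      static inline void tso_end(void)   { /* would be the TSO_END insn   */ }

      struct node { struct node *next; int v; };
      static _Atomic(struct node *) top;

      void lockfree_push(struct node *n)
      {
          tso_start();                      /* reason about this region in TSO */
          struct node *old;
          do {
              old = atomic_load_explicit(&top, memory_order_relaxed);
              n->next = old;
          } while (!atomic_compare_exchange_weak_explicit(
                       &top, &old, n,
                       memory_order_relaxed, memory_order_relaxed));
          tso_end();                        /* back to relaxed ordering */
      }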

    On the other hand, almost no code is lock-free or ever will be.
    So why have all the extra HW logic to support TSO if it's only really
    needed for this rare kind of programming?

    But there is also a category of memory area that is not covered by the
    above rules, one where one core thinks its memory is local and not shared
    but in fact it is being accessed concurrently.

    If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
    a subroutine passing a pointer to a value on T1's stack, and that pointer
    is passed to thread T2 (a driver) on core C2 which accesses that memory,
    then even if T2 declared itself to be using TSO rules it would not force
    T1 on C1 obey them.

    Where this approach could fail is the kind of laissez-faire sharing done
    by many apps, libraries, and OS's behind the scenes in the real world.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to aph@littlepinkcloud.invalid on Mon Nov 13 19:16:41 2023
    aph@littlepinkcloud.invalid wrote:

    Kent Dickey <kegs@provalid.com> wrote:
    We'll just assume 3 choices for CPU ordering:

    - Sequential consistency (SC). Hardware does everything, there are no
    barriers needed, ld/st instructions appear to happen in some
    global order.
    - Total Store Ordering (TSO) (x86). Stores appear to be done in program
    order, but a CPU can peek and see its own local store results
    before other CPUs can. Loads appear to be done in some total
    program order (not counting hitting its own local stores).
    TSO is like SC except there's effectively a store queue, and
    stores can finish when placed in the queue, and the queue drains
    in FIFO order. Needs no barriers, except for special cases like
    Lamport's algorithm (it's easy to avoid barriers).
    - Relaxed. Loads and stores are not ordered, users have to put in memory
    barriers and hope they did it right.

    So, a highly relaxed memory model is the ONLY model which needs barriers.

    But this isn't true. Real processors aren't anywhere near as wildly
    chaotic as this.

    The common form of relaxed memory we see today is causally consistent
    and multi-copy atomic (and cache coherent). So, all other threads see
    stores to a single location in the same order, and you don't get the extraordinary going-backwards-in-time behaviour of DEC Alpha.

    A relaxed ordering SYSTEM says rather than spending a few thousand
    gates getting ordering right by hardware in the CPU, instead we're
    going to require software to put in some very difficult to
    understand barriers.
    <
    The control logic to perform it may be that small; the state to maintain it
    is at least (5×48^2)×1.2 gates.

    I don't think that's really true. The reorderings we see in currently- produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware reordering in a multi-threaded C program it may well be (probably is!)
    a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.
    <
    Andy Glew used to report that there were surprisingly few instructions
    performed out-of-order on a GBOoO machine.

    A very simple thought exercise shows even if Sequential Consistency
    and/or TSO were slower (and I maintain they are not), but even if
    you believe that, a Relaxed Ordering system WILL be slower than TSO
    or Sequential for workloads which often use barriers (instructions
    tagged with acquire/release are barriers). In a Relaxed ordering
    system, the barriers will not be as efficient as the automatic
    barriers of TSO/SC (otherwise, why not just do that?),

    Whyever not? They do the same thing.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Nov 13 19:29:30 2023
    EricP wrote:

    Kent Dickey wrote:
    In article <uirqj3$9q9q$1@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 4:20 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.
    I know a lot of people believe that statement to be true. In
    general, it
    is assumed to be true without proof.

    I believe that statement to be false. Can you describe some of these
    workloads?
    Also, think about converting any sound lock-free algorithm's finely
    tuned memory barriers to _all_ sequential consistency... That would
    ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather than
    the right answer later. There are very few algorithms with this property.
    That does not make any sense to me. Think of a basic mutex. It basically
    requires an acquire membar for the lock and a release membar for the
    unlock. On SPARC that would be:

    acquire = MEMBAR #LoadStore | #LoadLoad

    release = MEMBAR #LoadStore | #StoreStore

    Okay, fine. However, if I made them sequentially consistent, it would
    require a damn #StoreLoad barrier for both acquire and release. This is
    not good and should be avoided when possible.

    Also, RCU prides itself with not having to use any memory barriers for
    its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
    or MFENCE on Intel, it would completely ruin its performance.

    You are using the terms in the exact opposite meaning as I would understand
    computer architecture.

    We'll just assume 3 choices for CPU ordering:

    - Sequential consistency (SC). Hardware does everything, there are no
    barriers needed, ld/st instructions appear to happen in some
    global order.
    - Total Store Ordering (TSO) (x86). Stores appear to be done in program
    order, but a CPU can peek and see its own local store results
    before other CPUs can. Loads appear to be done in some total
    program order (not counting hitting its own local stores).
    TSO is like SC except there's effectively a store queue, and
    stores can finish when placed in the queue, and the queue drains
    in FIFO order. Needs no barriers, except for special cases like
    Lamport's algorithm (it's easy to avoid barriers).
    - Relaxed. Loads and stores are not ordered, users have to put in memory
    barriers and hope they did it right.

    So, a highly relaxed memory model is the ONLY model which needs barriers.
    If you want to get rid of barriers, use a better memory model.

    A relaxed ordering SYSTEM says rather than spending a few thousand
    gates getting ordering right by hardware in the CPU, instead we're going
    to require software to put in some very difficult to understand barriers.
    And we're going to have a 45 page academic paper using all the greek
    alphabet to describe when you need to put in barriers. Literally no one
    understands all the rules, so the best bet is put in too many barriers
    and wait for someone to nitpick your code and fix it for you.

    [I have a theorem: there is no correct non-trivial multithreaded program
    on an architecture which requires barriers for correctness.].

    A very simple thought exercise shows even if Sequential Consistency
    and/or TSO were slower (and I maintain they are not), but even if you
    believe that, a Relaxed Ordering system WILL be slower than TSO or
    Sequential for workloads which often use barriers (instructions tagged
    with acquire/release are barriers). In a Relaxed ordering system, the
    barriers will not be as efficient as the automatic barriers of TSO/SC
    (otherwise, why not just do that?), so if barriers are executed often,
    performance will be lower than hardware TSO/SC, even if there are no
    contentions or any work for the barriers to do. In fact, performance
    could be significantly lower.

    People know this, it's why they keep trying to get rid of barriers in
    their code. So get rid of all them and demand TSO ordering.

    Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers. It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.
    <
    I suggest this switch between modes be done without executing any extra instructions. I do the switches based on the address space of the access
    {DRAM, MMI/O, config, ROM} and I also switch to SC when an ATOMIC event
    begins.
    <
    Non-concurrent code does not see the relaxed ordering, and should benefit from extra concurrency in the Load Store Queue and cache that relaxed rules allow, because the local core always sees its own memory as consistent.
    For example, relaxed ordering allows multiple LD and ST to be in
    multiple pipelines to multiple cache banks at once without regard
    as to the exact order the operations are applied.
    <
    I suspect SUN lost significant performance by always running TSO and
    it still required barrier instructions.
    <
    This is fine for non concurrently accessed data structures,
    either non-shared data areas or shared but guarded by mutexes.

    But relaxed is hard for people to reason about for concurrently accessed
    lock free data structures.
    <
    It is hard for people who have a completely von Neumann thinking pattern
    to reason about these things; it is not hard for someone whose
    entire career was spent doing a multiplicity of things concurrently.
    <
    SW languages (and debuggers,...) teach people to think:: this happens, then
    that happens, then something else happens. HW languages teach people to think
    "crap, all of this is happening at once, how do I make sense out of it".
    <
    In CPU design (or chip design in general) one is NEVER given the illusion
    that a single <vast> state describes the moment. It is a shame that SW
    did not follow a similar route.
    <
    Now these don't just appear out of thin air so
    it is reasonable for a program to emit TSO_START and TSO_END instructions.

    On the other hand, almost no code is lock-free or ever will be.
    So why have all the extra HW logic to support TSO if it's only really
    needed for this rare kind of programming?
    <
    I made this exact argument to SUN circa 1993.....
    <
    But there is also a category of memory area that is not covered by the
    above rules, one where one core thinks its memory is local and not shared
    but in fact it is being accessed concurrently.

    If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
    a subroutine passing a pointer to a value on T1's stack, and that pointer
    is passed to thread T2 (a driver) on core C2 which accesses that memory,
    then even if T2 declared itself to be using TSO rules, it would not force
    T1 on C1 to obey them.

    Where this approach could fail is the kind of laissez-faire sharing done
    by many apps, libraries, and OS's behind the scenes in the real world.

    So, anything written in JavaScript........

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to aph@littlepinkcloud.invalid on Mon Nov 13 13:11:18 2023
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
    Kent Dickey <kegs@provalid.com> wrote:
    We'll just assume 3 choices for CPU ordering:

    - Sequential consistency (SC). Hardware does everything, there are no
    barriers needed, ld/st instructions appear to happen in some
    global order.
    - Total Store Ordering (TSO) (x86). Stores appear to be done in program
    order, but a CPU can peek and see its own local store results
    before other CPUs can. Loads appear to be done in some total
    program order (not counting hitting its own local stores).
    TSO is like SC except there's effectively a store queue, and
    stores can finish when placed in the queue, and the queue drains
    in FIFO order. Needs no barriers, except for special cases like
    Lamport's algorithm (it's easy to avoid barriers).
    - Relaxed. Loads and stores are not ordered, users have to put in memory
    barriers and hope they did it right.

    So, a highly relaxed memory model is the ONLY model which needs barriers.

    But this isn't true. Real processors aren't anywhere near as wildly
    chaotic as this.

    The common form of relaxed memory we see today is causally consistent
    and multi-copy atomic (and cache coherent). So, all other threads see
    stores to a single location in the same order, and you don't get the extraordinary going-backwards-in-time behaviour of DEC Alpha.

    A relaxed ordering SYSTEM says rather than spending a few thousand
    gates getting ordering right by hardware in the CPU, instead we're
    going to require software to put in some very difficult to
    understand barriers.

    I don't think that's really true. The reorderings we see in
    currently-produced hardware are, more or less, a subset of the same
    reorderings that C compilers perform. Therefore, if you see a confusing
    hardware reordering in a multi-threaded C program it may well be
    (probably is!) a bug (according to the C standard) *even on a TSO
    machine*. The only common counter-example to this is for volatile accesses.
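
    To make that concrete, here is a minimal sketch (the names and the use()
    helper are hypothetical, not from the post): with a plain, racy flag the
    compiler may hoist the load out of the loop, so the program breaks even
    on an x86/TSO machine; declaring the flag std::atomic (with any memory
    order) forbids that transformation.

    extern void use(int);      // hypothetical consumer of the data

    int ready = 0;             // plain int: this is a data race
    int payload = 0;

    void producer() {
        payload = 42;
        ready = 1;
    }

    void consumer() {
        while (ready == 0) { } // compiler may hoist the load -> spins forever
        use(payload);          // and payload may still be observed as 0
    }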

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*



    A very simple thought exercise shows even if Sequential Consistency
    and/or TSO were slower (and I maintain they are not), but even if
    you believe that, a Relaxed Ordering system WILL be slower than TSO
    or Sequential for workloads which often use barriers (instructions
    tagged with acquire/release are barriers). In a Relaxed ordering
    system, the barriers will not be as efficient as the automatic
    barriers of TSO/SC (otherwise, why not just do that?),

    Whyever not? They do the same thing.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Kent Dickey on Mon Nov 13 13:36:37 2023
    On 11/12/2023 9:54 PM, Kent Dickey wrote:
    In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
    MitchAlsup <mitchalsup@aol.com> wrote:
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads.

    I know a lot of people believe that statement to be true. In general, it >>> is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a performance advantage
    when the code is single threaded.

    I believe a Relaxed Memory model provides a small performance improvement ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
    there's nothing to order).

    Relazed Memory ordering provides approximately zero performance improvement to an OoO CPU, and in fact, might actually lower performance (depends on
    how barriers are done--if done poorly, it could be a big negative).

    Yes, the system designers of the world have said: let's slow down our
    fastest most expensive most profitable CPUs, so we can speed up our cheapest lowest profit CPUs a few percent, and push a ton of work onto software developers.

    It's crazy.

    I believe that statement to be false. Can you describe some of these
    workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are accessing >> data.

    Probably need to clarify with "accessing modified data".

    Kent

    Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
    exist? I must be misunderstanding your point here. Sorry if I am. ;^o

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Mon Nov 13 13:31:27 2023
    On 11/13/2023 11:29 AM, MitchAlsup wrote:
    EricP wrote:

    Kent Dickey wrote:
    In article <uirqj3$9q9q$1@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 4:20 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain
    workloads.
    I know a lot of people believe that statement to be true.  In >>>>>>> general, it
    is assumed to be true without proof.

    I believe that statement to be false.  Can you describe some of >>>>>>> these
    workloads?
    Also, think about converting any sound lock-free algorithm's
    finely tuned memory barriers to _all_ sequential consistency...
    That would ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather
    than the right answer later. There are very few algorithms with
    this property.
    That does not make any sense to me. Think of a basic mutex. It
    basically requires an acquire membar for the lock and a release
    membar for the unlock. On SPARC that would be:

    acquire = MEMBAR #LoadStore | #LoadLoad

    release = MEMBAR #LoadStore | #StoreStore

    Okay, fine. However, if I made them sequentially consistent, it
    would require a damn #StoreLoad barrier for both acquire and
    release. This is not good and should be avoided when possible.
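
    In C++11 terms the same acquire-on-lock / release-on-unlock pairing looks
    roughly like this (an illustrative sketch, not code from the thread):

    #include <atomic>

    class spin_lock {
        std::atomic_flag f = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // acquire: later loads/stores cannot move above taking the lock
            while (f.test_and_set(std::memory_order_acquire)) { /* spin */ }
        }
        void unlock() {
            // release: earlier loads/stores cannot move below dropping the lock
            f.clear(std::memory_order_release);
        }
    };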

    Also, RCU prides itself with not having to use any memory barriers
    for its read side. If RCU was forced to use a seq cst, basically
    LOCKED RMW or MFENCE on Intel, it would completely ruin its
    performance.
    [...]
    < I suspect SUN lost significant performance by always running TSO and
    it still required barrier instructions.
    [...]

    Intel still requires an explicit membar for hazard pointers as-is. SPARC
    in TSO mode still requires a membar for this. SPARC needs a #StoreLoad
    for the store-followed-by-a-load-to-another-location ordering to
    hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
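
    The pattern meant here is roughly the following (a sketch with assumed
    names, not code from the post): the reader publishes its hazard pointer
    and must then re-check the shared pointer, and it is that
    store-then-load-of-another-location ordering which needs a full
    #StoreLoad-style fence even under TSO.

    #include <atomic>

    struct node;
    std::atomic<node*> shared;   // pointer being protected (assumed global)
    std::atomic<node*> hp;       // this thread's hazard-pointer slot (assumed)

    node* acquire_hazard() {
        node* p;
        do {
            p = shared.load(std::memory_order_relaxed);
            hp.store(p, std::memory_order_relaxed);
            // The store to hp must be globally visible before we re-load
            // shared: a store->load ordering, so even TSO/x86 needs MFENCE
            // or a LOCKed RMW here (SPARC: MEMBAR #StoreLoad).
            std::atomic_thread_fence(std::memory_order_seq_cst);
        } while (p != shared.load(std::memory_order_relaxed));
        return p;    // safe to use until hp is cleared
    }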

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Chris M. Thomasson on Tue Nov 14 10:25:22 2023
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in currently-
    produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware
    reordering in a multi-threaded C program it may well be (probably is!)
    a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your
    concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    (For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's
    Algorithm, which needs something stronger.)
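
    For instance, a data-race-free handoff written with std::atomic is
    correct on both kinds of machine, whereas the volatile version of the
    flag would only happen to work on TSO (a minimal sketch, names invented):

    #include <atomic>

    int payload;                        // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        ready.store(true, std::memory_order_release);  // publish payload
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        // acquire pairs with the release above: payload is guaranteed to be 42
        int v = payload;
        (void)v;
    }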

    Because of this, the assertion that programming a non-TSO machine is
    "harder" doesn't IMO stand up, at least in C programs, because the
    same data-race bugs can manifest themselves as either compiler
    optimizations or hardware reorderings. And a compiler can, at least in
    theory, do things that are far weirder than any memory system
    does.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to aph@littlepinkcloud.invalid on Tue Nov 14 12:44:44 2023
    On 11/14/2023 2:25 AM, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in currently-
    produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware
    reordering in a multi-threaded C program it may well be (probably is!)
    a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough.

    Well, I think I was being a bit too dense here and missed your main
    point. Sorry.


    If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your concurrent program will be fine on both TSO and relaxed memory
    ordering.

    Agreed.


    If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    (For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's Algorithm, which needs something stronger.)

    Indeed, it does.


    Because of this, the assertion that programming a non-TSO machine is
    "harder" doesn't IMO stand up, at least in C programs, because the
    same data-race bugs can manifest themselves as either compiler
    optimizations or hardware reorderings. And a compiler can, at least in theory, can do things that are far weirder than any memory system
    does.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Branimir Maksimovic@21:1/5 to Chris M. Thomasson on Wed Nov 15 04:10:08 2023
    On 2023-11-13, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 11:29 AM, MitchAlsup wrote:
    EricP wrote:

    Kent Dickey wrote:
    In article <uirqj3$9q9q$1@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 4:20 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain
    workloads.
    I know a lot of people believe that statement to be true.  In >>>>>>>> general, it
    is assumed to be true without proof.

    I believe that statement to be false.  Can you describe some of >>>>>>>> these
    workloads?
    Also, think about converting any sound lock-free algorithm's
    finely tuned memory barriers to _all_ sequential consistency...
    That would ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather
    than the right answer later. There are very few algorithms with
    this property.
    That does not make any sense to me. Think of a basic mutex. It
    basically requires an acquire membar for the lock and a release
    membar for the unlock. On SPARC that would be:

    acquire = MEMBAR #LoadStore | #LoadLoad

    release = MEMBAR #LoadStore | #StoreStore

    Okay, fine. However, if I made them sequentially consistent, it
    would require a damn #StoreLoad barrier for both acquire and
    release. This is not good and should be avoided when possible.

    Also, RCU prides itself with not having to use any memory barriers
    for its read side. If RCU was forced to use a seq cst, basically
    LOCKED RMW or MFENCE on Intel, it would completely ruin its
    performance.
    [...]
    < I suspect SUN lost significant performance by always running TSO and
    it still required barrier instructions.
    [...]

    Intel still requires an explicit membar for hazard pointers as-is. Sparc
    in TSO mode still requires a membar for this. Spard needs a #StoreLoad
    wrt the store followed by a load to another location relationship to
    hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
    I think that the Apple M1 requires it, too. It has problems without a membar.


    --

    7-77-777, Evil Sinner! https://www.linkedin.com/in/branimir-maksimovic-6762bbaa/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Wed Nov 15 18:32:40 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on ><http://www.complang.tuwien.ac.at/anton/memdep/>.

    I have now added parameter combinations for detecting something that I currently call memory renaming: If the microarchitecture can overlap independent accesses to the same memory location (analogous to
    register renaming).

    Does anybody know if there is a commonly-used term for this feature?
    I would also be interested in a commonly-used term for bypassing the
    store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
    Lake).

    Concerning memory renaming, it turns out that Intel added it in
    Nehalem, and AMD added it between K10 and Excavator (probably in the Bulldozer). ARM has not implemented this in any microarchitecture I
    have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
    P-core), but not in Icestorm (M1 E-Core).

    One interesting benefit of memory renaming is that when you are bad at
    register allocation, and keep local variables in memory (or virtual
    machine registers, or virtual machine stack items), independent
    computations that use the same local variable still can execute in
    parallel. Say, stuff like

    a = b[1]; c = a+1;
    a = b[2]; d = a-1;

    The hardware can execute them in parallel, even if a is located in
    memory.
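
    Spelled out a little more (an illustrative sketch only; volatile is just
    a way to force the temporary to live in memory rather than a register):

    void indep(long *b, long *c, long *d)
    {
        volatile long t;      /* t gets a memory slot, not a register */
        t = b[1]; *c = t + 1; /* first computation through t */
        t = b[2]; *d = t - 1; /* independent computation reusing t's slot */
        /* With memory renaming the two can overlap despite sharing t. */
    }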

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Nov 15 19:05:42 2023
    Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I have written a microbenchmark for measuring how memory dependencies >>affect the performance of various microarchitectures. You can find it >>along with a description and results on >><http://www.complang.tuwien.ac.at/anton/memdep/>.

    I have now added parameter combinations for detecting something that I currently call memory renaming: If the microarchitecture can overlap independent accesses to the same memory location (analogous to
    register renaming).

    Does anybody know if there is a commonly-used term for this feature?
    <
    I use the term Conditional Cache; I have heard others call it a Memory
    Reorder Buffer.
    <
    I would also be interested in a commonly-used term for bypassing the
    store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
    Lake).
    <
    Store-to-Load-Forwarding is the only term I know here.
    <
    Concerning memory renaming, it turns out that Intel added it in
    Nehalem, and AMD added it between K10 and Excavator (probably in the Bulldozer). ARM has not implemented this in any microarchitecture I
    have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
    P-core), but not in Icestorm (M1 E-Core).

    One interesting benefit of memory renaming is that when you are bad at register allocation,
    <
    Or bad at alias analysis...
    <
    < and keep local variables in memory (or virtual
    machine registers, or virtual machine stack items), independent
    computations that use the same local variable still can execute in
    parallel. Say, stuff like

    a = b[1]; c = a+1;
    a = b[2]; d = a-1;

    The hardware can execute them in parallel, even if a is located in
    memory.
    <
    And why not, they are independent containers.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to aph@littlepinkcloud.invalid on Wed Nov 15 20:32:16 2023
    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in currently-
    produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware
    reordering in a multi-threaded C program it may well be (probably is!)
    a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your >concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded accesses are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that
    be??), then there is no problem.

    If you assume you've already solved the problem, then you find the
    problem is solved! Magic!

    What I'm arguing is: the CPU should behave as if memory_order_seq_cst
    is set on all accesses with no special trickery. This acquire/release
    nonsense is all weakly ordered brain damage. The problem is that on weakly
    ordered CPUs, performance definitely does matter in terms of getting this
    stuff right, but that's their problem. Being weakly ordered makes them
    slower when they have to execute barriers for correctness, but it's the
    barriers themselves that are the slowdown, not ordering the requests
    properly.

    If the CPU takes ownership of ordering, then the only rule is: you just
    have to use atomic properly (even then, you can often get away with
    volatile for most producer/consumer cases), and these subtypes for all
    accesses don't matter for correctness or performance.

    It also would be nice if multithreaded programs could be written in C99,
    or pre-C++11. It's kinda surprising we've only been able to write threaded programs for about 10 years.

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Kent Dickey on Wed Nov 15 12:42:13 2023
    On 11/15/2023 12:32 PM, Kent Dickey wrote:
    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in currently- >>>> produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware
    reordering in a multi-threaded C program it may well be (probably is!) >>>> a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your
    concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded accesses are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that be??), then there is no problem.

    If you assume you've already solved the problem, then you find the
    problem is solved! Magic!

    What I'm arguing is: the CPU should behave as if memory_order_seq_cst
    is set on all accesses with no special trickery. This acquire/release nonsense is all weakly ordered brain damage.

    Huh? Just because you have a problem with memory ordering does not mean
    that all std::memory_order_* should default to seq_cst! You are being
    radically obtuse here. It is as if you have no idea what you are
    writing about.

    [...]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Kent Dickey on Wed Nov 15 20:56:39 2023
    Kent Dickey wrote:

    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in currently- >>>> produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware
    reordering in a multi-threaded C program it may well be (probably is!) >>>> a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your >>concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded accesses are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that be??), then there is no problem.

    If you assume you've already solved the problem, then you find the
    problem is solved! Magic!

    What I'm arguing is: the CPU should behave as if memory_order_seq_cst
    is set on all accesses with no special trickery.
    <
    Should appear as if.....not behave as if.
    <
    This acquire/release nonsense is all weakly ordered brain damage.
    <
    Agreed ...
    <
    The problem is on weakly ordered CPUs, performance definitely does matter in terms of getting this stuff right, but that's their problem.
    <
    Not necessarily !! If you have a weakly ordered machine and start doing
    something ATOMIC, the processor can switch to sequential consistency upon
    detecting the ATOMIC event starting (LL for example), stay SC until
    the ATOMIC event is done, then revert back to weakly ordered as it pleases.
    The single-threaded code gets its performance while the multithreaded code
    gets its SC (as Lamport demonstrated was necessary).
    <
    Being weakly ordered makes them slower when they have to execute barriers for correctness, but it's the barriers themselves that are the slow down, not ordering the requests properly.
    <
    But you see, doing it my way gets rid of the MemBar instructions but not
    their necessary effects. In addition, in my model, every access within
    an ATOMIC event is SC, not just a MemBar at the front and end.
    <
    If the CPU takes ownership of ordering, then the only rule is: you just
    have to use atomic properly (even then, you can often get away with
    volatile for most producer/consumer cases), and these subtypes for all accesses don't matter for correctness or performance.
    <
    But a CPU is not in a position to determine Memory Order; the (multiplicity
    of) memory controllers do. The CPUs just have to figure out how to put up
    with the imposed order based on the kind of things the CPU is doing at that
    instant.
    <
    It also would be nice if multithreaded programs could be written in C99,
    or pre-C++11. It's kinda surprising we've only been able to write threaded programs for about 10 years.
    <
    I wrote multithreaded programs on an 8085.....
    <
    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Wed Nov 15 13:00:06 2023
    On 11/15/2023 12:56 PM, MitchAlsup wrote:
    Kent Dickey wrote:

    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
     <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in currently- >>>>> produced hardware are, more or less, a subset of the same reorderings >>>>> that C compilers perform. Therefore, if you see a confusing hardware >>>>> reordering in a multi-threaded C program it may well be (probably is!) >>>>> a bug (according to the C standard) *even on a TSO machine*. The only >>>>> common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your
    concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that
    be??), then there is no problem.

    If you assume you've already solved the problem, then you find the
    problem is solved!  Magic!

    What I'm arguing is: the CPU should behave as if memory_order_seq_cst
    is set on all accesses with no special trickery.
    <
    Should appear as if.....not behave as if.
    <
                                                     This acquire/release
    nonsense is all weakly ordered brain damage.
    <
    Agreed ...
    <

    Why do you "seem" to think its brain damage? Knowing how to use them
    properly is a good thing.



                                                  The problem is on weakly
    ordered CPUs, performance definitely does matter in terms of getting this
    stuff right, but that's their problem.
    <
    Not necessarily !! If you have a weakly ordered machine and start doing something ATOMIC, the processor can switch to sequential consistency upon
    the detection of the ATOMIC event starting (LL for example) stay SC until
    the ATOMIC even is done, then revert back to weakly ordered as it pleases. The single threaded code gets is performance while the multithreaded code gets is SC (as Lamport demonstrated was necessary).
    <
                                            Being weakly ordered makes them
    slower when they have to execute barriers for correctness, but it's the
    barriers themselves that are the slow down, not ordering the requests
    properly.
    <
    But You see, doing it my way gets rid of the MemBar instructions but not their necessary effects. In addition, in my model, every access within
    an ATOMIC event is SC not just a MemBar at the front and end.
    <
    If the CPU takes ownership of ordering, then the only rule is: you just
    have to use atomic properly (even then, you can often get away with
    volatile for most producer/consumer cases), and these subtypes for all
    accesses don't matter for correctness or performance.
    <
    But a CPU is not in a position to determine Memory Order, the
    (multiplicity of) memory controllers do, the CPUs just have to figure
    out how to put up with
    the imposed order based on the kind of things the CPU is doing at that instant.
    <
    It also would be nice if multithreaded programs could be written in C99,
    or pre-C++11.  It's kinda surprising we've only been able to write
    threaded
    programs for about 10 years.
    <
    I wrote multithreaded programs on an 8085.....
    <
    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to Chris M. Thomasson on Wed Nov 15 21:09:30 2023
    In article <uiu4t5$t4c2$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 9:54 PM, Kent Dickey wrote:
    In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
    MitchAlsup <mitchalsup@aol.com> wrote:
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads. >>>
    I know a lot of people believe that statement to be true. In general, it >>>> is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a performance advantage
    when the code is single threaded.

    I believe a Relaxed Memory model provides a small performance improvement
    ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
    there's nothing to order).

    Relazed Memory ordering provides approximately zero performance improvement >> to an OoO CPU, and in fact, might actually lower performance (depends on
    how barriers are done--if done poorly, it could be a big negative).

    Yes, the system designers of the world have said: let's slow down our
    fastest most expensive most profitable CPUs, so we can speed up our cheapest >> lowest profit CPUs a few percent, and push a ton of work onto software
    developers.

    It's crazy.

    I believe that statement to be false. Can you describe some of these
    workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are accessing
    data.

    Probably need to clarify with "accessing modified data".

    Kent

    Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even >exist? I must be misunderstanding you point here. Sorry if I am. ;^o

    You have internalized weakly ordered memory, and you're having trouble
    seeing beyond it.

    CPUs with weakly ordered memory are the ones that need all those flags.
    Yes, you need the flags if you want to use those CPUs. I'm pointing out:
    we could all just require better memory ordering and get rid of all this
    cruft. Give the flag, don't give the flag, the program is still correct
    and works properly.

    It's like FP denorms--it's generally been decided the hardware cost
    to implement it is small, so hardware needs to support it at full speed.
    No need to write code in a careful way to avoid denorms, to use funky
    CPU-specific calls to turn on flush-to-0, etc., it just works, we move on to
    other topics. But we still have flush-to-0 calls available--but you don't
    need to bother to use them. In my opinion, memory ordering is much more complex for programmers to handle. I maintain it's actually so
    complex most people cannot get it right in software for non-trivial interactions. I've found many hardware designers have a very hard time reasoning about this as well when I report bugs (since the rules are so
    complex and poorly described). There are over 100 pages describing memory
    ordering in the Arm Architecture Reference Manual, and it is very
    complex (Dependency through registers and memory; Basic Dependency;
    Address Dependency; Data Dependency; Control Dependency; Pick Basic
    dependency; Pick Address Dependency; Pick Data Dependency; Pick
    Control Dependency, Pick Dependency...and this is just from the definition
    of terms). It's all very abstract and difficult to follow. I'll be
    honest: I do not understand all of these rules, and I don't care to.
    I know how to implement a CPU, so I know what they've done, and that's
    much simpler to understand. But writing a threaded application is much
    more complex than it should be for software.

    The cost to do TSO is that some out-of-order tracking structures need to get
    a little bigger, and some instructions have to stay in queues longer
    (which is why they may need to get bigger), and allow re-issuing loads
    which now have stale data. The difference between TSO and Sequential Consistency is to just disallow loads seeing stores queued before they
    write to the data cache (well, you can speculatively let loads happen,
    but you need to be able to walk it back, which is not difficult). This
    is why I say the performance cost is low--normal code missing caches and
    not being pestered by other CPUs can run at the same speed. But when
    other CPUs begin pestering us, the interference can all be worked out as efficiently as possible using hardware, and barriers just do not
    compete.

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Kent Dickey on Wed Nov 15 13:25:30 2023
    On 11/15/2023 1:09 PM, Kent Dickey wrote:
    In article <uiu4t5$t4c2$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 9:54 PM, Kent Dickey wrote:
    In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
    MitchAlsup <mitchalsup@aol.com> wrote:
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain workloads. >>>>
    I know a lot of people believe that statement to be true. In general, it >>>>> is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a performance advantage
    when the code is single threaded.

    I believe a Relaxed Memory model provides a small performance improvement >>> ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
    there's nothing to order).

    Relazed Memory ordering provides approximately zero performance improvement >>> to an OoO CPU, and in fact, might actually lower performance (depends on >>> how barriers are done--if done poorly, it could be a big negative).

    Yes, the system designers of the world have said: let's slow down our
    fastest most expensive most profitable CPUs, so we can speed up our cheapest
    lowest profit CPUs a few percent, and push a ton of work onto software
    developers.

    It's crazy.

    I believe that statement to be false. Can you describe some of these >>>>> workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are accessing
    data.

    Probably need to clarify with "accessing modified data".

    Kent

    Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even
    exist? I must be misunderstanding you point here. Sorry if I am. ;^o

    You have internalized weakly ordered memory, and you're having trouble
    seeing beyond it.

    Really? Don't project yourself on me. Altering all of the memory
    barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.



    CPUs with weakly ordered memory are the ones that need all those flags.
    Yes, you need the flags if you want to use those CPUs. I'm pointing out:
    we could all just require better memory ordering and get rid of all this cruft. Give the flag, don't give the flag, the program is still correct
    and works properly.

    Huh? Just cruft? Wow. Just because it seems hard for you does not mean
    we should eliminate it. Believe it or not, there are people out there
    who know how to use memory barriers. I suppose you would use seq_cst to
    load each node of a lock-free stack iteration in an RCU read-side region.
    This is terrible! Really bad, bad, BAD! Afaict, it kinda, sorta, seems
    like you do not have all that much experience with them. Humm...



    It's like FP denorms--it's generally been decided the hardware cost
    to implement it is small, so hardware needs to support it at full speed.
    No need to write code in a careful way to avoid denorms, to use funky CPU- specific calls to turn on flush-to-0, etc., it just works, we move on to other topics. But we still have flush-to-0 calls available--but you don't need to bother to use them. In my opinion, memory ordering is much more complex for programmers to handle. I maintain it's actually so
    complex most people cannot get it right in software for non-trivial interactions. I've found many hardware designers have a very hard time reasoning about this as well when I report bugs (since the rules are so complex and poorly described). There are over 100 pages describing memory ordering in the Arm Architectureal Reference Manual, and it is very
    complex (Dependency through registers and memory; Basic Dependency;
    Address Dependency; Data Dependency; Control Dependency; Pick Basic dependency; Pick Address Dependency; Pick Data Dependency; Pick
    Control Dependency, Pick Dependency...and this is just from the definition
    of terms). It's all very abstract and difficult to follow. I'll be
    honest: I do not understand all of these rules, and I don't care to.
    I know how to implement a CPU, so I know what they've done, and that's
    much simpler to understand. But writing a threaded application is much
    more complex than it should be for software.

    The cost to do TSO is some out-of-order tracking structures need to get
    a little bigger, and some instructions have to stay in queues longer
    (which is why they may need to get bigger), and allow re-issuing loads
    which now have stale data. The difference between TSO and Sequential Consistency is to just disallow loads seeing stores queued before they
    write to the data cache (well, you can speculatively let loads happen,
    but you need to be able to walk it back, which is not difficult). This
    is why I say the performance cost is low--normal code missing caches and
    not being pestered by other CPUs can run at the same speed. But when
    other CPUs begin pestering us, the interference can all be worked out as efficiently as possible using hardware, and barriers just do not
    compete.

    Having access to fine-grained memory barriers is a very good thing. Of
    course we can use C++ right now and make everything seq_cst, but that is
    moronic. Why would you want to use seq_cst everywhere when you do not
    have to? There are rather massive performance implications.

    Are you thinking about a magic arch that we cannot use right now?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Wed Nov 15 13:27:59 2023
    On 11/15/2023 1:25 PM, Chris M. Thomasson wrote:
    On 11/15/2023 1:09 PM, Kent Dickey wrote:
    In article <uiu4t5$t4c2$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 9:54 PM, Kent Dickey wrote:
    In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
    MitchAlsup <mitchalsup@aol.com> wrote:
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain
    workloads.

    I know a lot of people believe that statement to be true.  In
    general, it
    is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a performance
    advantage
    when the code is single threaded.

    I believe a Relaxed Memory model provides a small performance
    improvement
    ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
    there's nothing to order).

    Relazed Memory ordering provides approximately zero performance
    improvement
    to an OoO CPU, and in fact, might actually lower performance
    (depends on
    how barriers are done--if done poorly, it could be a big negative).

    Yes, the system designers of the world have said: let's slow down our
    fastest most expensive most profitable CPUs, so we can speed up our
    cheapest
    lowest profit CPUs a few percent, and push a ton of work onto software >>>> developers.

    It's crazy.

    I believe that statement to be false.  Can you describe some of these >>>>>> workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are
    accessing
    data.

    Probably need to clarify with "accessing modified data".

    Kent

    Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even >>> exist? I must be misunderstanding you point here. Sorry if I am. ;^o

    You have internalized weakly ordered memory, and you're having trouble
    seeing beyond it.

    Really? Don't project yourself on me. Altering all of the memory
    barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.



    CPUs with weakly ordered memory are the ones that need all those flags.
    Yes, you need the flags if you want to use those CPUs.  I'm pointing out: >> we could all just require better memory ordering and get rid of all this
    cruft.  Give the flag, don't give the flag, the program is still correct
    and works properly.

    Huh? Just cruft? wow. Just because it seems hard for you does not mean
    we should eliminate it. Believe it or not there are people out there
    that know how to use memory barriers. I suppose you would use seq_cst to
    load each node of a lock-free stack iteration in a RCU read-side region.
    This is terrible! Realy bad, bad, BAD! Afaicvt, it kind a, sort a, seems
    like you do not have all that much experience with them. Humm...



    It's like FP denorms--it's generally been decided the hardware cost
    to implement it is small, so hardware needs to support it at full speed.
    No need to write code in a careful way to avoid denorms, to use funky
    CPU-
    specific calls to turn on flush-to-0, etc., it just works, we move on to
    other topics.  But we still have flush-to-0 calls available--but you
    don't
    need to bother to use them.  In my opinion, memory ordering is much more
    complex for programmers to handle.  I maintain it's actually so
    complex most people cannot get it right in software for non-trivial
    interactions.  I've found many hardware designers have a very hard time
    reasoning about this as well when I report bugs (since the rules are so
    complex and poorly described).  There are over 100 pages describing
    memory
    ordering in the Arm Architectureal Reference Manual, and it is very
    complex (Dependency through registers and memory; Basic Dependency;
    Address Dependency; Data Dependency; Control Dependency; Pick Basic
    dependency; Pick Address Dependency; Pick Data Dependency; Pick
    Control Dependency, Pick Dependency...and this is just from the
    definition
    of terms).  It's all very abstract and difficult to follow.  I'll be
    honest: I do not understand all of these rules, and I don't care to.
    I know how to implement a CPU, so I know what they've done, and that's
    much simpler to understand.  But writing a threaded application is much
    more complex than it should be for software.

    The cost to do TSO is some out-of-order tracking structures need to get
    a little bigger, and some instructions have to stay in queues longer
    (which is why they may need to get bigger), and allow re-issuing loads
    which now have stale data.  The difference between TSO and Sequential
    Consistency is to just disallow loads seeing stores queued before they
    write to the data cache (well, you can speculatively let loads happen,
    but you need to be able to walk it back, which is not difficult).  This
    is why I say the performance cost is low--normal code missing caches and
    not being pestered by other CPUs can run at the same speed.  But when
    other CPUs begin pestering us, the interference can all be worked out as
    efficiently as possible using hardware, and barriers just do not
    compete.

    Having access to fine grain memory barriers is a very good thing. Of
    course we can use C++ right now and make everything seq_cst, but that is moronic. Why would you want to use seq_cst everywhere when you do not
    have to? There are rather massive performance implications.

    Are you thinking about a magic arch that we cannot use right now?

    https://youtu.be/DZJPqTTt7MA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Wed Nov 15 13:35:41 2023
    On 11/15/2023 1:00 PM, Chris M. Thomasson wrote:
    On 11/15/2023 12:56 PM, MitchAlsup wrote:
    Kent Dickey wrote:

    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
     <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in
    currently-
    produced hardware are, more or less, a subset of the same reorderings >>>>>> that C compilers perform. Therefore, if you see a confusing hardware >>>>>> reordering in a multi-threaded C program it may well be (probably
    is!)
    a bug (according to the C standard) *even on a TSO machine*. The only >>>>>> common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your
    concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that >>> be??), then there is no problem.

    If you assume you've already solved the problem, then you find the
    problem is solved!  Magic!

    What I'm arguing is: the CPU should behave as if memory_order_seq_cst
    is set on all accesses with no special trickery.
    <
    Should appear as if.....not behave as if.
    <
                                                     This acquire/release
    nonsense is all weakly ordered brain damage.
    <
    Agreed ...
    <

    Why do you "seem" to think its brain damage? Knowing how to use them
    properly is a good thing.



                                                  The problem is on weakly
    ordered CPUs, performance definitely does matter in terms of getting
    this
    stuff right, but that's their problem.
    <
    Not necessarily !! If you have a weakly ordered machine and start doing
    something ATOMIC, the processor can switch to sequential consistency upon
    the detection of the ATOMIC event starting (LL for example) stay SC until
    the ATOMIC even is done, then revert back to weakly ordered as it
    pleases.
    The single threaded code gets is performance while the multithreaded code
    gets is SC (as Lamport demonstrated was necessary).
    <
                                            Being weakly ordered makes them
    slower when they have to execute barriers for correctness, but it's the
    barriers themselves that are the slow down, not ordering the requests
    properly.

    Huh? Avoiding as many memory barriers as possible (aka... ideally
    relaxed) is key to making sync algorithms faster! Or using an acquire
    instead of seq_cst is great. Only use seq_cst when you absolutely have to.


    <
    But You see, doing it my way gets rid of the MemBar instructions but not
    their necessary effects. In addition, in my model, every access within
    an ATOMIC event is SC not just a MemBar at the front and end.
    <
    If the CPU takes ownership of ordering, then the only rule is: you just
    have to use atomic properly (even then, you can often get away with
    volatile for most producer/consumer cases), and these subtypes for all
    accesses don't matter for correctness or performance.
    <
    But a CPU is not in a position to determine Memory Order, the
    (multiplicity of) memory controllers do, the CPUs just have to figure
    out how to put up with
    the imposed order based on the kind of things the CPU is doing at that
    instant.
    <
    It also would be nice if multithreaded programs could be written in C99, >>> or pre-C++11.  It's kinda surprising we've only been able to write
    threaded
    programs for about 10 years.
    <
    I wrote multithreaded programs on an 8085.....
    <
    Kent


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Wed Nov 15 13:41:06 2023
    On 11/15/2023 1:35 PM, Chris M. Thomasson wrote:
    On 11/15/2023 1:00 PM, Chris M. Thomasson wrote:
    On 11/15/2023 12:56 PM, MitchAlsup wrote:
    Kent Dickey wrote:

    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
     <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in
    currently-
    produced hardware are, more or less, a subset of the same
    reorderings
    that C compilers perform. Therefore, if you see a confusing hardware >>>>>>> reordering in a multi-threaded C program it may well be (probably >>>>>>> is!)
    a bug (according to the C standard) *even on a TSO machine*. The >>>>>>> only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_* >>>>>
    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your >>>>> concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should
    that
    be??), then there is no problem.

    If you assume you've already solved the problem, then you find the
    problem is solved!  Magic!

    What I'm arguing is: the CPU should behave as if memory_order_seq_cst
    is set on all accesses with no special trickery.
    <
    Should appear as if.....not behave as if.
    <
                                                     This acquire/release
    nonsense is all weakly ordered brain damage.
    <
    Agreed ...
    <

    Why do you "seem" to think its brain damage? Knowing how to use them
    properly is a good thing.



                                                  The problem is on weakly
    ordered CPUs, performance definitely does matter in terms of getting
    this
    stuff right, but that's their problem.
    <
    Not necessarily !! If you have a weakly ordered machine and start doing
    something ATOMIC, the processor can switch to sequential consistency
    upon
    the detection of the ATOMIC event starting (LL for example) stay SC
    until
    the ATOMIC even is done, then revert back to weakly ordered as it
    pleases.
    The single threaded code gets is performance while the multithreaded
    code
    gets is SC (as Lamport demonstrated was necessary).
    <
                                            Being weakly ordered makes them
    slower when they have to execute barriers for correctness, but it's the >>>> barriers themselves that are the slow down, not ordering the requests
    properly.

    Huh? Avoiding as many memory barriers as possible (aka... ideally
    relaxed) is key to making sync algorithms faster! Or using an acquire
    instead of seq_cst is great. Only use seq_cst when you absolutely have to.


    <
    But You see, doing it my way gets rid of the MemBar instructions but not >>> their necessary effects. In addition, in my model, every access within
    an ATOMIC event is SC not just a MemBar at the front and end.
    <
    If the CPU takes ownership of ordering, then the only rule is: you just >>>> have to use atomic properly (even then, you can often get away with
    volatile for most producer/consumer cases), and these subtypes for all >>>> accesses don't matter for correctness or performance.
    <
    But a CPU is not in a position to determine Memory Order, the
    (multiplicity of) memory controllers do, the CPUs just have to figure
    out how to put up with
    the imposed order based on the kind of things the CPU is doing at
    that instant.
    <
    It also would be nice if multithreaded programs could be written in
    C99,
    or pre-C++11.  It's kinda surprising we've only been able to write
    threaded
    programs for about 10 years.
    <
    I wrote multithreaded programs on an 8085.....
    <
    Kent



    pseudo-code:

    std::atomic<node*> m_node = create_stack();



    // Read-Copy Update (RCU)
    // reader thread iteration...

    node* current = m_node.load(std::memory_order_relaxed);
    while (current)
    {
        node* next = current->m_next.load(std::memory_order_relaxed);
        compute(current);
        current = next;
    }


    Now, you are trying to tell me that I should use seq_cst instead of
    those relaxed memory barriers for the iteration? That would DESTROY performance, big time... Really BAD!
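
    For concreteness, here is a minimal sketch of the writer side that such
    a reader loop assumes (the push() helper, the payload field, and the
    single-writer assumption are illustrative only, not from the pseudo-code
    above). The release store on the head is what lets the readers get away
    with relaxed, dependency-ordered loads on everything except DEC Alpha:

    #include <atomic>

    struct node {
        std::atomic<node*> m_next{nullptr};
        int m_value{0};
    };

    std::atomic<node*> m_node{nullptr};   // list head, as above

    // Hypothetical writer (single writer, or writers serialized elsewhere):
    // fully initialize the node, then publish it with a release store so
    // its fields are visible before the pointer to it is.
    void push(node* n, int value)
    {
        n->m_value = value;                                      // init payload first
        n->m_next.store(m_node.load(std::memory_order_relaxed),
                        std::memory_order_relaxed);              // link old head
        m_node.store(n, std::memory_order_release);              // publish
    }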


    Listen to all of this:

    https://youtu.be/DZJPqTTt7MA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Wed Nov 15 13:56:31 2023
    On 11/15/2023 1:41 PM, Chris M. Thomasson wrote:
    On 11/15/2023 1:35 PM, Chris M. Thomasson wrote:
    On 11/15/2023 1:00 PM, Chris M. Thomasson wrote:
    On 11/15/2023 12:56 PM, MitchAlsup wrote:
    Kent Dickey wrote:

    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
     <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in
    currently-
    produced hardware are, more or less, a subset of the same
    reorderings
    that C compilers perform. Therefore, if you see a confusing
    hardware
    reordering in a multi-threaded C program it may well be
    (probably is!)
    a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your
    concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should
    that
    be??), then there is no problem.

    If you assume you've already solved the problem, then you find the
    problem is solved!  Magic!

    What I'm arguing is: the CPU should behave as if memory_order_seq_cst
    is set on all accesses with no special trickery.
    <
    Should appear as if.....not behave as if.
    <
                                                     This acquire/release
    nonsense is all weakly ordered brain damage.
    <
    Agreed ...
    <

    Why do you "seem" to think its brain damage? Knowing how to use them
    properly is a good thing.



                                                  The problem is on weakly
    ordered CPUs, performance definitely does matter in terms of
    getting this
    stuff right, but that's their problem.
    <
    Not necessarily !! If you have a weakly ordered machine and start doing
    something ATOMIC, the processor can switch to sequential consistency
    upon detecting that the ATOMIC event has started (LL, for example),
    stay SC until the ATOMIC event is done, then revert back to weakly
    ordered as it pleases. The single-threaded code gets its performance
    while the multithreaded code gets its SC (as Lamport demonstrated was
    necessary).
    <
                                            Being weakly ordered makes
    them
    slower when they have to execute barriers for correctness, but it's
    the
    barriers themselves that are the slow down, not ordering the requests
    properly.

    Huh? Avoiding as many memory barriers as possible (aka... ideally
    relaxed) is key to making sync algorithms faster! Or using an acquire
    instead of seq_cst is great. Only use seq_cst when you absolutely have
    to.


    <
    But you see, doing it my way gets rid of the MemBar instructions but
    not their necessary effects. In addition, in my model, every access
    within an ATOMIC event is SC, not just a MemBar at the front and end.
    <
    If the CPU takes ownership of ordering, then the only rule is: you
    just have to use atomic properly (even then, you can often get away
    with volatile for most producer/consumer cases), and these subtypes
    for all accesses don't matter for correctness or performance.
    <
    But a CPU is not in a position to determine Memory Order, the
    (multiplicity of) memory controllers do, the CPUs just have to
    figure out how to put up with
    the imposed order based on the kind of things the CPU is doing at
    that instant.
    <
    It also would be nice if multithreaded programs could be written in
    C99,
    or pre-C++11.  It's kinda surprising we've only been able to write
    threaded
    programs for about 10 years.
    <
    I wrote multithreaded programs on an 8085.....
    <
    Kent



    pseudo-code:

    std::atomic<node*> m_node = create_stack();



    // Read-Copy Update (RCU)
    // reader thread iteration...

    node* current = n_node.load(std::memory_order_relaxed);
    while (current)
    {
       node* next = current->m_next.load(std::memory_order_relaxed);
       compute(current);
       current = next;
    }


    Now, you are trying to tell me that I should use seq_cst instead of
    those relaxed memory barriers for the iteration? That would DESTROY performance, big time... Really BAD!

    Heck, if I knew I was on a DEC Alpha and depending on the writer side of
    the algorithm I was using, std::memory_order_consume would be in order.



    Listen to all of this:

    https://youtu.be/DZJPqTTt7MA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Nov 15 17:40:15 2023
    Are you thinking about a magic arch that we cannot use right now?

    Yes, he is, obviously.
    So when you say "it's bad", please tell us why.

    We know it would run slow on existing CPUs, that's not the question.
    The question is: why would it be impossible or very hard to
    make a CPU that could execute such code efficiently.

    I suspect there can be very valid reasons, maybe for the same kinds of
    reasons why some systems allow nested transactions (e.g. when you have
    a transaction with two calls to `gensym`: it doesn't matter whether the
    two calls really return consecutive symbols (as would be guaranteed if
    the code were truly run atomically), all that matters is that those
    symbols are unique).

    So maybe with sequential consistency, there could be some forms of
    parallelism which we'd completely disallow, whereas a weaker form of consistency would allow it. I'm having a hard time imagining what it
    could be, tho.

    BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
    their big supercomputers, no?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Stefan Monnier on Wed Nov 15 15:01:42 2023
    On 11/15/2023 2:40 PM, Stefan Monnier wrote:
    Are you thinking about a magic arch that we cannot use right now?

    Yes, he is, obviously.
    So when you say "it's bad", please tell us why.

    Oh, shit. This is the main point of my confusion. When I say it's bad, I
    am referring to using std::memory_order_seq_cst all over the place on _existing_ architectures.



    We know it would run slow on existing CPUs, that's not the question.
    The question is: why would it be impossible or very hard to
    make a CPU that could execute such code efficiently.

    I suspect there can be a very valid reasons, maybe for the same kinds of reasons why some systems allow nested transactions (e.g. when you have
    a transaction with two calls to `gensym`: it doesn't matter whether the
    two calls really return consecutive symbols (as would be guaranteed if
    the code were truly run atomically), all that matters is that those
    symbols are unique).

    So maybe with sequential consistency, there could be some forms of parallelism which we'd completely disallow, whereas a weaker form of consistency would allow it. I'm having a hard time imagining what it
    could be, tho.

    BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
    their big supercomputers, no?

    Offering sequential consistency and mandating it are different things.
    We can all get sequential consistency just by using
    std::memory_order_seq_cst all of the time. If an arch can be created
    that is 100% sequentially consistent and beats out existing finely
    tuned algorithms, like RCU-based lock-free algorithms, then I would
    love to see it.

    As of now, wrt our current archs, say, iterating a lock-free stack in
    a RCU read-side region using seq_cst is horrible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Wed Nov 15 15:15:10 2023
    On 11/15/2023 3:01 PM, Chris M. Thomasson wrote:
    On 11/15/2023 2:40 PM, Stefan Monnier wrote:
    Are you thinking about a magic arch that we cannot use right now?

    Yes, he is, obviously.
    So when you say "it's bad", please tell us why.

    Oh, shit. This is the main point of my confusion. When I say its bad, I
    am referring to using std::memory_order_seq_cst all over the place on _existing_ architectures.



    We know it would run slow on existing CPUs, that's not the question.
    The question is: why would it be impossible or very hard to
    make a CPU that could execute such code efficiently.

    I suspect there can be a very valid reasons, maybe for the same kinds of
    reasons why some systems allow nested transactions (e.g. when you have
    a transaction with two calls to `gensym`: it doesn't matter whether the
    two calls really return consecutive symbols (as would be guaranteed if
    the code were truly run atomically), all that matters is that those
    symbols are unique).

    So maybe with sequential consistency, there could be some forms of
    parallelism which we'd completely disallow, whereas a weaker form of
    consistency would allow it.  I'm having a hard time imagining what it
    could be, tho.

    BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
    their big supercomputers, no?

    Offering sequential consistency and mandating it are different things.
    We can all get sequential consistency just by using
    std::memory_order_seq_cst all of the time. If an arch can be created
    that is 100% sequential consistency and beat out existing finely tuned algorithms, like RCU based lock-free algorithms, then, I would love to
    see it.

    As of now, wrt are current arch's, say, iterating a lock-free stack in a
    RCU read side region using seq_cst is horrible.

    Just because programming sync algorithms on a weakly ordered arch can be difficult for some people does not mean we have to get rid of it
    altogether...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Wed Nov 15 23:32:12 2023
    Chris M. Thomasson wrote:

    pseudo-code:

    std::atomic<node*> m_node = create_stack();



    // Read-Copy Update (RCU)
    // reader thread iteration...

    node* current = n_node.load(std::memory_order_relaxed);
    while (current)
    {
    node* next = current->m_next.load(std::memory_order_relaxed);
    compute(current);
    current = next;
    }


    Now, you are trying to tell me that I should use seq_cst instead of
    those relaxed memory barriers for the iteration? That would DESTROY performance, big time... Really BAD!
    <
    The scan of the concurrent data structure's linked list does not have
    to be atomic, or even ordered. It is only once you have identified an
    element you want exclusive access to that sequential consistency is
    needed.


    Listen to all of this:

    https://youtu.be/DZJPqTTt7MA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Nov 15 18:39:28 2023
    Just because programming sync algorithms on a weakly ordered arch can be difficult for some people does not mean we have to get rid of it altogether...

    No, but it all depends on the hardware cost.
    SGI's big multiprocessors (starting with their R10K processor) obeyed sequential consistency and AFAIK that did not prevent them from providing top-notch performance.

    So if it's not terribly costly, it may end up being *faster* because the
    memory barriers become noops (not to mention the fact that performance engineers can spend time on other things).

    I suspect Intel and others did look into it at some point and ended up
    deciding it's not actually faster, but I agree with Kent that the
    complexity cost for programmers might not really be warranted and that
    maybe we'd all be better off if CPU manufacturers followed SGI's lead.

    According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
    the cost of supporting SC can be around 6% using tricks known in 2012.
    I'd be surprised if it can't be brought further down.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Wed Nov 15 15:54:00 2023
    On 11/15/2023 3:32 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    pseudo-code:

    std::atomic<node*> m_node = create_stack();



    // Read-Copy Update (RCU)
    // reader thread iteration...

    node* current = n_node.load(std::memory_order_relaxed);
    while (current)
    {
        node* next = current->m_next.load(std::memory_order_relaxed);
        compute(current);
        current = next;
    }


    Now, you are trying to tell me that I should use seq_cst instead of
    those relaxed memory barriers for the iteration? That would DESTROY
    performance, big time... Really BAD!
    <
    The scan of the concurrent data structure's linked list does not have
    to be atomic, or even ordered.

    It has to use atomic loads and stores. However, RCU makes it so we do
    not need to use any memory barriers (DEC Alpha aside for a moment)
    while iterating it.


    It is only once you have identified an
    element you want exclusive access to that sequential consistency is
    needed.

    Even a mutex does not need sequential consistency.

    Dekker aside for a moment because it depends on a store followed by a
    load to another location. TSO cannot even handle this without a membar.
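
    To make the mutex point concrete, here is a minimal spinlock sketch
    (mine, purely illustrative): lock() needs only acquire and unlock() only
    release, which correspond to the MEMBAR #LoadLoad|#LoadStore and
    MEMBAR #LoadStore|#StoreStore combinations quoted earlier in the thread;
    no #StoreLoad / seq_cst is required anywhere in it.

    #include <atomic>

    class spinlock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            // acquire: critical-section accesses cannot move above this
            while (locked.exchange(true, std::memory_order_acquire))
                ;   // spin (a real lock would back off or yield here)
        }
        void unlock() {
            // release: critical-section accesses cannot move below this
            locked.store(false, std::memory_order_release);
        }
    };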





    Listen to all of this:

    https://youtu.be/DZJPqTTt7MA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Stefan Monnier on Wed Nov 15 16:03:31 2023
    On 11/15/2023 3:39 PM, Stefan Monnier wrote:
    Just because programming sync algorithms on a weakly ordered arch can be
    difficult for some people does not mean we have to get rid of it
    altogether...

    No, but it all depends on the hardware cost.
    SGI's big multiprocessors (starting with their R10K processor) obeyed sequential consistency and AFAIK that did not prevent them from providing top-notch performance.

    So if it's not terribly costly, it may end up being *faster* because the memory barriers become noops (not to mention the fact that performance engineers can spend time on other things).

    I suspect Intel and others did look into it at some point and ended up deciding it's not actually faster, but I agree with Kent that the
    complexity cost for programmers might not really be warranted and that
    maybe we'd all be better off if CPU manufacturers followed SGI's lead.

    According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
    the cost of supporting SC can be around 6% using tricks known in 2012.
    I'd be surprised if it can't be brought further down.

    Actually, for some reason this line of thinking reminds me of a funny
    scene in the movie Dirty Rotten Scoundrels, where they had to put corks
    on the forks of Ruprecht (Steve Martin) to prevent him from hurting
    himself. So, basically, Ruprecht would be the programmer and the
    Michael Caine character would be the arch designer thinking that
    relaxed models are too complex and all programmers are mainly morons:

    https://youtu.be/SKDX-qJaJ08

    https://youtu.be/SKDX-qJaJ08?t=33

    Afaict, std::memory_order_consume was added to support RCU-compatible
    algorithms.
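
    A minimal sketch of what that consume-based read side looks like (my
    illustration; in practice current compilers promote memory_order_consume
    to acquire, but the intent is that the data dependency alone orders the
    dereference everywhere except DEC Alpha):

    #include <atomic>

    struct node { int payload; std::atomic<node*> m_next; };
    std::atomic<node*> head{nullptr};

    int read_front()
    {
        // Dependency-ordered load: the dereference below carries a data
        // dependency on n, so no extra barrier is needed on most machines.
        node* n = head.load(std::memory_order_consume);
        return n ? n->payload : -1;
    }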

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Wed Nov 15 15:56:50 2023
    On 11/15/2023 3:54 PM, Chris M. Thomasson wrote:
    On 11/15/2023 3:32 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    pseudo-code:

    std::atomic<node*> m_node = create_stack();



    // Read-Copy Update (RCU)
    // reader thread iteration...

    node* current = n_node.load(std::memory_order_relaxed);
    while (current)
    {
        node* next = current->m_next.load(std::memory_order_relaxed);
        compute(current);
        current = next;
    }


    Now, you are trying to tell me that I should use seq_cst instead of
    those relaxed memory barriers for the iteration? That would DESTROY
    performance, big time... Really BAD!
    <
    The scan of the concurrent data structure's linked list does not have
    to be atomic, or even ordered.

    It has to use atomic loads and stores. However, RCU makes it so we do
    not need to use any memory barriers (dec alpha aside for a moment) while iterating it.


    It is only once you have identified an
    element you want exclusive access to that sequential consistency is
    needed.

    Even a mutex does not need sequential consistency.

    Dekker aside for a moment because it depends on a store followed by a
    load to another location. TSO cannot even handle this without a membar.





    Listen to all of this:

    https://youtu.be/DZJPqTTt7MA


    Afaict, std::memory_order_consume was added to support RCU compatible algorithms.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Nov 15 19:37:01 2023
    Actually, for some reason this line of thinking reminds me of the following
    funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
    put corks on the forks of Ruprecht (Steve Martin), to prevent him from
    hurting himself. So, basically, Ruprecht would be the programmer and the
    Micheal Caine character would be the arch designer thinking that relaxed
    models are too complex, and all programmers are mainly morons:

    Of course, cognitive dissonance would bite those people who have invested efforts into learning about all the intricacies of non-sequential
    memory models.

    But if we can make SC's efficiency sufficiently close to that of TSO,
    the benefit could be significant for all those people who have not
    invested such efforts.

    The evolution of computer programming is littered with steps that reduce performance in exchange for a cleaner programming model.

    You don't need to be a moron to be baffled by the complexity of relaxed
    memory models.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Kent Dickey on Wed Nov 15 18:09:28 2023
    On 11/15/2023 5:41 PM, Kent Dickey wrote:
    In article <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Are you thinking about a magic arch that we cannot use right now?

    Yes, he is, obviously.
    So when you say "it's bad", please tell us why.

    We know it would run slow on existing CPUs, that's not the question.
    The question is: why would it be impossible or very hard to
    make a CPU that could execute such code efficiently.

    I suspect there can be a very valid reasons, maybe for the same kinds of
    reasons why some systems allow nested transactions (e.g. when you have
    a transaction with two calls to `gensym`: it doesn't matter whether the
    two calls really return consecutive symbols (as would be guaranteed if
    the code were truly run atomically), all that matters is that those
    symbols are unique).

    So maybe with sequential consistency, there could be some forms of
    parallelism which we'd completely disallow, whereas a weaker form of
    consistency would allow it. I'm having a hard time imagining what it
    could be, tho.

    BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
    their big supercomputers, no?


    Stefan

    Every HP PA-RISC system, from workstation to the largest server, were sequentially consistent and needed no barriers. It was not a problem,
    I never thought it was even considered any sort of performance issue.
    Once you decide to support it, you just throw some hardware at it,
    and you're done.

    Since the original HP PA-RISC MP designs were sequentially consistent, all the implementations afterward kept it up since no one wanted any existing code to break. The architects defined the architecture to allow weak ordering, but no implementation (by HP at least) did so. These
    architects then went on to IA-64, where it really is weak, since IA64 was
    in order, so it has more of a payoff there since IA64 didn't want to spend hardware on this (they had 50 other things to waste it on), and IA-64 is
    full of bad ideas and ideas just not implemented well.

    Sun was TSO, which is weaker. [...]

    Sun had RMO mode as well. Avoiding #StoreLoad membars was a must; only
    use them when you absolutely have to.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to monnier@iro.umontreal.ca on Thu Nov 16 01:41:56 2023
    In article <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Are you thinking about a magic arch that we cannot use right now?

    Yes, he is, obviously.
    So when you say "it's bad", please tell us why.

    We know it would run slow on existing CPUs, that's not the question.
    The question is: why would it be impossible or very hard to
    make a CPU that could execute such code efficiently.

    I suspect there can be very valid reasons, maybe for the same kinds of
    reasons why some systems allow nested transactions (e.g. when you have
    a transaction with two calls to `gensym`: it doesn't matter whether the
    two calls really return consecutive symbols (as would be guaranteed if
    the code were truly run atomically), all that matters is that those
    symbols are unique).

    So maybe with sequential consistency, there could be some forms of
    parallelism which we'd completely disallow, whereas a weaker form of
    consistency would allow it. I'm having a hard time imagining what it
    could be, tho.

    BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
    their big supercomputers, no?


    Stefan

    Every HP PA-RISC system, from workstation to the largest server, was sequentially consistent and needed no barriers. It was not a problem,
    I never thought it was even considered any sort of performance issue.
    Once you decide to support it, you just throw some hardware at it,
    and you're done.

    Since the original HP PA-RISC MP designs were sequentially consistent, all
    the implementations afterward kept it up since no one wanted any existing
    code to break. The architects defined the architecture to allow weak
    ordering, but no implementation (by HP at least) did so. These
    architects then went on to IA-64, where it really is weak, since IA64 was
    in order, so it has more of a payoff there since IA64 didn't want to spend hardware on this (they had 50 other things to waste it on), and IA-64 is
    full of bad ideas and ideas just not implemented well.

    Sun was TSO, which is weaker. Sun was never a performance champion,
    other than by throwing the most cores at a parallel problem. So being
    TSO relative to Sequential Consistency didn't seem to buy Sun much. DEC
    Alpha was the poster child of weakly ordered (so weak they didn't maintain single CPU consistent ordering with itself) and it was often a performance champion on tech workloads, but that had more to do with their
    much higher clock frequencies, and that edge went away once out-of-order
    took off. DEC Alpha was never a big player in TPC-C workloads, where
    everybody made their money in the 90's. Technical computing is nice and
    fun, but there was not a lot of profit in it compared to business workloads
    in the 90s. IA64's reason to exist was to run SGI/SUN/DEC out of business,
    and it effectively did (while hurting HP about as much).

    I don't know if SGI was sequentially consistent. It's possible, since
    software developed for other systems might have pushed them to support it,
    but the academic RISC folks were pretty big on weakly ordered.

    A problem with weakly ordered is no implementation is THAT weakly
    ordered. To maximize doing things in a bad order requires "unlucky" cache misses, and these are just not common in practice. So "Store A; Store B"
    often appears to be done in that order with no barriers on weakly ordered systems. It's hard to feel confident you've written anything complex
    right, so most algorithms are kept relatively simple to make it more likely they are well tested.

    HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
    using just load and store. I forget the details, but it was something
    like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
    your byte to "iwant", then read the word. If everyone else is 0, then
    set your byte to "iwon", then read the word. If everyone else is still
    0, you've won, do what you want. And if you see other bytes getting
    set, then you move to a backoff algorithm to determine the winner (you
    have 256 states you can move through). Note that when you release the
    lock, you can pick the next winner with a single store. What I
    understood was this was actually faster than a simple compare-and-swap
    since it let software immediately see if there was contention, and move
    to a back-off algorithm right away (and you can see who's contending,
    and deal with that as well). Spinlocks tend to lead to cache
    invalidation storms, and it's hard to tune, but this was much more
    tuneable. It scaled to 64-CPUs by doing the lock in two steps, and
    moving to a 64-bit word.
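
    A rough reconstruction of that lock, for flavor only: Kent says he has
    forgotten the details, so the state encoding, the per-byte atomics
    standing in for "read the 4-byte word", and the simple retry in place
    of the 256-state backoff are all guesses. The point is that it uses
    only loads and stores, and it is correct only because the machine is
    sequentially consistent: the store of "iwant" is ordered before the
    loads that check the other CPUs' bytes.

    #include <atomic>
    #include <cstdint>

    enum : uint8_t { FREE = 0, IWANT = 1, IWON = 2 };

    struct byte_lock {
        std::atomic<uint8_t> slot[4] {};   // one byte per CPU, packed in one word

        bool others_clear(int cpu) const {
            for (int i = 0; i < 4; ++i)
                if (i != cpu && slot[i].load(std::memory_order_seq_cst) != FREE)
                    return false;
            return true;
        }

        void lock(int cpu) {
            for (;;) {
                slot[cpu].store(IWANT, std::memory_order_seq_cst);     // announce
                if (others_clear(cpu)) {
                    slot[cpu].store(IWON, std::memory_order_seq_cst);  // claim
                    if (others_clear(cpu))
                        return;                                        // we won
                }
                slot[cpu].store(FREE, std::memory_order_seq_cst);      // concede and retry
                // (the real thing walks a backoff state machine instead)
            }
        }

        void unlock(int cpu) {
            // Kent notes the releaser can also pick the next winner with
            // a single store; here we simply free our byte.
            slot[cpu].store(FREE, std::memory_order_seq_cst);
        }
    };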

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Stefan Monnier on Wed Nov 15 18:21:04 2023
    On 11/15/2023 4:37 PM, Stefan Monnier wrote:
    Actually, for some reason this line of thinking reminds me of the following
    funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
    put corks on the forks of Ruprecht (Steve Martin), to prevent him from
    hurting himself. So, basically, Ruprecht would be the programmer and the
    Micheal Caine character would be the arch designer thinking that relaxed
    models are too complex, and all programmers are mainly morons:

    Of course, cognitive dissonance would bite those people who have invested efforts into learning about all the intricacies of non-sequential
    memory models.

    But if we can make SC's efficiency sufficiently close to that of TSO,
    the benefit could be significant for all those people who have not
    invested such efforts.

    The evolution of computer programming is littered with steps that reduce performance in exchange for a cleaner programming model.

    You don't need to be a moron to be baffled by the complexity of relaxed memory models.

    Mainly, the kernel guys have to know all about them in order to try to
    squeeze as much performance as they can from a given arch. Lock/Wait
    free algorithms are used quite a bit. Creating them is never easy,
    relaxed model or not.

    Fwiw, I like the SPARC model where one could put the processor in RMO or
    TSO mode. I think there was another mode, but I cannot remember it right
    now. A SEQ mode should be doable in SPARC. However, it would incur a performance penalty.

    The funny thing is that in lock/wait-free programming, we always look
    for the weakest barriers that can be used, and still be correct. So, if
    everything was suddenly seq_cst with much better performance on a new
    magic arch, well, I would need to check it out for sure!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Thu Nov 16 14:54:49 2023
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2023 4:37 PM, Stefan Monnier wrote:
    Actually, for some reason this line of thinking reminds me of the following
    funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
    put corks on the forks of Ruprecht (Steve Martin), to prevent him from
    hurting himself. So, basically, Ruprecht would be the programmer and the
    Micheal Caine character would be the arch designer thinking that relaxed
    models are too complex, and all programmers are mainly morons:

    Of course, cognitive dissonance would bite those people who have invested
    efforts into learning about all the intricacies of non-sequential
    memory models.

    But if we can make SC's efficiency sufficiently close to that of TSO,
    the benefit could be significant for all those people who have not
    invested such efforts.

    The evolution of computer programming is littered with steps that reduce
    performance in exchange for a cleaner programming model.

    You don't need to be a moron to be baffled by the complexity of relaxed
    memory models.

    Mainly, the kernel guys have to know all about them in order to try to
    squeeze as much performance as they can from a given arch. Lock/Wait
    free algorithms are used quite a bit. Creating them is never easy,
    relaxed model or not.

    We ran into an issue with Linux circa 2006 or thereabouts where
    accessing an skb (network stack buffer) required adding a barrier
    (AMD64 processor). Details are fuzzy two decades later....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Scott Lurndal on Thu Nov 16 14:07:16 2023
    On 11/16/2023 6:54 AM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2023 4:37 PM, Stefan Monnier wrote:
    Actually, for some reason this line of thinking reminds me of the following
    funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
    put corks on the forks of Ruprecht (Steve Martin), to prevent him from >>>> hurting himself. So, basically, Ruprecht would be the programmer and the
    Micheal Caine character would be the arch designer thinking that relaxed
    models are too complex, and all programmers are mainly morons:

    Of course, cognitive dissonance would bite those people who have invested >>> efforts into learning about all the intricacies of non-sequential
    memory models.

    But if we can make SC's efficiency sufficiently close to that of TSO,
    the benefit could be significant for all those people who have not
    invested such efforts.

    The evolution of computer programming is littered with steps that reduce >>> performance in exchange for a cleaner programming model.

    You don't need to be a moron to be baffled by the complexity of relaxed
    memory models.

    Mainly, the kernel guys have to know all about them in order to try to
    squeeze as much performance as they can from a given arch. Lock/Wait
    free algorithms are used quite a bit. Creating them is never easy,
    relaxed model or not.

    We ran into an issue with linux circa 2006 or thereabouts where
    there was an issue accessing an skb (networks stack buffer) that
    required adding a barrier (AMD64 processor). Details are fuzzy two decades later....


    I know that RCU is being used for dynamic routing tables, and a lot more
    in the Linux kernel. I cannot remember if the skb used it, I bet it did.
    For some damn reason, this makes me think of eieio on the PPC...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Kent Dickey on Fri Nov 17 09:03:24 2023
    Kent Dickey <kegs@provalid.com> wrote:
    In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>, <aph@littlepinkcloud.invalid> wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:

    I don't think that's really true. The reorderings we see in currently-
    produced hardware are, more or less, a subset of the same reorderings
    that C compilers perform. Therefore, if you see a confusing hardware
    reordering in a multi-threaded C program it may well be (probably is!)
    a bug (according to the C standard) *even on a TSO machine*. The only
    common counter-example to this is for volatile accesses.

    Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*

    Maybe I wasn't clear enough. If you use std::atomic and
    std::memory_order_* in such a way that there are no data races, your
    concurrent program will be fine on both TSO and relaxed memory
    ordering. If you try to fix data races with volatile instead of
    std::atomic and std::memory_order_*, that'll mostly fix things on a
    TSO machine, but not on a machine with relaxed memory ordering.

    What you are saying is:

    As long as you fully analyze your program, ensure all multithreaded
    accesses are only through atomic variables, and you label every
    access to an atomic variable properly (although my point is: exactly
    what should that be??), then there is no problem.

    Well, this is definitely true. But it's not exactly a plan: in
    practice, people use careful synchronization boundaries and immutable
    data structures.

    What I'm arguing is: the CPU should behave as if
    memory_order_seq_cst is set on all accesses with no special
    trickery.

    What I'm saying is: that isn't sufficient if you are using an
    optimizing compiler. And if you are programming for an optimizing
    compiler you have to follow the rules anyway. And the optimizing
    compiler can reorder stores and loads as much, if not more, than the
    hardware does.

    This acquire/release nonsense is all weakly ordered brain
    damage. The problem is on weakly ordered CPUs, performance
    definitely does matter in terms of getting this stuff right, but
    that's their problem. Being weakly ordered makes them slower when
    they have to execute barriers for correctness, but it's the barriers themselves that are the slow down, not ordering the requests
    properly.

    How is that? All the barriers do is enforce the ordering.

    Take the Apple M1 as an example. It has a TSO mode bit. It also has TSO
    stores and loads, intended for when TSO mode is turned off. Are you
    saying that the TSO stores and loads use a different mechanism to
    enforce ordering from the one used when TSO is on by default?

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Nov 17 10:44:45 2023
    As long as you fully analyze your program, ensure all multithreaded accesses are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell. This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Fri Nov 17 18:37:17 2023
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded accesses >> are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that
    be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell. This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


    Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, let's say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;

        if( TRUE )
        {
            fp->next = fn;
            fn->prev = fp;
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            fr->next = tn;
            return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data
        esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;                // move the bits around
            fn->prev = fp;
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            esmLOCK( fr->next = tn );
            return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs
    to get the job(s) done.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Fri Nov 17 11:21:13 2023
    On 11/17/2023 10:37 AM, MitchAlsup wrote:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that >>> be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, lets say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data
        esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
        esmLOCK( fr->next = tn );
                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
    the job(s) done.

    Interesting. Any chance of live lock wrt esmINTERFERENCE() ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Fri Nov 17 11:22:31 2023
    On 11/17/2023 10:37 AM, MitchAlsup wrote:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that >>> be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, lets say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data
        esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;


        esmLOCK( fr->next = tn );
    ^^^^^^^^^^^^^^^^^^^^^^^^

    Why perform an esmLOCK here? Please correct my confusion.

                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
    the job(s) done.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Branimir Maksimovic on Fri Nov 17 11:31:53 2023
    On 11/14/2023 8:10 PM, Branimir Maksimovic wrote:
    On 2023-11-13, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/13/2023 11:29 AM, MitchAlsup wrote:
    EricP wrote:

    Kent Dickey wrote:
    In article <uirqj3$9q9q$1@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 4:20 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/12/2023 2:33 PM, Kent Dickey wrote:
    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain >>>>>>>>>> workloads.
    I know a lot of people believe that statement to be true.  In >>>>>>>>> general, it
    is assumed to be true without proof.

    I believe that statement to be false.  Can you describe some of >>>>>>>>> these
    workloads?
    Also, think about converting any sound lock-free algorithm's
    finely tuned memory barriers to _all_ sequential consistency...
    That would ruin performance right off the bat... Think about it.
    <
    Assuming you are willing to accept the wrong answer fast, rather
    than the right answer later. There are very few algorithms with
    this property.
    That does not make any sense to me. Think of a basic mutex. It
    basically requires an acquire membar for the lock and a release
    membar for the unlock. On SPARC that would be:

    acquire = MEMBAR #LoadStore | #LoadLoad

    release = MEMBAR #LoadStore | #StoreStore

    Okay, fine. However, if I made them sequentially consistent, it
    would require a damn #StoreLoad barrier for both acquire and
    release. This is not good and should be avoided when possible.

    Also, RCU prides itself with not having to use any memory barriers
    for its read side. If RCU was forced to use a seq cst, basically
    LOCKED RMW or MFENCE on Intel, it would completely ruin its
    performance.
    [...]
    <
    I suspect SUN lost significant performance by always running TSO and
    it still required barrier instructions.
    [...]

    Intel still requires an explicit membar for hazard pointers as-is. Sparc
    in TSO mode still requires a membar for this. Sparc needs a #StoreLoad
    wrt the store followed by a load to another location relationship to
    hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
    I think that the Apple M1 requires it, too. It has problems without a membar.



    Afaict, any TSO model needs an explicit barrier for the store followed
    by a load to another location relationship to hold. If the program does
    not use this relationship, then it does not need to use them on TSO.
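
    A small illustration of the hazard-pointer case mentioned above (my own
    sketch, names made up): the hazard slot must be stored and then the
    shared pointer re-loaded, and that store-then-load ordering is exactly
    what TSO does not give for free, hence the LOCK'ed RMW or MFENCE on
    Intel and the MEMBAR #StoreLoad on SPARC (seq_cst in C++ terms):

    #include <atomic>

    std::atomic<void*> shared_ptr{nullptr};    // pointer being protected
    std::atomic<void*> hazard_slot{nullptr};   // this thread's hazard pointer

    void* acquire_hazard()
    {
        void* p;
        do {
            p = shared_ptr.load(std::memory_order_relaxed);
            // Publish the hazard, then re-check the shared pointer. The
            // seq_cst store/load pair is what forces the #StoreLoad order.
            hazard_slot.store(p, std::memory_order_seq_cst);
        } while (p != shared_ptr.load(std::memory_order_seq_cst));
        return p;   // protected until hazard_slot is cleared
    }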

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Nov 17 20:49:08 2023
    Chris M. Thomasson wrote:

    On 11/17/2023 10:37 AM, MitchAlsup wrote:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that >>>> be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, lets say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;


        esmLOCK( fr->next = tn );
    ^^^^^^^^^^^^^^^^^^^^^^^^

    Why perform an esmLOCK here? Please correct my confusion.

    esmLOCK on the last participating Store is what tells the HW that the ATOMIC event is finished. So, the first esmLOCK begins the ATOMIC event and the
    only esmLOCK on an outgoing (ST or PostPush) memory reference ends the event.

                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
    the job(s) done.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 18 00:03:15 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that >>> be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell. This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


    Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, lets say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;



    if( TRUE )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    fr->next = tn;
    return TRUE;
    }
    return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    esmLOCK( fn = fr->next ); // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn ); // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn; // move the bits around
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCK( fr->next = tn );
    return TRUE;
    }
    return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to
    to get the job(s) done.

    That looks suspiciously like transactional memory.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 18 01:03:26 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that >>>> be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell. This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


    Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, lets say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;



    if( TRUE )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    fr->next = tn;
    return TRUE;
    }
    return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    esmLOCK( fn = fr->next ); // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn ); // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn; // move the bits around
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCK( fr->next = tn );
    return TRUE;
    }
    return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to
    to get the job(s) done.

    That looks suspiciously like transactional memory.

    It has some flavors of such, but::
    it has no nesting,
    it has a strict limit of 8 participating cache lines,
    it automagically transfers control when disruptive interference is detected,
    it is subject to timeouts;

    But it does have the property that all interested 3rd parties see
    participating memory only in the before or only in the completely
    after states.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Kent Dickey on Sat Nov 18 21:10:47 2023
    On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
    kegs@provalid.com (Kent Dickey) wrote:

    In article <jwv4jhm4pk7.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Are you thinking about a magic arch that we cannot use right now?

    Yes, he is, obviously.
    So when you say "it's bad", please tell us why.

    We know it would run slow on existing CPUs, that's not the question.
    The question is: why would it be impossible or very hard to
    make a CPU that could execute such code efficiently.

    I suspect there can be a very valid reasons, maybe for the same
    kinds of reasons why some systems allow nested transactions (e.g.
    when you have a transaction with two calls to `gensym`: it doesn't
    matter whether the two calls really return consecutive symbols (as
    would be guaranteed if the code were truly run atomically), all that matters is that those symbols are unique).

    So maybe with sequential consistency, there could be some forms of parallelism which we'd completely disallow, whereas a weaker form of consistency would allow it. I'm having a hard time imagining what it
    could be, tho.

    BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
    their big supercomputers, no?


    Stefan

    Every HP PA-RISC system, from workstation to the largest server, was sequentially consistent and needed no barriers. It was not a problem;
    I never thought it was even considered any sort of performance issue.
    Once you decide to support it, you just throw some hardware at it,
    and you're done.

    Since the original HP PA-RISC MP designs were sequentially
    consistent, all the implementations afterward kept it up since no one
    wanted any existing code to break. The architects defined the
    architecture to allow weak ordering, but no implementation (by HP at
    least) did so. These architects then went on to IA-64, where the ordering
    really is weak; there was more of a payoff there since IA-64 was in-order
    and didn't want to spend hardware on this (they had 50
    other things to waste it on), and IA-64 is full of bad ideas and
    ideas just not implemented well.

    Sun was TSO, which is weaker. Sun was never a performance champion,
    other than by throwing the most cores at a parallel problem. So being
    TSO relative to Sequential Consistency didn't seem to buy Sun much.
    DEC Alpha was the poster child of weak ordering (so weak it didn't even maintain consistent ordering of a single CPU with itself) and it was often
    a performance champion on tech workloads, but that had more to do
    with their much higher clock frequencies, and that edge went away
    once out-of-order took off. DEC Alpha was never a big player in
    TPC-C workloads, where everybody made their money in the 90's.
    Technical computing is nice and fun, but there was not a lot of
    profit in it compared to business workloads in the 90s. IA64's
    reason to exist was to run SGI/SUN/DEC out of business, and it
    effectively did (while hurting HP about as much).

    I don't know if SGI was sequentially consistent.

    AFAIK, Stefan Monnier is correct. SGI MIPS gear was SC.
    Their later gear (Itanium and x86) was not, because of the underlying
    cores.

    It's possible, since
    software developed for other systems might have pushed them to
    support it, but the academic RISC folks were pretty big on weakly
    ordered.

    A problem with weak ordering is that no implementation is THAT weakly
    ordered. To maximize doing things in a bad order requires "unlucky"
    cache misses, and these are just not common in practice. So "Store
    A; Store B" often appears to be done in that order with no barriers
    on weakly ordered systems. It's hard to feel confident you've
    written anything complex right, so most algorithms are kept
    relatively simple to make it more likely they are well tested.

    HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
    using just load and store. I forget the details, but it was something
    like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
    your byte to "iwant", then read the word. If everyone else is 0, then
    set your byte to "iwon", then read the word. If everyone else is
    still 0, you've won, do what you want. And if you see other bytes
    getting set, then you move to a backoff algorithm to determine the
    winner (you have 256 states you can move through). Note that when
    you release the lock, you can pick the next winner with a single
    store. What I understood was this was actually faster than a simple compare-and-swap since it let software immediately see if there was contention, and move to a back-off algorithm right away (and you can
    see who's contending, and deal with that as well). Spinlocks tend to
    lead to cache invalidation storms, and it's hard to tune, but this
    was much more tuneable. It scaled to 64-CPUs by doing the lock in
    two steps, and moving to a 64-bit word.
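
    Something like the following sketch, in C (the IWANT/IWON encodings and the
    names are invented, the 256-state backoff machinery is omitted, and it
    assumes the sequential consistency that PA-RISC provided):

    #define IWANT 1
    #define IWON  2

    typedef union {
        unsigned char byte[4];    /* one byte per CPU */
        unsigned int  word;
    } lockword_t;

    static int others_zero( volatile lockword_t *lk, int me )
    {
        lockword_t snap;
        snap.word = lk->word;     /* one load observes all four bytes */
        for( int i = 0; i < 4; i++ )
            if( i != me && snap.byte[i] != 0 )
                return 0;
        return 1;
    }

    int lock_try_acquire( volatile lockword_t *lk, int me )
    {
        lk->byte[me] = IWANT;                /* announce interest */
        if( !others_zero( lk, me ) )
            { lk->byte[me] = 0; return 0; }  /* contention: caller backs off */
        lk->byte[me] = IWON;                 /* claim the lock */
        if( !others_zero( lk, me ) )
            { lk->byte[me] = 0; return 0; }  /* lost the race: back off */
        return 1;                            /* we won */
    }

    void lock_release( volatile lockword_t *lk, int me )
    {
        lk->byte[me] = 0;   /* or store the next winner's state directly */
    }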

    Kent

    I don't see how a high-end SC core can match the performance of a high-end
    TSO-or-weaker core; the weak point of SC is single-thread performance in
    the case where code encounters a few L1D misses intermixed
    with plenty of L1D hits.
    Of course, I don't know all the tricks that high-end core designers are
    using today, or even 10% of their tricks. So maybe what I consider
    impossible is in fact possible.
    But as a matter of fact, nobody has designed a brand-new high-end SC core
    in this century. The closest were probably the MIPS cores from Scott
    Lurndal's employer Cavium, but I think they were always at least a factor
    of 1.5x slower than the contemporary state of the art in single-thread
    performance, and more commonly a factor of 2+.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Sat Nov 18 13:47:31 2023
    On 11/17/2023 5:03 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded
    accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should
    that
    be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do >>>> that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, lets say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data >>>     esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
        esmLOCK( fr->next = tn );
                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
    get the job(s) done.

    That looks suspiciously like transactional memory.

    Indeed, it does. I am worried about livelock wrt esmINTERFERENCE().


    I has some flavors of such, but::
    it has no nesting,
    it has a strict limit of 8 participating cache lines,
    it automagically transfers control when disruptive interference is
    detected,
    it is subject to timeouts;

    But does have the property that all interested 3rd parties see
    participating
    memory only in the before or only in the completely after states.

    Are you familiar with KCSS? K-Compare Single Swap?

    https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to All on Sat Nov 18 19:44:50 2023
    On 11/15/2023 5:41 PM, Kent Dickey wrote:
    [...]

    Btw, thank you for Kegs! Excellent.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sun Nov 19 19:32:07 2023
    Chris M. Thomasson wrote:

    On 11/17/2023 5:03 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded >>>>>> accesses
    are only through atomic variables, and you label every access to an >>>>>> atomic variable properly (although my point is: exactly what should >>>>>> that
    be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do >>>>> that analysis yourself, but there are programming languages out there >>>>> which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses >>>> which are participating in the ATOMIC event. So, lets say you want to
    move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data >>>>     esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
        esmLOCK( fr->next = tn );
                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
    get the job(s) done.

    That looks suspiciously like transactional memory.

    Indeed, it does. Worried about live lock wrt esmINTERFERENCE().

    esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
    conditional branch where the condition is delivered from the miss
    buffer (where I detect interference wrt participating cache lines.)


    I has some flavors of such, but::
    it has no nesting,
    it has a strict limit of 8 participating cache lines,
    it automagically transfers control when disruptive interference is
    detected,
    it is subject to timeouts;

    But does have the property that all interested 3rd parties see
    participating
    memory only in the before or only in the completely after states.

    Are you familiar with KCSS? K-Compare Single Swap?

    https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

    Easily done:
    esmLOCK( c1 = p1->condition1 );
    esmLOCK( c2 = p2->condition2 );
    ...
    if( c1 == C1 && c2 == C2 && c3 == C3 ... )
    ...
    esmLOCK( some data );

    Esm was designed to allow any known synchronization means (in 2013)
    to be directly implemented in esm either inline or via subroutine
    calls.
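
    For instance, a k-compare-single-swap along the lines of that paper might
    look like this in the esm notation (the function shape, the Word/BOOLEAN
    types, and the choice of four participating locations are illustrative
    assumptions, not part of the ISA definition):

    BOOLEAN kcss4( Word *k1, Word e1, Word *k2, Word e2, Word *k3, Word e3,
                   Word *target, Word expected, Word desired )
    {
        Word v1, v2, v3, t;

        esmLOCK( v1 = *k1 );      // bring the compare-only locations in as
        esmLOCK( v2 = *k2 );      // participating cache lines
        esmLOCK( v3 = *k3 );
        esmLOCK( t  = *target );  // the single location that will be swapped

        if( !esmINTERFERENCE() &&
            v1 == e1 && v2 == e2 && v3 == e3 && t == expected )
        {
            esmLOCK( *target = desired );   // last store commits the event
            return TRUE;
        }
        return FALSE;                       // a compare failed or we were
                                            // interfered with; caller retries
    }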

    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Michael S on Sun Nov 19 19:38:47 2023
    Michael S wrote:

    On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
    kegs@provalid.com (Kent Dickey) wrote:

    Kent

    I don't see how high-end SC core can match performance of high-end TSO-or-weaker core with weak point of SC being single-thread
    performance in the case where code encounters few L1D misses intermixed
    with plenty of L1D hits.

    They cannot--but does it really matter ??

    Back in 1993 I ran an experiment on Mc 88120 where we did not consider
    a memory reference performed until we got a signal back from one of the
    multiple DRAM controllers (compared to fire and forget), and on big
    complicated applications the loss in performance was about 2%. That 2%
    perf loss put us in a position to recover from an ECC failure in
    the transport of write data to DRAM.

    This was still not TSO but it is a relevant data point.

    Of course, I don't know all tricks that high-end core designers are
    using today or even 10% of their tricks. So, may be, what I consider impossible is in fact possible.

    But as a matter of fact nobody designed brand new high-end SC core in
    these century. The closest were probably MIPS cores from Scott
    Lurndal's employee Cavium, but I think they were always at least factor
    of 1.5x slower than contemporary state of the art in single-thread performance and more commonly factor of 2+.


    Right now, nobody is competitive with x86-64 at the high end. You are
    going to need a 5GHz core operating 4 instructions per cycle to be
    within spitting distance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Sun Nov 19 11:47:01 2023
    On 11/19/2023 11:32 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/17/2023 5:03 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all
    multithreaded accesses
    are only through atomic variables, and you label every access to an >>>>>>> atomic variable properly (although my point is: exactly what
    should that
    be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have
    to do
    that analysis yourself, but there are programming languages out there >>>>>> which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be >>>>>> reordered by the compiler, since in that case the compiler is fully >>>>>> aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just >>>>> write the code needed to do the work, and then decorate those accesses >>>>> which are participating in the ATOMIC event. So, lets say you want
    to move an element from one doubly linked list to another place in
    some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code >>>>> is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data >>>>>     esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
        esmLOCK( fr->next = tn );
                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
    get the job(s) done.

    That looks suspiciously like transactional memory.

    Indeed, it does. Worried about live lock wrt esmINTERFERENCE().

    esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a conditional branch where the condition is delivered from the miss
    buffer (where I detect interference wrt participating cache lines.)

    So, can false sharing on a participating cache line make
    esmINTERFERENCE() return true?




    I has some flavors of such, but::
    it has no nesting,
    it has a strict limit of 8 participating cache lines,
    it automagically transfers control when disruptive interference is
    detected,
    it is subject to timeouts;

    But does have the property that all interested 3rd parties see
    participating
    memory only in the before or only in the completely after states.

    Are you familiar with KCSS? K-Compare Single Swap?

    https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

    Easily done:
        esmLOCK( c1 = p1->condition1 );
        esmLOCK( c2 = p2->condition2 );
        ...
        if( c1 == C1 && C2 == C2 && c2 == C3 ... )
            ...
            esmLOCK( some data );

    Esm was designed to allow any known synchronization means (in 2013)
    to be directly implemented in esm either inline or via subroutine
    calls.

    I can see how that would work. The problem is that I am not exactly sure
    how esmINTERFERENCE works internally... Can it detect/prevent live lock?
    I remember hearing from my friend Joe Seigh, who worked at IBM, that
    they had some sort of logic that would prevent live lock in a compare
    and swap wrt their free pool manipulation logic. Iirc, it was somewhat
    related to ABA, hard to remember right now, sorry. I need to find that
    old thread back in comp.programming.threads.




    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Mon Nov 20 00:10:53 2023
    On Sun, 19 Nov 2023 21:17:20 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    Right now, nobody is competitive with x86-64 at the high end. You
    are going to need a 5GHz core operating 4 instructions per cycle to
    be within spitting distance.

    Power10 isn't doing badly at 4 GHz if I read the SPECint values
    right, but it (very probably) cannot compete on performance per
    currency unit.

    You do not read SPECint values right.
    For starters, there are no POWER "speed" submissions at all. All
    published numbers are embarrassingly parallel "rate" scores, where since
    POWER7 POWER has had the huge advantage of 8 hardware threads per core
    vs. all competitors except Oracle having only 1 or 2 threads.
    Since POWER8, IBM applies another trick of calling what is essentially 2
    cores a single core.
    My uneducated guess is that in single-threaded integer performance
    POWER10 should be approximately on par with the best Intel Skylake, i.e.
    below the fastest Intel, AMD and Apple chips from 2021.
    Even more so relative to 2023.
    And by 2023Q4, apart from those three mentioned above, we have Qualcomm
    near the top, and even ARM Inc (Cortex X3), while behind the other four,
    is probably ahead of IBM.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sun Nov 19 21:17:20 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Right now, nobody is competitive with x86-64 at the high end. You are
    going to need a 5GHz core operating 4 instructions per cycle to be
    within spitting distance.

    Power10 isn't doing badly at 4 GHz if I read the SPECint values
    right, but it (very probably) cannot compete on performance per
    currency unit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sun Nov 19 22:23:21 2023
    Chris M. Thomasson wrote:

    On 11/19/2023 11:32 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/17/2023 5:03 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all
    multithreaded accesses
    are only through atomic variables, and you label every access to an >>>>>>>> atomic variable properly (although my point is: exactly what
    should that
    be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have >>>>>>> to do
    that analysis yourself, but there are programming languages out there >>>>>>> which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be >>>>>>> reordered by the compiler, since in that case the compiler is fully >>>>>>> aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just >>>>>> write the code needed to do the work, and then decorate those accesses >>>>>> which are participating in the ATOMIC event. So, lets say you want >>>>>> to move an element from one doubly linked list to another place in >>>>>> some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code >>>>>> is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data
        esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
        esmLOCK( fr->next = tn );
                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event >>>>>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to to >>>>>> get the job(s) done.

    That looks suspiciously like transactional memory.

    Indeed, it does. Worried about live lock wrt esmINTERFERENCE().

    esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
    conditional branch where the condition is delivered from the miss
    buffer (where I detect interference wrt participating cache lines.)

    So, can false sharing on a participating cache line make
    esmINTERFERENCE() return true?




    I has some flavors of such, but::
    it has no nesting,
    it has a strict limit of 8 participating cache lines,
    it automagically transfers control when disruptive interference is
    detected,
    it is subject to timeouts;

    But does have the property that all interested 3rd parties see
    participating
    memory only in the before or only in the completely after states.

    Are you familiar with KCSS? K-Compare Single Swap?

    https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

    Easily done:
        esmLOCK( c1 = p1->condition1 );
        esmLOCK( c2 = p2->condition2 );
        ...
        if( c1 == C1 && C2 == C2 && c2 == C3 ... )
            ...
            esmLOCK( some data );

    Esm was designed to allow any known synchronization means (in 2013)
    to be directly implemented in esm either inline or via subroutine
    calls.

    I can see how that would work. The problem is that I am not exactly sure
    how esmINTERFERENCE works internally... Can it detect/prevent live lock?

    esmINTERFERENCE is a branch-on-interference instruction. This is a
    conditional branch instruction that queries whether any of the
    participating cache lines has seen a read-with-intent or
    coherent-invalidate. In effect the branch logic reaches out to the miss
    buffer and asks if any of the participating cache lines has been
    snooped-for-write:: and you can't do this in 2 instructions or you lose
    ATOMICity.

    If it has, control is transferred and the ATOMIC event is failed.
    If it has not, and all participating cache lines are present, then this
    core is allowed to NAK all requests to those participating cache lines
    {and control is not transferred}.

    So, you gain control over where flow goes on failure, and essentially
    commit the whole event to finish.

    So, while it does not eliminate live/dead-lock situations, it allows SW
    to be constructed to avoid them:: WHY is a value (returned by esmWHY)
    which is provided when an ATOMIC event fails. 0 means success, negative
    values are spurious (buffer overflows, ...), while positives represent
    the number of competing threads; so the following case skips elements
    on a linked list to decrease future interference.

    Element* getElement( unSigned Key )
    {
        int count = 0;
        for( p = structure.head; p ; p = p->next )
        {
            if( p->Key == Key )
            {
                if( count-- < 0 )
                {
                    esmLOCK( p );
                    prev = p->prev;
                    esmLOCK( prev );
                    next = p->next;
                    esmLOCK( next );
                    if( !esmINTERFERENCE() )
                    {
                        prev->next = next;          // unlink p from the list
                        next->prev = prev;
                        p->prev = NULL;
                        esmLOCK( p->next = NULL );
                        return p;
                    }
                    else
                    {
                        count = esmWHY();
                        p = structure.head;
                    }
                }
            }
        }
        return NULL;
    }

    Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
    that happens when a timer goes off and n threads all want access to the
    work queue down to BigO( 3 ), yes, constant; in practice it is reduced
    to BigO( ln( n ) ) when requesters arrive in random order at random times.

    I remember hearing from my friend Joe Seigh, who worked at IBM, that
    they had some sort of logic that would prevent live lock in a compare
    and swap wrt their free pool manipulation logic. Iirc, it was somewhat related to ABA, hard to remember right now, sorry. I need to find that
    old thread back in comp.programming.threads.

    Depending on system size: there can be several system function units that
    grant "order" for ATOMIC events. These are useful for 64+node systems
    and unnecessary for less than 8-node systems. Disjoint memory spaces
    can use independent ATOMIC arbiters and whether they are in use or not is invisible to SW.


    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Mon Nov 20 07:34:09 2023
    Michael S <already5chosen@yahoo.com> writes:
    My uneducated guess is that in sigle-threaded integer performance
    POWER10 should be approximately on par with the best Intel Skylake

    Slightly more educated:

    gforth-fast onebench.fs (development version of Gforth), lower is better:

    sieve bubble matrix fib fft version; machine
    0.075 0.099 0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
    0.061 0.090 0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz; gcc-10.1

    Another benchmark I can run quickly is the LaTeX benchmark <http://www.complang.tuwien.ac.at/anton/latex-bench/>:

    On the Power10:

    487.14 msec task-clock # 0.998 CPUs utilized
    3 context-switches # 6.158 /sec
    1 cpu-migrations # 2.053 /sec
    590 page-faults # 1.211 K/sec
    1897813538 cycles # 3.896 GHz
    4411333480 instructions # 2.32 insn per cycle
    542280135 branches # 1.113 G/sec
    11170586 branch-misses # 2.06% of all branches

    0.488029554 seconds time elapsed

    0.477731000 seconds user
    0.010164000 seconds sys

    On the Core i5-6600K (Skylake):

    396.04 msec task-clock # 0.999 CPUs utilized
    2 context-switches # 0.005 K/sec
    0 cpu-migrations # 0.000 K/sec
    7091 page-faults # 0.018 M/sec
    1584116718 cycles # 4.000 GHz
    3015599019 instructions # 1.90 insn per cycle
    564834472 branches # 1426.217 M/sec
    8230928 branch-misses # 1.46% of all branches

    0.396422458 seconds time elapsed

    0.380444000 seconds user
    0.016018000 seconds sys

    One disadvantage of this LaTeX benchmark is that how much it has to do
    depends on the LaTeX version and on the installed packages.
    In the present case both systems use TeX Live 2020, so that should
    make no difference. The Skylake has a lot of packages installed and
    executes more instructions for the LaTeX benchmark than any other
    AMD64 system where I measured the executed instructions. I do not
    know how many packages are installed on the Power10 system.

    Comparing the number of branches is interesting. If we assume that
    compiler transformations like if-conversion and loop unrolling do
    *not* cause a significant difference in branches, the similar number
    of branches suggests that the workload is comparable.

    In any case, both results indicate that the 3900MHz Power10 has a
    lower single-thread performance than a 4GHz Skylake. So you buy a
    Power10 if you want to process large numbers of threads or processes
    and want to put in lots of RAM.
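
    A quick recomputation from the numbers above (the constants are copied
    verbatim from the two perf runs quoted earlier):

    #include <stdio.h>

    int main( void )
    {
        /* numbers copied from the perf output above */
        double p10_cycles = 1897813538.0, p10_insns = 4411333480.0, p10_secs = 0.488029554;
        double sky_cycles = 1584116718.0, sky_insns = 3015599019.0, sky_secs = 0.396422458;

        printf( "Power10: %.2f IPC, %.3f s elapsed\n", p10_insns / p10_cycles, p10_secs );
        printf( "Skylake: %.2f IPC, %.3f s elapsed\n", sky_insns / sky_cycles, sky_secs );
        printf( "Power10/Skylake elapsed-time ratio: %.2f\n", p10_secs / sky_secs );
        return 0;
    }

    This prints 2.32 and 1.90 IPC and an elapsed-time ratio of about 1.23: the
    Power10 executes about 46% more instructions for this workload (which, as
    noted above, may partly be an artifact of the installed packages), and that
    outweighs its higher IPC.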

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Nov 20 16:40:05 2023
    On Mon, 20 Nov 2023 07:34:09 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    My uneducated guess is that in sigle-threaded integer performance
    POWER10 should be approximately on par with the best Intel Skylake

    Slightly more educated:

    gforth-fast onebench.fs (development version of Gforth), lower is
    better:

    sieve bubble matrix fib fft version; machine
    0.075 0.099 0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
    0.061 0.090 0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz;
    gcc-10.1

    Another benchmark I can run quickly is the LaTeX benchmark <http://www.complang.tuwien.ac.at/anton/latex-bench/>:

    On the Power10:

    487.14 msec task-clock # 0.998 CPUs
    utilized 3 context-switches # 6.158 /sec
    1 cpu-migrations # 2.053 /sec
    590 page-faults # 1.211 K/sec
    1897813538 cycles # 3.896 GHz
    4411333480 instructions # 2.32 insn per
    cycle 542280135 branches # 1.113 G/sec
    11170586 branch-misses # 2.06% of all
    branches

    0.488029554 seconds time elapsed

    0.477731000 seconds user
    0.010164000 seconds sys

    On the Core i5-6600K (Skylake):

    396.04 msec task-clock # 0.999 CPUs
    utilized 2 context-switches # 0.005 K/sec
    0 cpu-migrations # 0.000 K/sec
    7091 page-faults # 0.018 M/sec
    1584116718 cycles # 4.000 GHz
    3015599019 instructions # 1.90 insn per
    cycle 564834472 branches # 1426.217 M/sec
    8230928 branch-misses # 1.46% of all
    branches

    0.396422458 seconds time elapsed

    0.380444000 seconds user
    0.016018000 seconds sys

    One disadvantage of this LaTeX benchmark is that it depends on the
    LaTeX version, and on the installed packages how much it has to do.
    In the present case both systems use Tex Live 2020, so that should
    make no difference. The Skylake has a lot of packages installed and
    executes more instructions for the LaTeX benchmark than any other
    AMD64 system where I measured the executed instructions. I do not
    know how many packages are installed on the Power10 system.

    Comparing the number of branches is interesting. If we assume that
    compiler transformations like if-conversion and loop unrolling do
    *not* cause a significant difference in branches, the similar number
    of branches suggests that the workload is comparable.

    In any case, both results indicate that the 3900MHz Power10 has a
    lower single-thread performance than a 4GHz Skylake. So you buy a
    Power10 if you want to process large numbers of threads or processes
    and want to put in lots of RAM.

    - anton

    That's interesting, thank you.

    Don't your benchmarks have rather small datasets?
    If true, that will put the likes of POWER10 (or of Skylake-SP/Skylake-X)
    with their big, slow L2 caches at a disadvantage relative to Skylake
    Client, which has a small, fast L2 cache.

    But if your numbers are representative, then POWER10 can be matched by
    rather ancient Intel CPUs, e.g. by the 9.5-year-old i7-4790. Even my Xeon
    E3-1271 v3 would stand a chance. Or AMD Zen1.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to EricP on Mon Nov 20 11:00:30 2023
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push
    weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers.  It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Non-concurrent code does not see the relaxed ordering, and should
    benefit
    from extra concurrency in the Load Store Queue and cache that
    relaxed rules
    allow, because the local core always sees its own memory as
    consistent.
    For example, relaxed ordering allows multiple LD and ST to be in
    multiple pipelines to multiple cache banks at once without regard
    as to the exact order the operations are applied.

    This is fine for non concurrently accessed data structures,
    either non-shared data areas or shared but guarded by mutexes.

    But relaxed is hard for people to reason about for concurrently
    accessed
    lock free data structures. Now these don't just appear out of thin
    air so
    it is reasonable for a program to emit TSO_START and TSO_END
    instructions.

    On the other hand, almost no code is lock-free or ever will be.
    So why have all the extra HW logic to support TSO if its only really
    needed for this rare kind of programming.

    But there is also a category of memory area that is not covered by
    the
    above rules, one where one core thinks its memory is local and not
    shared
    but in fact it is being accessed concurrently.

    If thread T1 (say an app) on core C1 says its memory is relaxed,
    and calls
    a subroutine passing a pointer to a value on T1's stack, and that
    pointer
    is passed to thread T2 (a driver) on core C2 which accesses that
    memory,
    then even if T2 declared itself to be using TSO rules it would not
    force
    T1 on C1 obey them.

    Where this approach could fail is the kind of laissez-faire
    sharing done
    by many apps, libraries, and OS's behind the scenes in the real
    world.

    Another possibility is for non-shared memory to be handled
    differently. (This is similar to My 66000's handling of memory
    types and things mentioned by Mitch Alsup here.)

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Various memory partitioning schemes theoretically can provide
    similar benefits for systems with shared memory controllers where
    programs do not share most modifiable data with other programs.
    Even something like a web hosting system might be able to benefit
    from lack of coherence (much less consistency) between different
    web hosts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Nov 20 11:32:00 2023
    MitchAlsup [2023-11-17 18:37:17] wrote:
    Stefan Monnier wrote:
    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    [...]
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event.
    [...]
    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    esmLOCK( fn = fr->next ); // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn ); // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn; // move the bits around
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCK( fr->next = tn );
    return TRUE;
    }
    return FALSE;
    }

    This is nice, but the onus is still on the programmer to manually make
    sure they don't forget to always `esmLOCK` all the shared data, that
    they don't `esmLOCK` the data that doesn't need it, ...

    In contrast, Haskell's STM will emit a type error if you ever try to use
    a shared var in a non-atomic sequence (or if you try to do things like
    `printf` from within an atomic sequence, ...).

    I see Haskell's STM as a kind of "ideal API" for the programmer.
    And it works tolerably in many cases. But reconciling this kind of
    abstraction with the performance you can get by low-level twiddling is
    the hard part :-)


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Mon Nov 20 15:50:08 2023
    Michael S <already5chosen@yahoo.com> writes:
    Do not your benchmarks have rather small datasets?
    If true, it will put likes of POWER10 (or of Skylake-SP/Skylake-X)
    with their big, slow L2 caches at disadvantage relatively to Skylake
    Client that has small fast L2 cache.

    These benchmarks don't miss L1 much, so the size and latency of the L2
    have little influence. For the LaTeX benchmark:

    Performance counter stats for 'latex bench':

    2963866155 instructions:u
    874909912 L1-dcache-loads
    13846145 L1-dcache-load-misses #1.58% of all L1-dcache accesses
    2754917 LLC-loads
    227951 LLC-stores

    0.394554268 seconds time elapsed

    0.370677000 seconds user
    0.023914000 seconds sys

    Interestingly, ~20% of the D-cache load misses become LLC loads
    (i.e. L2 load misses), so a larger L2 may have an advantage. Let's
    see on a Tiger Lake:

    2955784200 instructions:u
    864268345 L1-dcache-loads
    11751007 L1-dcache-load-misses #1.36% of all L1-dcache accesses
    393018 LLC-loads
    13868 LLC-stores

    0.306192273 seconds time elapsed

    0.289893000 seconds user
    0.016105000 seconds sys

    So, going from 256KB L2 to 1280KB L2 reduces the number of L2 load
    misses by a factor of 7 on this benchmark (well, there is TeX Live
    2022 and possibly a different set of packages on the Tiger Lake
    machine, so there is also the question of comparability).

    The results I see on the Power10 machine are strange:

    Performance counter stats for 'latex bench':

    9622223 L1-dcache-load-misses
    9847374 LLC-loads

    0.476681995 seconds time elapsed

    0.466409000 seconds user
    0.010139000 seconds sys

    Maybe prefetches are counted as LLC-loads.

    For the small gforth benchmarks on Skylake:

    2205651226 instructions:u
    793549469 L1-dcache-loads
    4115040 L1-dcache-load-misses #0.52% of all L1-dcache accesses
    852553 LLC-loads
    26356 LLC-stores

    0.265190457 seconds time elapsed

    0.260044000 seconds user
    0.004036000 seconds sys

    An even smaller D-cache miss rate, and a similar ~20% L2 miss rate
    (which constitutes an even smaller proportion of execution time).

    But if your numbers are representative then POWER10 can be matched by
    rather ancient Intel CPUs. E.g. by 9.5 y.o. i7-4790. Even my Xeon
    E3-1271 v3 would stay a chance. Or by AMD Zen1.

    Certainly on these benchmarks. There may be applications that benefit
    from what Power10 has to offer, but my guess is that for most
    applications, you are better off with a somewhat recent Intel or AMD
    CPU.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Mon Nov 20 17:26:43 2023
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push
    weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers.  It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Why do you think that 'stack' would be thread private? It's
    quite common to allocate long-lived data structures on the
    stack and pass the address of the object to code that may
    be executing in the context of other threads. So long as the lifetime of
    the object extends beyond the last reference, of course.

    Objects allocated on the stack in 'main', for instance.
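
    For instance (a minimal sketch, names invented; the point is only that the
    object lives in main's stack frame yet is accessed from another thread,
    with its lifetime extending past the last reference):

    #include <pthread.h>
    #include <stdio.h>

    struct work { int input; int result; };

    static void *worker( void *arg )
    {
        struct work *w = arg;              /* points into main's stack frame */
        w->result = w->input * 2;
        return NULL;
    }

    int main( void )
    {
        struct work w = { .input = 21 };   /* allocated on main's stack */
        pthread_t t;

        pthread_create( &t, NULL, worker, &w );
        pthread_join( t, NULL );           /* w outlives the last reference */
        printf( "%d\n", w.result );
        return 0;
    }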

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Paul A. Clayton on Mon Nov 20 13:26:39 2023
    Paul A. Clayton wrote:
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push weird
    schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers. It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Non-concurrent code does not see the relaxed ordering, and should benefit
    from extra concurrency in the Load Store Queue and cache that relaxed
    rules
    allow, because the local core always sees its own memory as consistent.
    For example, relaxed ordering allows multiple LD and ST to be in
    multiple pipelines to multiple cache banks at once without regard
    as to the exact order the operations are applied.

    This is fine for non concurrently accessed data structures,
    either non-shared data areas or shared but guarded by mutexes.

    But relaxed is hard for people to reason about for concurrently accessed
    lock free data structures. Now these don't just appear out of thin air so
    it is reasonable for a program to emit TSO_START and TSO_END
    instructions.

    On the other hand, almost no code is lock-free or ever will be.
    So why have all the extra HW logic to support TSO if its only really
    needed for this rare kind of programming.

    But there is also a category of memory area that is not covered by the
    above rules, one where one core thinks its memory is local and not shared
    but in fact it is being accessed concurrently.

    If thread T1 (say an app) on core C1 says its memory is relaxed, and
    calls
    a subroutine passing a pointer to a value on T1's stack, and that pointer
    is passed to thread T2 (a driver) on core C2 which accesses that memory,
    then even if T2 declared itself to be using TSO rules it would not force
    T1 on C1 obey them.

    Where this approach could fail is the kind of laissez-faire sharing done
    by many apps, libraries, and OS's behind the scenes in the real world.

    Another possibility is for non-shared memory to be handled
    differently. (This is similar to My 66000's handling of memory
    types and things mentioned by Mitch Alsup here.)

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Various memory partitioning schemes theoretically can provide
    similar benefits for systems with shared memory controllers where
    programs do not share most modifiable data with other programs.
    Even something like a web hosting system might be able to benefit
    from lack of coherence (much less consistency) between different
    web hosts.

    But this is my point: in many programs there is no memory that
    you can point to and say it is always private to a single thread.
    And this is independent of language; it's to do with program structure.

    You can say a certain memory range is shared and guarded by locks,
    or shared and managed by lock-free code.
    And we can point to this because the code is modularized this way.

    But the opposite of 'definitely shared' is not 'definitely private',
    it's 'don't know' or 'sometimes'.

    Eg: Your code pops up a dialog box, prompts the user for an integer value,
    and writes that value to a program global variable.
    Do you know whether that dialog box is a separate thread?
    Should you care? What if the dialog starts out in the same thread,
    and a new release then changes it to a separate thread?
    What if the variable is on the thread stack or in a heap?

    On a weak ordered system I definitely need to know because
    I need barriers to ensure I can read the variable properly.

    And this is where TSO makes the difference.
    I don't have to know exactly every byte that might be updated concurrently
    in some context at some time (and that context can change dynamically).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Tue Nov 21 00:52:08 2023
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push
    weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers.  It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Why do you think that 'stack' would be thread private? It's
    quite common to allocate long-lived data structures on the
    stack and pass the address of the object to code that may
    be executing in the context of other threads. So long as the lifetime of the object extends beyond the last reference, of course.

    Even Thread Local Store is not private to the thread if the thread
    creates a pointer into it and allows others to see the pointer.

    The only thing the HW can validate as non-shared is that portion of
    the stack containing callee save registers (and the return address),
    but only 2 known architectures have these chunks of memory in an
    address space where threads cannot read, write, or execute that chunk.

    Objects allocated on the stack in 'main', for instance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to EricP on Tue Nov 21 13:44:56 2023
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    But this is my point: in many programs there is no memory that
    you can point to and say it is always private to a single thread.

    That's more about program design. In a multi-threaded program this is
    something you really should know.

    And this is independent of language, its to do with program structure.

    You can say a certain memory range is shared and guarded by locks,
    or shared and managed by lock-free code.
    And we can point to this because the code in modularized this way.

    But the opposite of 'definitely shared' is not 'definitely private',
    it's 'dont know' or 'sometimes'.

    Eg: Your code pops up a dialog box, prompts the user for an integer value, and writes that value to a program global variable.
    Do you know whether that dialog box is a separate thread?
    Should you care? What if the dialog starts out in the same thread,
    then a new release changes it to a separate thread.
    What if the variable is on the thread stack or in a heap?

    On a weak ordered system I definitely need to know because
    I need barriers to ensure I can read the variable properly.

    In this case, no, I don't think you do. Barriers only control the
    ordering between accesses, not when they become visible, and here
    there's only one access. If there are at least two, and you really
    need to see one before the other, then you need a barrier. And even on
    a TSO machine, you're going to have to do something on both the reader
    and the writer sides if you need ordering to be protected from a
    compiler.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Tue Nov 21 09:29:43 2023
    MitchAlsup wrote:
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push weird
    schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers. It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Why do you think that 'stack' would be thread private? It's
    quite common to allocate long-lived data structures on the
    stack and pass the address of the object to code that may
    be executing in the context of other threads. So long as the
    lifetime of
    the object extends beyond the last reference, of course.

    Even Thread Local Store is not private to the thread if the thread
    creates a pointer into it and allows others to see the pointer.

    The only thing the HW can validate as non-shared is that portion of
    the stack containing callee save registers (and the return address)
    but only 2 known architectures have these chunks of memory is an
    address space where threads cannot read-write-or-execute that chunk.

    The callee save area may be R-W-E page protected against its own thread,
    but that doesn't prevent a privileged thread from concurrently accessing
    that save area (say, to edit the stack to deliver a signal),
    so the same coherence rules apply there too.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to aph@littlepinkcloud.invalid on Tue Nov 21 12:11:00 2023
    aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    But this is my point: in many programs there is no memory that
    you can point to and say it is always private to a single thread.

    That's more about program design. In a multi-threaded program this is something you really should know.

    And this is independent of language, its to do with program structure.

    You can say a certain memory range is shared and guarded by locks,
    or shared and managed by lock-free code.
    And we can point to this because the code in modularized this way.

    But the opposite of 'definitely shared' is not 'definitely private',
    it's 'dont know' or 'sometimes'.

    Eg: Your code pops up a dialog box, prompts the user for an integer value, >> and writes that value to a program global variable.
    Do you know whether that dialog box is a separate thread?
    Should you care? What if the dialog starts out in the same thread,
    then a new release changes it to a separate thread.
    What if the variable is on the thread stack or in a heap?

    On a weak ordered system I definitely need to know because
    I need barriers to ensure I can read the variable properly.

    In this case, no, I don't think you do. Barriers only control the
    ordering between accesses, not when they become visible, and here
    there's only one access. If there are at least two, and you really
    need to see one before the other, then you need a barrier.

    The barriers also ensure the various local buffers, pipelines, and
    inbound and outbound comms command and reply message queues are drained.
    A barrier ensures that the operations that came before it have reached
    their coherency point (the cache controller) and that any
    outstanding asynchronous operations are complete.
    And that in turn controls when values become visible.

    On a weakly ordered CPU with no store ordering, the CPU is not required
    to propagate any store into the cache within any period of time.
    It can stash it in a write-combine buffer, waiting to see if more
    updates to the same line appear.

    Weak ordering requires a membar after a store to force it into the cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores re-read the line they see the updated value.

    In other words, to retire the membar instruction the core must force the
    prior store values into the coherent cache making them globally visible.

    The difference for TSO is that a store has implied membars before it to
    prevent it bypassing (executing before) older loads and stores.

    And even on
    a TSO machine, you're going to have to do something on both the reader
    and the writer sides if you need ordering to be protected from a
    compiler.

    Andrew.

    Compilers are a different discussion.
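
    As a minimal C11 sketch of where that membar sits (names invented): on a
    weakly ordered machine the fences below become real barrier instructions,
    while under TSO they compile to nothing but still constrain the compiler,
    which is Andrew's point above.

    #include <stdatomic.h>

    int data;                 /* the payload */
    atomic_int ready;         /* the flag that publishes it */

    void publish( int v )     /* writer */
    {
        data = v;
        /* On a weakly ordered machine this fence is the membar that forces
           the store to data to become visible before the store to ready.
           Under TSO the hardware already keeps the two stores in order. */
        atomic_thread_fence( memory_order_release );
        atomic_store_explicit( &ready, 1, memory_order_relaxed );
    }

    int consume( void )       /* reader */
    {
        while( !atomic_load_explicit( &ready, memory_order_relaxed ) )
            ;                 /* spin until published */
        atomic_thread_fence( memory_order_acquire );
        return data;
    }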

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Tue Nov 21 18:18:49 2023
    EricP wrote:

    MitchAlsup wrote:
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push weird
    schemes
    on everyone else to try to come up with algorithms which need fewer >>>>>> barriers. It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Why do you think that 'stack' would be thread private? It's
    quite common to allocate long-lived data structures on the
    stack and pass the address of the object to code that may
    be executing in the context of other threads. So long as the
    lifetime of
    the object extends beyond the last reference, of course.

    Even Thread Local Store is not private to the thread if the thread
    creates a pointer into it and allows others to see the pointer.

    The only thing the HW can validate as non-shared is that portion of
    the stack containing callee save registers (and the return address)
    but only 2 known architectures have these chunks of memory in an
    address space where threads cannot read-write-or-execute that chunk.

    The callee save area may be R-W-E page protected against its own thread
    but it doesn't prevent a privileged thread from concurrently accessing
    that save area (say to edit the stack to deliver a signal)
    so the same coherence applies there too.


    My 66000 monitors the PTE used to translate the portion of Call Stack
    Pointer such that RWE = 000 or will page-fault. This prevents malicious
    code from damaging the contract between caller and callee. There is
    an SP the application uses to create and destroy local data and there
    is a separate CSP the application cannot even see; only ENTER, EXIT, and
    RET instructions can use CSP, and these verify that RWE == 000.


    It is certainly true that a different set of mapping tables can access
    anything those tables map.

  • From Chris M. Thomasson@21:1/5 to EricP on Tue Nov 21 14:04:51 2023
    On 11/21/2023 9:11 AM, EricP wrote:
    aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    But this is my point: in many programs there is no memory that
    you can point to and say it is always private to a single thread.

    That's more about program design. In a multi-threaded program this is
    something you really should know.

    And this is independent of language, its to do with program structure.

    You can say a certain memory range is shared and guarded by locks,
    or shared and managed by lock-free code.
    And we can point to this because the code in modularized this way.

    But the opposite of 'definitely shared' is not 'definitely private',
    it's 'dont know' or 'sometimes'.

    Eg: Your code pops up a dialog box, prompts the user for an integer
    value,
    and writes that value to a program global variable.
    Do you know whether that dialog box is a separate thread?
    Should you care? What if the dialog starts out in the same thread,
    then a new release changes it to a separate thread.
    What if the variable is on the thread stack or in a heap?

    On a weak ordered system I definitely need to know because
    I need barriers to ensure I can read the variable properly.

    In this case, no, I don't think you do. Barriers only control the
    ordering between accesses, not when they become visible, and here
    there's only one access. If there are at least two, and you really
    need to see one before the other, then you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are drained.
    It ensures that the operations that came before it have reached
    their coherency point - the cache controller - and that any
    outstanding asynchronous operations are complete.
    And that in turn controls when values become visible.

    On a weak order cpu with no store ordering, the cpu is not required
    to propagate any store into the cache within any period of time.
    It can stash it in a write combine buffer waiting to see if more
    updates to the same line appear.

    Weak order requires a membar after a store to force it into the cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    In other words, to retire the membar instruction the core must force the prior store values into the coherent cache making them globally visible.

    Fwiw, it depends on the barrier. Wrt x86, think along the lines of
    LOCK'ED RMW, MFENCE, SFENCE, LFENCE. Iirc, the *FENCE instructions are
    for WB memory, also I think MFENCE can be used for non-wb membar as
    well... Cannot remember right now, damn it.

    On SPARC: #StoreLoad, #LoadStore, #LoadLoad, #StoreStore

    I think there is another one on the SPARC that I am forgetting.


    The difference for TSO is that a store has implied membars before it to prevent it bypassing (executing before) older loads and stores.

    And even on
    a TSO machine, you're going to have to do something on both the reader
    and the writer sides if you need ordering to be protected from a
    compiler.

    Andrew.

    Compilers are a different discussion.



    C++ compilers get it right because of std::atomic; a compiler barrier is different from a memory barrier.
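
    A small C sketch of that distinction (illustrative only; the empty asm
    statement is the usual GCC/Clang compiler-barrier idiom): the compiler
    barrier stops the compiler from reordering the two stores but emits no
    fence instruction, while the release store also constrains the CPU.

        #include <stdatomic.h>

        int data;
        int flag_plain;              /* deliberately not atomic */
        atomic_int flag_atomic;

        void publish_compiler_barrier_only(void)
        {
            data = 1;
            __asm__ __volatile__("" ::: "memory");  /* compiler barrier: no fence emitted */
            flag_plain = 1;          /* a weakly ordered CPU may still reorder the stores */
        }

        void publish_with_memory_barrier(void)
        {
            data = 1;
            atomic_store_explicit(&flag_atomic, 1, memory_order_release);
            /* release store: orders the hardware as well as the compiler */
        }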

  • From aph@littlepinkcloud.invalid@21:1/5 to EricP on Wed Nov 22 09:35:42 2023
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    Weak order requires a membar after a store to force it into the cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    In other words, to retire the membar instruction the core must force the prior store values into the coherent cache making them globally visible.

    That's usually true for a StoreLoad, not for any of the others.

    The difference for TSO is that a store has implied membars before it to prevent it bypassing (executing before) older loads and stores.

    And even on a TSO machine, you're going to have to do something on
    both the reader and the writer sides if you need ordering to be
    protected from a compiler.

    Compilers are a different discussion.

    Not for the user. It's the same problem, with the same solution. It
    makes no difference to a C programmer where the reordering comes from.

    Andrew.

  • From MitchAlsup@21:1/5 to aph@littlepinkcloud.invalid on Wed Nov 22 20:38:06 2023
    aph@littlepinkcloud.invalid wrote:

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    Yes, exactly, a MemBar sets a boundary where all younger memory references
    of one type (or both) must wait to become visible until all older memory
    references of the other type have become visible. Only 1 or 2 bits of state
    per queued memory ref is altered by an explicit MemBar or an implicit barrier
    (from change in operating mode).

    Weak order requires a membar after a store to force it into the cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    What if it is not-cacheable ?? or MMI/O ?? or configuration space ??

    In other words, to retire the membar instruction the core must force the
    prior store values into the coherent cache making them globally visible.

    Just Visible, the rest is not under SW control.

    That's usually true for a StoreLoad, not for any of the others.

    The difference for TSO is that a store has implied membars before it to
    prevent it bypassing (executing before) older loads and stores.

    Which is why TSO is slow(er).

    And even on a TSO machine, you're going to have to do something on
    both the reader and the writer sides if you need ordering to be
    protected from a compiler.

    Compilers are a different discussion.

    Not for the user. It's the same problem, with the same solution. It
    makes no difference to a C programmer where the reordering comes from.

    Somebody has to create the rules by which HW and SW cooperatively manage
    memory order in the presence of a multiplicity of CPUs and devices.

    Andrew.

  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Wed Nov 22 14:24:56 2023
    On 11/22/2023 12:38 PM, MitchAlsup wrote:
    aph@littlepinkcloud.invalid wrote:

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    Yes, exactly, an MemBar sets a boundary where all younger memory references of one type (or both) must wait to become visible until all older memory references of the other Types have become visible. Only 1 or 2 bits of
    state
    per queued memory ref is altered by an explicit MemBar or an implicit
    barrier
    (from change in operating mode).
    [...]

    There are different kinds of memory barriers. For instance, a mutex can
    use #LoadStore ordering; however, Dekker's algorithm as-is requires
    #StoreLoad. Also, think of eieio... ;^) Btw, have you heard of
    asymmetric mutexes before?
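
    A minimal C11 sketch of why the Dekker-style entry needs #StoreLoad (names
    are mine; on SPARC the fence below would be membar #StoreLoad, and a
    seq_cst fence is the usual portable stand-in): each thread stores its own
    flag and must then observe the other thread's store.

        #include <stdatomic.h>
        #include <stdbool.h>

        static atomic_bool want0, want1;

        /* Entry attempt for thread 0 of a Dekker-style protocol. Without the
           fence, the store to want0 can be delayed past the load of want1
           (store-load reordering), and both threads can enter at once. */
        bool try_enter_thread0(void)
        {
            atomic_store_explicit(&want0, true, memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);   /* the #StoreLoad point */
            if (atomic_load_explicit(&want1, memory_order_relaxed)) {
                atomic_store_explicit(&want0, false, memory_order_relaxed);
                return false;                            /* back off and retry */
            }
            return true;                                 /* may enter */
        }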

  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Wed Nov 22 14:22:15 2023
    On 11/21/2023 2:04 PM, Chris M. Thomasson wrote:
    On 11/21/2023 9:11 AM, EricP wrote:
    aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    But this is my point: in many programs there is no memory that
    you can point to and say it is always private to a single thread.

    That's more about program design. In a multi-threaded program this is
    something you really should know.

    And this is independent of language; it's to do with program structure.
    You can say a certain memory range is shared and guarded by locks,
    or shared and managed by lock-free code.
    And we can point to this because the code in modularized this way.

    But the opposite of 'definitely shared' is not 'definitely private',
    it's 'dont know' or 'sometimes'.

    Eg: Your code pops up a dialog box, prompts the user for an integer
    value,
    and writes that value to a program global variable.
    Do you know whether that dialog box is a separate thread?
    Should you care? What if the dialog starts out in the same thread,
    then a new release changes it to a separate thread.
    What if the variable is on the thread stack or in a heap?

    On a weak ordered system I definitely need to know because
    I need barriers to ensure I can read the variable properly.

    In this case, no, I don't think you do. Barriers only control the
    ordering between accesses, not when they become visible, and here
    there's only one access. If there are at least two, and you really
    need to see one before the other, then you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are drained.
    It ensures that the operations that came before it have reached
    their coherency point - the cache controller - and that any
    outstanding asynchronous operations are complete.
    And that in turn controls when values become visible.

    On a weak order cpu with no store ordering, the cpu is not required
    to propagate any store into the cache within any period of time.
    It can stash it in a write combine buffer waiting to see if more
    updates to the same line appear.

    Weak order requires a membar after a store to force the it into the
    cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    In other words, to retire the membar instruction the core must force the
    prior store values into the coherent cache making them globally visible.

    Fwiw, it depends on the barrier. Wrt x86, think along the lines of
    LOCK'ED RMW, MFENCE, SFENCE, LFENCE. Iirc, the *FENCE instructions are
    for WB memory, also I think MFENCE can be used for non-wb membar as
    well... Cannot remember right now, damn it.

    On SPARC: #StoreLoad, #LoadStore, #LoadLoad, #StoreStore

    I think there is another one on the SPARC that I am forgetting.


    The difference for TSO is that a store has implied membars before it to
    prevent it bypassing (executing before) older loads and stores.

    And even on
    a TSO machine, you're going to have to do something on both the reader
    and the writer sides if you need ordering to be protected from a
    compiler.

    Andrew.

    Compilers are a different discussion.



    C++ compilers get it right because of std::atomic; a compiler barrier is different from a memory barrier.


    Also, let's not forget EIEIO on the PPC. ;^)

  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Wed Nov 22 21:53:16 2023
    On 11/19/2023 2:23 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/19/2023 11:32 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/17/2023 5:03 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:

    As long as you fully analyze your program, ensure all multithreaded accesses
    are only through atomic variables, and you label every access to an
    atomic variable properly (although my point is: exactly what should that
    be??), then there is no problem.

    BTW, the above sounds daunting when writing in C because you have to do
    that analysis yourself, but there are programming languages out there
    which will do that analysis for you as part of type checking.
    I'm thinking here of languages like Rust or the STM library of
    Haskell.  This also solves the problem that memory accesses can be
    reordered by the compiler, since in that case the compiler is fully
    aware of which accesses can be reordered and which can't.


            Stefan
    <
    I created the Exotic Synchronization Method such that you could just
    write the code needed to do the work, and then decorate those accesses
    which are participating in the ATOMIC event. So, let's say you
    want to move an element from one doubly linked list to another place in some
    other doubly linked list:: you would write::
    <
    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        fn = fr->next;
        fp = fr->prev;
        tn = to->next;



        if( TRUE )
        {
                 fp->next = fn;
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
                 fr->next = tn;
                 return TRUE;
        }
        return FALSE;
    }

    In order to change this into a fully qualified ATOMIC event, the code
    is decorated as::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        esmLOCK( fn = fr->next );         // get data
        esmLOCK( fp = fr->prev );
        esmLOCK( tn = to->next );
        esmLOCK( fn );                    // touch data
        esmLOCK( fp );
        esmLOCK( tn );
        if( !esmINTERFERENCE() )
        {
                 fp->next = fn;           // move the bits around
                 fn->prev = fp;
                 to->next = fr;
                 tn->prev = fr;
                 fr->prev = to;
        esmLOCK( fr->next = tn );
                 return TRUE;
        }
        return FALSE;
    }

    Having a multiplicity of containers participate in an ATOMIC event
    is key to making ATOMIC stuff fast and needing fewer ATOMICs
    to get the job(s) done.

    That looks suspiciously like transactional memory.

    Indeed, it does. Worried about live lock wrt esmINTERFERENCE().

    esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
    conditional branch where the condition is delivered from the miss
    buffer (where I detect interference wrt participating cache lines.)

    So, can false sharing on a participating cache line make
    esmINTERFERENCE() return true?




    It has some flavors of such, but::
    it has no nesting,
    it has a strict limit of 8 participating cache lines,
    it automagically transfers control when disruptive interference is
    detected,
    it is subject to timeouts;

    But does have the property that all interested 3rd parties see
    participating
    memory only in the before or only in the completely after states.

    Are you familiar with KCSS? K-Compare Single Swap?

    https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

    Easily done:
         esmLOCK( c1 = p1->condition1 );
         esmLOCK( c2 = p2->condition2 );
         ...
         if( c1 == C1 && c2 == C2 && c3 == C3 ... )
             ...
             esmLOCK( some data );

    Esm was designed to allow any known synchronization means (in 2013)
    to be directly implemented in esm either inline or via subroutine
    calls.

    I can see how that would work. The problem is that I am not exactly
    sure how esmINTERFERENCE works internally... Can it detect/prevent
    live lock?

    esmINTERFERENCE is a branch-on-interference instruction. This is a conditional
    branch instruction that queries whether any of the participating cache lines
    has seen a read-with-intent or coherent-invalidate. In effect the branch
    logic reaches out to the miss buffer and asks if any of the participating
    cache lines has been snooped-for-write:: and you can't do this in 2
    instructions or you lose ATOMICity.

    If it has, control is transferred and the ATOMIC event is failed.
    If it has not, and all participating cache lines are present, then this
    core is allowed to NAK all requests to those participating cache lines
    {and control is not transferred}.

    So, you gain control over where flow goes on failure, and essentially
    commit the whole event to finish.

    So, while it does not eliminate live/dead-lock situations, it allows SW
    to be constructed to avoid live/dead-lock situations:: the "why" is a value
    which is provided when an ATOMIC event fails: 0 means success, negative
    values are spurious (buffer overflows, ...), while positives represent
    the number of competing threads. So the following case skips elements
    on a linked list to decrease future interference.

    Element* getElement( unSigned  Key )
    {
        int count = 0;
        for( p = structure.head; p ; p = p->next )
        {
             if( p->Key == Key )
             {
                  if( count-- < 0 )
                  {
                       esmLOCK( p );
                       prev = p->prev;
                       esmLOCK( prev );
                       next = p->next;
                       esmLOCK( next );
                       if( !esmINTERFERENCE() )
                       {
                             prev->next = next;
                             next->prev = prev;
                             p->prev = NULL;
                    esmLOCK( p->next = NULL );
                            return p;
                       }
                       else
                       {
                            count = esmWHY();
                            p = structure.head;
                       }
                  }
             }
        }
        return NULL;
    }

    Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
    that happens when a timer goes off and n threads all want access to the
    work queue down to BigO( 3 ) (yes, constant), but in practice it is reduced
    to BigO( ln( n ) ) when requesters arrive in random order at random times.

    I remember hearing from my friend Joe Seigh, who worked at IBM, that
    they had some sort of logic that would prevent live lock in a compare
    and swap wrt their free pool manipulation logic. Iirc, it was somewhat
    related to ABA, hard to remember right now, sorry. I need to find that
    old thread back in comp.programming.threads.

    Depending on system size: there can be several system function units
    that grant "order" for ATOMIC events. These are useful for 64+node systems and unnecessary for less than 8-node systems. Disjoint memory spaces
    can use independent ATOMIC arbiters and whether they are in use or not is invisible to SW.


    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    Humm, your arch seems pretty neat/interesting to me. I need to learn more
    about it. Can it be abused with a rogue thread that keeps altering
    cache line(s) that are participating in the atomic block, so to speak?
    Anyway, I am busy with family time. Will get back to you.

    Fwiw, here is some of my work:

    https://youtu.be/HwIkk9zENcg

  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Thu Nov 23 16:33:17 2023
    Chris M. Thomasson wrote:

    On 11/19/2023 2:23 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    So, while it does not eliminate live/dead-lock situations, it allows SW
    to be constructed to avoid live/dead lock situations:: Why is a value
    which is provided when an ATOMIC event fails. 0 means success, negative
    values are spurious (buffer overflows,...) while positives represent
    the number of competing threads, so the following case, skips elements
    on a linked list to decrease future initerference.

    Element* getElement( unSigned  Key )
    {
        int count = 0;
        for( p = structure.head; p ; p = p->next )
        {
             if( p->Key == Key )
             {
                  if( count-- < 0 )
                  {
                       esmLOCK( p );
                       prev = p->prev;
                       esmLOCK( prev );
                       next = p->next;
                       esmLOCK( next );
                       if( !esmINTERFERENCE() )
                       {
                            p->prev = next;
                            p->next = prev;
                            p->prev = NULL;
                   esmLOCK( p->next = NULL );
                            return p;
                       }
                       else
                       {
                            count = esmWHY();
                            p = structure.head;
                       }
                  }
             }
        }
        return NULL;
    }

    Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
    that happens when a timer goes off and n threads all want access to the
    work queue, down to BigO( 3 ) yes=constant, but in practice it is reduced
    to BigO( ln( n ) ) when requesters arrive in random order at random time.

    I remember hearing from my friend Joe Seigh, who worked at IBM, that
    they had some sort of logic that would prevent live lock in a compare
    and swap wrt their free pool manipulation logic. Iirc, it was somewhat
    related to ABA, hard to remember right now, sorry. I need to find that
    old thread back in comp.programming.threads.

    Depending on system size: there can be several system function units
    that grant "order" for ATOMIC events. These are useful for 64+node systems >> and unnecessary for less than 8-node systems. Disjoint memory spaces
    can use independent ATOMIC arbiters and whether they are in use or not is
    invisible to SW.


    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    Humm, you arch seems pretty neat/interesting to me. I need to learn more about it. Can it be abused with a rouge thread that keeps altering a cacheline(s) that are participating in the atomic block, so to speak?
    Anyway, I am busy with family time. Will get back to you.

    While possible, it is a lot less likely than on a similar architecture
    without any of the bells and whistles.

    Fwiw, here is some of my work:

    https://youtu.be/HwIkk9zENcg

    Octopi in a box playing ball.

  • From EricP@21:1/5 to aph@littlepinkcloud.invalid on Thu Nov 23 13:02:50 2023
    aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.
    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    I say that because that is what the Intel manual says for MFENCE and SFENCE, "The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible"
    and Arm64 for DSB "ensures that memory accesses that occur before the DSB instruction have completed before the completion of the DSB instruction".

    Perhaps you are thinking of Intel LFENCE and Arm64 DMB which are weaker
    and do not require the prior operations to complete but just that they
    are performed or observed. These appear to only look at the local LSQ.

    WRT the draining for synchronization, I'm not sure what you think the difference is between making it appear to be done and actually doing so.
    Sure there may be some optimizations possible, but the net effect on the externally observable cache state must be the same.

    For store values to be globally visible the local node must receive the
    cache line in exclusive state AND have received ACK's from all the prior sharers that they have removed their copies from cache.

    If multiple miss buffers allow multiple pending cache misses then all of
    these outstanding operations must complete before the membar can retire.
    To do that the cache controller must send all pending outbound command
    messages and process all inbound replies, which means drain all the
    message queues until the misses are complete, then apply pending
    operations to the now resident cache lines.

    There are likely ways to optimize the above, overlap some sequences,
    but I don't see any steps that can be removed.

    Weak order requires a membar after a store to force the it into the cache, >> triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    In other words, to retire the membar instruction the core must force the
    prior store values into the coherent cache making them globally visible.

    That's usually true for a StoreLoad, not for any of the others.

    I was thinking of full membars as an example as the others
    are subsets of it and may allow some optimizations.

  • From EricP@21:1/5 to MitchAlsup on Thu Nov 23 13:52:33 2023
    MitchAlsup wrote:
    aph@littlepinkcloud.invalid wrote:

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    Yes, exactly, an MemBar sets a boundary where all younger memory references of one type (or both) must wait to become visible until all older memory references of the other Types have become visible. Only 1 or 2 bits of
    state
    per queued memory ref is altered by an explicit MemBar or an implicit
    barrier
    (from change in operating mode).

    Weak order requires a membar after a store to force the it into the
    cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    What if it is not-cacheable ?? or MMI/O ?? or configuration space ??

    In other words, to retire the membar instruction the core must force the >>> prior store values into the coherent cache making them globally visible.

    Just Visible, the rest is not under SW control.

    There are two kinds of barriers/fences (I don't know if there are official
    terms for them): local bypass barriers and global completion barriers.

    Bypass barriers restrict which younger ops in the local load-store queue
    may bypass and start execution before older ops have made a value locally visible.

    Completion barriers block younger ops from starting execution before
    older ops have completed and read or written globally visible values.

    You appear to be referring to bypass barriers whereas I'm referring to completion barriers which require globally visible results.

  • From Paul A. Clayton@21:1/5 to Scott Lurndal on Thu Nov 23 16:36:17 2023
    On 11/20/23 12:26 PM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push
    weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers.  It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Why do you think that 'stack' would be thread private? It's
    quite common to allocate long-lived data structures on the
    stack and pass the address of the object to code that may
    be executing in the context of other threads. So long as the lifetime of the object extends beyond the last reference, of course.

    Objects allocated on the stack in 'main', for instance.

    I thought (in my ignorance) that because the lifetime of
    stack-allocated data is controlled by the thread (call depth) that
    allocated it, it would not be thread safe. Clearly if one
    could guarantee that the allocating thread did not reduce its
    call depth to below that frame, it would be safe.
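
    A small pthreads sketch of that guarantee (names are my own, not from the
    thread): main keeps its frame alive by joining before it returns, so the
    stack-allocated object outlives every reference made from the worker
    thread.

        #include <pthread.h>
        #include <stdio.h>

        struct config { int verbosity; };

        static void *worker(void *arg)
        {
            struct config *cfg = arg;         /* points into main's stack frame */
            printf("verbosity=%d\n", cfg->verbosity);
            return NULL;
        }

        int main(void)
        {
            struct config cfg = { .verbosity = 2 };  /* stack-allocated, long-lived */
            pthread_t t;
            pthread_create(&t, NULL, worker, &cfg);
            pthread_join(t, NULL);            /* frame not popped until the thread is done */
            return 0;
        }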

    (An allocation made in an initialization stage could generally be
    easy to guarantee not to be deallocated early — only being
    deallocated on program end by the OS — but at that point the early
    stack is in some sense not the same stack as for later uses.)

    I knew that pointers were allocated for arrays and objects on the
    stack and passed to a called function. That feels a bit icky to me
    in that I conceive of a stack frame as being just for that
    function, i.e., my mental model does not reflect practice. (Even
    allocating an array for internal use in a stack frame is
    inconsistent with my mental model which exclusively references off
    the stack pointer with immediate offsets.)

    I do not know how difficult it would be to establish a compilation
    system that did not use "the" stack for such allocations.
    Allocations to the stack have the advantage of simple management
    (just adjust the stack pointer) with the constraint of simple
    timing of allocation and free and the advantage of never missing
    cache for the pointer. Providing a modest prefetched-on-context-
    switch cache (like a Knapsack Cache proposed by Todd Austin for
    reduced latency) would allow multiple stacks/regions to have such
    benefits by placing the next allocation pointers there.

    Effectively extending the register set in that way could have
    other advantages. (Context switch overhead would increase.)
    Providing an "L2 register" storage seems to have some attraction.

  • From MitchAlsup@21:1/5 to EricP on Thu Nov 23 21:14:34 2023
    EricP wrote:

    MitchAlsup wrote:
    aph@littlepinkcloud.invalid wrote:

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    Yes, exactly, an MemBar sets a boundary where all younger memory references >> of one type (or both) must wait to become visible until all older memory
    references of the other Types have become visible. Only 1 or 2 bits of
    state
    per queued memory ref is altered by an explicit MemBar or an implicit
    barrier
    (from change in operating mode).

    Weak order requires a membar after a store to force the it into the
    cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    What if it is not-cacheable ?? or MMI/O ?? or configuration space ??

    In other words, to retire the membar instruction the core must force the >>>> prior store values into the coherent cache making them globally visible. >>
    Just Visible, the rest is not under SW control.

    There are two kinds of barriers/fences (I don't know if there are official terms for them), which are local bypass barriers, and global completion barriers.

    Processor order and Global order. One cannot reason even about single-threaded
    sequential instruction execution (von Neumann) without processor order; similarly,
    one cannot reason about Global (externally visible to a 3rd party) order without
    "the memory model".

    Bypass barriers restrict which younger ops in the local load-store queue
    may bypass and start execution before older ops have made a value locally visible.

    Processor order.

    Completion barriers block younger ops from starting execution before
    older ops have completed and read or written globally visible values.

    Also processor order.

    You appear to be referring to bypass barriers whereas I'm referring to completion barriers which require globally visible results.

    I attempt to describe the reasoning capabilities as seen by both the CPU
    and by the interested 3rd party.

  • From Paul A. Clayton@21:1/5 to EricP on Thu Nov 23 17:57:40 2023
    On 11/20/23 1:26 PM, EricP wrote:
    Paul A. Clayton wrote:
    On 11/13/23 1:22 PM, EricP wrote:
    [snip]
    Where this approach could fail is the kind of laissez-faire
    sharing done by many apps, libraries, and OS's behind the
    scenes in the real world.

    Another possibility is for non-shared memory to be handled
    differently. (This is similar to My 66000's handling of memory
    types and things mentioned by Mitch Alsup here.)

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Various memory partitioning schemes theoretically can provide
    similar benefits for systems with shared memory controllers where
    programs do not share most modifiable data with other programs.
    Even something like a web hosting system might be able to benefit
    from lack of coherence (much less consistency) between different
    web hosts.

    But this is my point: in many programs there is no memory that
    you can point to and say it is always private to a single thread.
    And this is independent of language, its to do with program
    structure.

    You can say a certain memory range is shared and guarded by locks,
    or shared and managed by lock-free code.
    And we can point to this because the code in modularized this way.

    But the opposite of 'definitely shared' is not 'definitely private',
    it's 'dont know' or 'sometimes'.

    I see your point. I still think that a discipline could be
    enforced (above the hardware level) to avoid "laissez-faire
    sharing". However, not working until "software behaves properly"
    is not a very useful design choice, except possibly in early
    research efforts.

    Even without such, it would still be possible for hardware to have
    something like a coarse-grained snoop filter for the special cases
    of localized use. (I think something like this was being proposed
    here earlier.) Localization could include single thread/core and
    larger groups. Larger groups would be more simply provided by
    network topology locality, but one might want to spread threads to
    maximize cache availability so supporting conceptual/logical
    grouping and not just physical groupings might be desired.

    (Side comment: at least one IBM POWER implementation had a
    coherence state that indicated the cache block was only present in
    a physically local set of caches. I think this implementation used
    snooping, so this could significantly conserve interconnect
    bandwidth.)

    There might also be optimization opportunity for single-writer,
    multiple reader memory. Such optimizations might have very limited
    utility since such applications are not that common and most such
    applications might have other memory locations that have
    multiple writers. ("Single-writer" could also be applied to a
    broader locality that still reduces synchronization delay.)
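
    One well-known way to exploit single-writer/multiple-reader data is a
    seqlock; a minimal C11 sketch (my own illustration, not from the thread):
    readers never write the shared state, they just retry if the writer's
    sequence number changed underneath them.

        #include <stdatomic.h>

        static atomic_uint seq;              /* even = stable, odd = write in progress */
        static atomic_int  val_a, val_b;     /* relaxed atomics avoid a formal data race */

        /* The single writer. */
        void write_pair(int a, int b)
        {
            unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
            atomic_store_explicit(&seq, s + 1, memory_order_relaxed);   /* now odd */
            atomic_thread_fence(memory_order_release);
            atomic_store_explicit(&val_a, a, memory_order_relaxed);
            atomic_store_explicit(&val_b, b, memory_order_relaxed);
            atomic_store_explicit(&seq, s + 2, memory_order_release);   /* even again */
        }

        /* Any number of readers. */
        void read_pair(int *a, int *b)
        {
            unsigned s1, s2;
            do {
                s1 = atomic_load_explicit(&seq, memory_order_acquire);
                *a = atomic_load_explicit(&val_a, memory_order_relaxed);
                *b = atomic_load_explicit(&val_b, memory_order_relaxed);
                atomic_thread_fence(memory_order_acquire);
                s2 = atomic_load_explicit(&seq, memory_order_relaxed);
            } while ((s1 & 1) || s1 != s2);  /* retry if a write overlapped */
        }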

  • From Chris M. Thomasson@21:1/5 to EricP on Thu Nov 23 23:33:31 2023
    On 11/23/2023 10:52 AM, EricP wrote:
    MitchAlsup wrote:
    aph@littlepinkcloud.invalid wrote:

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    Yes, exactly, an MemBar sets a boundary where all younger memory
    references
    of one type (or both) must wait to become visible until all older memory
    references of the other Types have become visible. Only 1 or 2 bits of
    state
    per queued memory ref is altered by an explicit MemBar or an implicit
    barrier
    (from change in operating mode).

    Weak order requires a membar after a store to force the it into the
    cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    What if it is not-cacheable ?? or MMI/O ?? or configuration space ??

    In other words, to retire the membar instruction the core must force
    the
    prior store values into the coherent cache making them globally
    visible.

    Just Visible, the rest is not under SW control.

    There are two kinds of barriers/fences (I don't know if there are official terms for them), which are local bypass barriers, and global completion barriers.

    Wrt SPARC, #StoreLoad vs #LoadStore, well there's a difference in
    performance. So, how would a #StoreLoad behave in Mitch's nice system?
    Also, how would a #LoadStore act? All the same? I need to learn more
    about his arch. Interesting.


    Bypass barriers restrict which younger ops in the local load-store queue
    may bypass and start execution before older ops have made a value locally visible.

    Completion barriers block younger ops from starting execution before
    older ops have completed and read or written globally visible values.

    You appear to be referring to bypass barriers whereas I'm referring to completion barriers which require globally visible results.


  • From Chris M. Thomasson@21:1/5 to EricP on Thu Nov 23 23:35:46 2023
    On 11/23/2023 10:52 AM, EricP wrote:
    MitchAlsup wrote:
    aph@littlepinkcloud.invalid wrote:

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.

    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    Yes, exactly, an MemBar sets a boundary where all younger memory
    references
    of one type (or both) must wait to become visible until all older memory
    references of the other Types have become visible. Only 1 or 2 bits of
    state
    per queued memory ref is altered by an explicit MemBar or an implicit
    barrier
    (from change in operating mode).

    Weak order requires a membar after a store to force the it into the
    cache,
    triggering the coherence handshake which invalidates other copies,
    so that when remote cores reread a line they see the updated value.

    What if it is not-cacheable ?? or MMI/O ?? or configuration space ??

    In other words, to retire the membar instruction the core must force
    the
    prior store values into the coherent cache making them globally
    visible.

    Just Visible, the rest is not under SW control.

    There are two kinds of barriers/fences (I don't know if there are official terms for them), which are local bypass barriers, and global completion barriers.

    Bypass barriers restrict which younger ops in the local load-store queue
    may bypass and start execution before older ops have made a value locally visible.

    Completion barriers block younger ops from starting execution before
    older ops have completed and read or written globally visible values.

    You appear to be referring to bypass barriers whereas I'm referring to completion barriers which require globally visible results.


    Basically, how to map the various membar ops into an arch that can be
    RMO. Assume the programmers have no problem with it... ;^o SPARC did it,
    but, is it worth it now? Is my knowledge of dealing with relaxed
    systems, threads/processes and membars obsoleted? shit man... ;^o

  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Thu Nov 23 23:38:15 2023
    On 11/23/2023 8:33 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/19/2023 2:23 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    So, while it does not eliminate live/dead-lock situations, it allows SW
    to be constructed to avoid live/dead lock situations:: Why is a value
    which is provided when an ATOMIC event fails. 0 means success, negative
    values are spurious (buffer overflows,...) while positives represent
    the number of competing threads, so the following case, skips elements
    on a linked list to decrease future initerference.

    Element* getElement( unSigned  Key )
    {
         int count = 0;
         for( p = structure.head; p ; p = p->next )
         {
              if( p->Key == Key )
              {
                   if( count-- < 0 )
                   {
                        esmLOCK( p );
                        prev = p->prev;
                        esmLOCK( prev );
                        next = p->next;
                        esmLOCK( next );
                        if( !esmINTERFERENCE() )
                        {
                             p->prev = next;
                             p->next = prev;
                             p->prev = NULL;
                    esmLOCK( p->next = NULL );
                             return p;
                        }
                        else
                        {
                             count = esmWHY();
                             p = structure.head;
                        }
                   }
              }
         }
         return NULL;
    }

    Doing ATOMIC things like this means one can take the BigO( n^3 )
    activity
    that happens when a timer goes off and n threads all want access to the
    work queue, down to BigO( 3 ) yes=constant, but in practice it is
    reduced
    to BigO( ln( n ) ) when requesters arrive in random order at random
    time.

    I remember hearing from my friend Joe Seigh, who worked at IBM, that
    they had some sort of logic that would prevent live lock in a
    compare and swap wrt their free pool manipulation logic. Iirc, it
    was somewhat related to ABA, hard to remember right now, sorry. I
    need to find that old thread back in comp.programming.threads.

    Depending on system size: there can be several system function units
    that grant "order" for ATOMIC events. These are useful for 64+node
    systems
    and unnecessary for less than 8-node systems. Disjoint memory spaces
    can use independent ATOMIC arbiters and whether they are in use or
    not is
    invisible to SW.


    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    Humm, you arch seems pretty neat/interesting to me. I need to learn
    more about it. Can it be abused with a rouge thread that keeps
    altering a cacheline(s) that are participating in the atomic block, so
    to speak? Anyway, I am busy with family time. Will get back to you.

    While possible, it is a lot less likely than on a similar architecture without any of the bells and whistles.

    Iirc, Joe Seigh mentioned how CS on IBM systems would prevent live lock
    by locking a bus or asserting a signal that would ensure that a
    compare-and-swap would never get into a death spiral of always failing.
    Iirc, Microsoft has something like this in its lock-free stack SList or
    something; cannot remember exactly right now. Sorry.

    Joe Seigh mentioned internal docs, candy striped.


    Fwiw, here is some of my work:

    https://youtu.be/HwIkk9zENcg

    Octopi in a box playing ball.

  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Thu Nov 23 23:41:44 2023
    On 11/23/2023 11:38 PM, Chris M. Thomasson wrote:
    On 11/23/2023 8:33 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/19/2023 2:23 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    So, while it does not eliminate live/dead-lock situations, it allows SW >>>> to be constructed to avoid live/dead lock situations:: Why is a value
    which is provided when an ATOMIC event fails. 0 means success, negative >>>> values are spurious (buffer overflows,...) while positives represent
    the number of competing threads, so the following case, skips elements >>>> on a linked list to decrease future initerference.

    Element* getElement( unSigned  Key )
    {
         int count = 0;
         for( p = structure.head; p ; p = p->next )
         {
              if( p->Key == Key )
              {
                   if( count-- < 0 )
                   {
                        esmLOCK( p );
                        prev = p->prev;
                        esmLOCK( prev );
                        next = p->next;
                        esmLOCK( next );
                        if( !esmINTERFERENCE() )
                        {
                             p->prev = next;
                             p->next = prev;
                             p->prev = NULL;
                    esmLOCK( p->next = NULL );
                             return p;
                        }
                        else
                        {
                             count = esmWHY();
                             p = structure.head;
                        }
                   }
              }
         }
         return NULL;
    }

    Doing ATOMIC things like this means one can take the BigO( n^3 )
    activity
    that happens when a timer goes off and n threads all want access to the >>>> work queue, down to BigO( 3 ) yes=constant, but in practice it is
    reduced
    to BigO( ln( n ) ) when requesters arrive in random order at random
    time.

    I remember hearing from my friend Joe Seigh, who worked at IBM,
    that they had some sort of logic that would prevent live lock in a
    compare and swap wrt their free pool manipulation logic. Iirc, it
    was somewhat related to ABA, hard to remember right now, sorry. I
    need to find that old thread back in comp.programming.threads.

    Depending on system size: there can be several system function units
    that grant "order" for ATOMIC events. These are useful for 64+node
    systems
    and unnecessary for less than 8-node systems. Disjoint memory spaces
    can use independent ATOMIC arbiters and whether they are in use or
    not is
    invisible to SW.


    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    Humm, you arch seems pretty neat/interesting to me. I need to learn
    more about it. Can it be abused with a rouge thread that keeps
    altering a cacheline(s) that are participating in the atomic block,
    so to speak? Anyway, I am busy with family time. Will get back to you.

    While possible, it is a lot less likely than on a similar architecture
    without any of the bells and whistles.

    Iirc, Jow Seigh mentioned how CS on IBM systems would prevent live lock
    by locking a bus or asserting a signal, that would ensure that a compare-and-swap would never get into death spiral of always failing.
    Iirc, Microsoft has something like this in its lock-free stack SList or something, Cannot remember exactly right now. Sorry.


    Wrt Microsoft's SList, I think it goes into the kernel to handle memory reclamation issues, ABA, and such...


    Joe Seigh mentioned internal docs, candy striped.


    Fwiw, here is some of my work:

    https://youtu.be/HwIkk9zENcg

    Octopi in a box playing ball.


  • From Chris M. Thomasson@21:1/5 to Paul A. Clayton on Thu Nov 23 23:44:27 2023
    On 11/23/2023 1:36 PM, Paul A. Clayton wrote:
    On 11/20/23 12:26 PM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 11/13/23 1:22 PM, EricP wrote:
    Kent Dickey wrote:
    [snip]
    Thus, the people trapped in Relaxed Ordering Hell then push
    weird schemes
    on everyone else to try to come up with algorithms which need fewer
    barriers.  It's crazy.

    Relaxed Ordering is a mistake.

    Kent

    I suggest something different: the ability to switch between TSO and
    relaxed with non-privileged user mode instructions.

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Why do you think that 'stack' would be thread private?  It's
    quite common to allocate long-lived data structures on the
    stack and pass the address of the object to code that may
    be executing in the context of other threads.  So long as the
    lifetime of the object extends beyond the last reference, of course.

    Objects allocated on the stack in 'main', for instance.

    I thought (in my ignorance) that because the lifetime of stack-allocated
    data is controlled by the thread (call depth) that allocated that data,
    it would not be thread safe. Clearly if one could guarantee that the
    allocating thread did not reduce its call depth to below that frame, it
    would be safe.

    (An allocation made in an initialization stage could generally be
    easy to guarantee not to be deallocated early — only being
    deallocated on program end by the OS — but at that point the early
    stack is in some sense not the same stack as for later uses.)

    I knew that pointers were allocated for arrays and objects on the
    stack and passed to a called function. That feels a bit icky to me
    in that I conceive of a stack frame as being just for that
    function, i.e., my mental model does not reflect practice. (Even
    allocating an array for internal use in a stack frame is
    inconsistent with my mental model which exclusively references off
    the stack pointer with immediate offsets.)

    I do not know how difficult it would be to establish a compilation
    system that did not use "the" stack for such allocations.
    Allocations to the stack have the advantage of simple management
    (just adjust the stack pointer) with the constraint of simple
    timing of allocation and free and the advantage of never missing
    cache for the pointer. Providing a modest prefetched-on-context-
    switch cache (like a Knapsack Cache proposed by Todd Austin for
    reduced latency) would allow multiple stacks/regions to have such
    benefits by placing the next allocation pointers there.

    Effectively extending the register set in that way could have
    other advantages. (Context switch overhead would increase.)
    Providing an "L2 register" storage seems to have some attraction.



    I did a memory allocator that was completely based on thread stacks.
    Iirc, it was on Quadros. A long time ago!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Paul A. Clayton on Thu Nov 23 23:56:19 2023
    On 11/23/2023 2:57 PM, Paul A. Clayton wrote:
    On 11/20/23 1:26 PM, EricP wrote:
    Paul A. Clayton wrote:
    On 11/13/23 1:22 PM, EricP wrote:
    [snip]
    Where this approach could fail is the kind of laissez-faire sharing
    done by many apps, libraries, and OS's behind the scenes in the real
    world.

    Another possibility is for non-shared memory to be handled
    differently. (This is similar to My 66000's handling of memory
    types and things mentioned by Mitch Alsup here.)

    Even with a multithreaded program, stack and TLS would be "thread
    private" and not require the same consistency guarantees.

    Various memory partitioning schemes theoretically can provide
    similar benefits for systems with shared memory controllers where
    programs do not share most modifiable data with other programs.
    Even something like a web hosting system might be able to benefit
    from lack of coherence (much less consistency) between different
    web hosts.

    But this is my point: in many programs there is no memory that
    you can point to and say it is always private to a single thread.
    And this is independent of language; it's to do with program structure.

    You can say a certain memory range is shared and guarded by locks,
    or shared and managed by lock-free code.
    And we can point to this because the code is modularized this way.

    But the opposite of 'definitely shared' is not 'definitely private',
    it's 'don't know' or 'sometimes'.

    I see your point. I still think that a discipline could be
    enforced (above the hardware level) to avoid "laissez-faire
    sharing". However, not working until "software behaves properly"
    is not a very useful design choice, except possibly in early
    research efforts.

    Even without such, it would still be possible for hardware to have
    something like a coarse-grained snoop filter for the special cases
    of localized use. (I think something like this was being proposed
    here earlier.) Localization could include single thread/core and
    larger groups. Larger groups would be more simply provided by
    network topology locality, but one might want to spread threads to
    maximize cache availability so supporting conceptual/logical
    grouping and not just physical groupings might be desired.

    (Side comment: at least one IBM POWER implementation had a
    coherence state that indicated the cache block was only present in
    a physically local set of caches. I think this implementation used
    snooping, so this could significantly conserve interconnect
    bandwidth.)

    There might also be optimization opportunity for single-writer,
    multiple reader memory.

    Side note... Actually, there is a reason to create specialized queues
    for communications between threads. Basically, it goes like:

    single-producer/single-consumer
    multi-producer/single-consumer
    single-producer/multi-consumer
    multi-producer/multi-consumer

    Each one has a specific algorithm, specialized for its requirements.
    Single-producer/single-consumer is very fast, and iirc does not even
    need #StoreLoad or #LoadStore; #LoadLoad and #StoreStore can work even
    in RMO mode of a SPARC.
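
    A minimal sketch of such a single-producer/single-consumer ring in
    C++11 atomics (the type, names, and power-of-2 size are made up for
    illustration); the point is that only acquire/release pairs are used,
    no #StoreLoad:
    ______________________________________________
    #include <atomic>
    #include <cstddef>

    // Fixed-size SPSC ring: exactly one thread calls push(),
    // exactly one other thread calls pop().
    template<typename T, std::size_t N>          // N must be a power of 2
    struct spsc_ring
    {
        T m_slots[N];
        std::atomic<std::size_t> m_head{0};      // written only by producer
        std::atomic<std::size_t> m_tail{0};      // written only by consumer

        bool push(const T& v)
        {
            std::size_t head = m_head.load(std::memory_order_relaxed);
            if (head - m_tail.load(std::memory_order_acquire) == N)
                return false;                    // full
            m_slots[head & (N - 1)] = v;
            // release: the slot write becomes visible before the new head
            m_head.store(head + 1, std::memory_order_release);
            return true;
        }

        bool pop(T& out)
        {
            std::size_t tail = m_tail.load(std::memory_order_relaxed);
            if (m_head.load(std::memory_order_acquire) == tail)
                return false;                    // empty
            out = m_slots[tail & (N - 1)];
            // release: the slot read completes before the new tail
            m_tail.store(tail + 1, std::memory_order_release);
            return true;
        }
    };
    ______________________________________________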

    Such optimizations might have very limited utility since such
    applications are not that common, and most such applications might
    have other memory locations that have multiple writers.
    ("Single-writer" could also be applied to a broader locality that
    still reduces synchronization delay.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Fri Nov 24 00:40:32 2023
    On 11/24/2023 12:36 AM, Chris M. Thomasson wrote:
    On 11/15/2023 1:25 PM, Chris M. Thomasson wrote:
    On 11/15/2023 1:09 PM, Kent Dickey wrote:
    In article <uiu4t5$t4c2$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 9:54 PM, Kent Dickey wrote:
    In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
    MitchAlsup <mitchalsup@aol.com> wrote:
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain
    workloads.

    I know a lot of people believe that statement to be true.  In
    general, it
    is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a
    performance advantage
    when the code is single threaded.

    I believe a Relaxed Memory model provides a small performance
    improvement ONLY to simple in-order CPUs in an MP system (if you're a
    single CPU, there's nothing to order).

    Relaxed Memory ordering provides approximately zero performance
    improvement to an OoO CPU, and in fact, might actually lower
    performance (depends on how barriers are done--if done poorly, it
    could be a big negative).

    Yes, the system designers of the world have said: let's slow down our
    fastest, most expensive, most profitable CPUs, so we can speed up our
    cheapest, lowest profit CPUs a few percent, and push a ton of work
    onto software developers.

    It's crazy.

    I believe that statement to be false.  Can you describe some of these
    workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are
    accessing data.

    Probably need to clarify with "accessing modified data".

    Kent

    Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
    exist? I must be misunderstanding your point here. Sorry if I am. ;^o

    You have internalized weakly ordered memory, and you're having trouble
    seeing beyond it.

    Really? Don't project yourself on me. Altering all of the memory
    barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.



    CPUs with weakly ordered memory are the ones that need all those flags.
    Yes, you need the flags if you want to use those CPUs.  I'm pointing out:
    we could all just require better memory ordering and get rid of all this
    cruft.  Give the flag, don't give the flag, the program is still correct
    and works properly.

    Huh? Just cruft? wow. Just because it seems hard for you does not mean
    we should eliminate it. Believe it or not there are people out there
    that know how to use memory barriers. I suppose you would use seq_cst
    to load each node of a lock-free stack iteration in an RCU read-side
    region. This is terrible! Really bad, bad, BAD! Afaict, it kinda,
    sorta, seems like you do not have all that much experience with them.
    Humm...



    It's like FP denorms--it's generally been decided the hardware cost
    to implement it is small, so hardware needs to support it at full speed.
    No need to write code in a careful way to avoid denorms, to use funky
    CPU-specific calls to turn on flush-to-0, etc., it just works, we move
    on to other topics.  But we still have flush-to-0 calls available--but
    you don't need to bother to use them.  In my opinion, memory ordering
    is much more complex for programmers to handle.  I maintain it's
    actually so complex most people cannot get it right in software for
    non-trivial interactions.  I've found many hardware designers have a
    very hard time reasoning about this as well when I report bugs (since
    the rules are so complex and poorly described).  There are over 100
    pages describing memory ordering in the Arm Architectural Reference
    Manual, and it is very complex (Dependency through registers and
    memory; Basic Dependency; Address Dependency; Data Dependency; Control
    Dependency; Pick Basic dependency; Pick Address Dependency; Pick Data
    Dependency; Pick Control Dependency, Pick Dependency...and this is
    just from the definition of terms).  It's all very abstract and
    difficult to follow.  I'll be honest: I do not understand all of these
    rules, and I don't care to.  I know how to implement a CPU, so I know
    what they've done, and that's much simpler to understand.  But writing
    a threaded application is much more complex than it should be for
    software.

    The cost to do TSO is some out-of-order tracking structures need to
    get a little bigger, and some instructions have to stay in queues
    longer (which is why they may need to get bigger), and allow
    re-issuing loads which now have stale data.  The difference between
    TSO and Sequential Consistency is to just disallow loads seeing stores
    queued before they write to the data cache (well, you can
    speculatively let loads happen, but you need to be able to walk it
    back, which is not difficult).  This is why I say the performance cost
    is low--normal code missing caches and not being pestered by other
    CPUs can run at the same speed.  But when other CPUs begin pestering
    us, the interference can all be worked out as efficiently as possible
    using hardware, and barriers just do not compete.

    Having access to fine grain memory barriers is a very good thing. Of
    course we can use C++ right now and make everything seq_cst, but that
    is moronic. Why would you want to use seq_cst everywhere when you do
    not have to? There are rather massive performance implications.

    Are you thinking about a magic arch that we cannot use right now?

    The problem with using seq_cst all over the place on an x86 is that it
    would need to use LOCK'ed RMW, even dummy ones, or MFENCE to get the
    ordering right.

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say
    word(s) because of cmpxchg8b on a 32-bit system, or cmpxchg16b on a
    64-bit system. Basically a DWCAS, or double-word compare-and-swap,
    where it works with two contiguous words. This is different than a
    DCAS that can work with two non-contiguous words.
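
    For illustration, a minimal sketch of a contiguous double-width CAS
    from C++ (the struct and function names are invented for this post;
    whether the compiler actually emits cmpxchg16b for a 16-byte type
    depends on the target and on options such as -mcx16 / libatomic):
    ______________________________________________
    #include <atomic>
    #include <cstdint>

    // Two contiguous words updated as one unit (DWCAS-style).
    struct alignas(16) tagged_ptr
    {
        void*    ptr;   // the pointer being protected
        uint64_t tag;   // version counter, bumped on every update
    };

    std::atomic<tagged_ptr> g_top{ tagged_ptr{ nullptr, 0 } };

    bool try_swap(void* expected_ptr, void* new_ptr)
    {
        tagged_ptr expected = g_top.load(std::memory_order_acquire);
        if (expected.ptr != expected_ptr)
            return false;
        tagged_ptr desired{ new_ptr, expected.tag + 1 };
        // Both contiguous words (ptr and tag) must match for the swap
        // to succeed.
        return g_top.compare_exchange_strong(expected, desired,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire);
    }
    ______________________________________________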

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Fri Nov 24 00:36:57 2023
    On 11/15/2023 1:25 PM, Chris M. Thomasson wrote:
    On 11/15/2023 1:09 PM, Kent Dickey wrote:
    In article <uiu4t5$t4c2$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2023 9:54 PM, Kent Dickey wrote:
    In article <82b3b3b710652e607dac6cec2064c90b@news.novabbs.com>,
    MitchAlsup <mitchalsup@aol.com> wrote:
    Kent Dickey wrote:

    In article <uiri0a$85mp$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    A highly relaxed memory model can be beneficial for certain
    workloads.

    I know a lot of people believe that statement to be true.  In
    general, it
    is assumed to be true without proof.
    <
    In its most general case, relaxed order only provides a performance
    advantage
    when the code is single threaded.

    I believe a Relaxed Memory model provides a small performance
    improvement
    ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
    there's nothing to order).

    Relaxed Memory ordering provides approximately zero performance
    improvement
    to an OoO CPU, and in fact, might actually lower performance
    (depends on
    how barriers are done--if done poorly, it could be a big negative).

    Yes, the system designers of the world have said: let's slow down our
    fastest, most expensive, most profitable CPUs, so we can speed up our
    cheapest, lowest profit CPUs a few percent, and push a ton of work
    onto software developers.

    It's crazy.

    I believe that statement to be false.  Can you describe some of these
    workloads?
    <
    Relaxed memory order fails spectacularly when multiple threads are
    accessing
    data.

    Probably need to clarify with "accessing modified data".

    Kent

    Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
    exist? I must be misunderstanding your point here. Sorry if I am. ;^o

    You have internalized weakly ordered memory, and you're having trouble
    seeing beyond it.

    Really? Don't project yourself on me. Altering all of the memory
    barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.



    CPUs with weakly ordered memory are the ones that need all those flags.
    Yes, you need the flags if you want to use those CPUs.  I'm pointing out:
    we could all just require better memory ordering and get rid of all this
    cruft.  Give the flag, don't give the flag, the program is still correct
    and works properly.

    Huh? Just cruft? wow. Just because it seems hard for you does not mean
    we should eliminate it. Believe it or not there are people out there
    that know how to use memory barriers. I suppose you would use seq_cst to
    load each node of a lock-free stack iteration in an RCU read-side region.
    This is terrible! Really bad, bad, BAD! Afaict, it kinda, sorta, seems
    like you do not have all that much experience with them. Humm...



    It's like FP denorms--it's generally been decided the hardware cost
    to implement it is small, so hardware needs to support it at full speed.
    No need to write code in a careful way to avoid denorms, to use funky
    CPU-
    specific calls to turn on flush-to-0, etc., it just works, we move on to
    other topics.  But we still have flush-to-0 calls available--but you
    don't
    need to bother to use them.  In my opinion, memory ordering is much more
    complex for programmers to handle.  I maintain it's actually so
    complex most people cannot get it right in software for non-trivial
    interactions.  I've found many hardware designers have a very hard time
    reasoning about this as well when I report bugs (since the rules are so
    complex and poorly described).  There are over 100 pages describing
    memory
    ordering in the Arm Architectural Reference Manual, and it is very
    complex (Dependency through registers and memory; Basic Dependency;
    Address Dependency; Data Dependency; Control Dependency; Pick Basic
    dependency; Pick Address Dependency; Pick Data Dependency; Pick
    Control Dependency, Pick Dependency...and this is just from the
    definition
    of terms).  It's all very abstract and difficult to follow.  I'll be
    honest: I do not understand all of these rules, and I don't care to.
    I know how to implement a CPU, so I know what they've done, and that's
    much simpler to understand.  But writing a threaded application is much
    more complex than it should be for software.

    The cost to do TSO is some out-of-order tracking structures need to get
    a little bigger, and some instructions have to stay in queues longer
    (which is why they may need to get bigger), and allow re-issuing loads
    which now have stale data.  The difference between TSO and Sequential
    Consistency is to just disallow loads seeing stores queued before they
    write to the data cache (well, you can speculatively let loads happen,
    but you need to be able to walk it back, which is not difficult).  This
    is why I say the performance cost is low--normal code missing caches and
    not being pestered by other CPUs can run at the same speed.  But when
    other CPUs begin pestering us, the interference can all be worked out as
    efficiently as possible using hardware, and barriers just do not
    compete.

    Having access to fine grain memory barriers is a very good thing. Of
    course we can use C++ right now and make everything seq_cst, but that
    is moronic. Why would you want to use seq_cst everywhere when you do
    not have to? There are rather massive performance implications.

    Are you thinking about a magic arch that we cannot use right now?

    The problem with using seq_cst all over the place on an x86 is that it
    would need to use LOCK'ed RMW, even dummy ones, or MFENCE to get the
    ordering right.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to EricP on Fri Nov 24 10:12:01 2023
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.
    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    I say that because that is what the Intel manual says for MFENCE and
    SFENCE, "The processor ensures that every store prior to SFENCE is
    globally visible before any store after SFENCE becomes globally
    visible"

    Well, hold on. The strongest barrier anyone needs for synchronization
    up to sequential consistency on Intel, at least in user mode, is a
    LOCK XCHG.

    and Arm64 for DSB "ensures that memory accesses that occur before
    the DSB instruction have completed before the completion of the DSB instruction". Perhaps you are thinking of Intel LFENCE and Arm64 DMB
    which are weaker and do not require the prior operations to complete
    but just that they are performed or observed. These appear to only
    look at the local LSQ.

    Of course. DMB is what programmers actually use, at least in user
    space. DSB is used rarely. You do need a DSB to flush the cache to the
    point of coherence, which you might need for persistent memory, and
    you need it to flush dcache->icache on some Arm designs. But you don't
    need anything more than DMB for lock-free algorithms.
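
    As a small illustration of code that needs only DMB-class ordering,
    here is the usual release/acquire message-passing idiom in C++ (names
    invented for this sketch; on AArch64 the release store and acquire
    load typically compile to STLR/LDAR or a DMB-bracketed sequence,
    never a DSB):

        #include <atomic>

        int payload;                        // ordinary data
        std::atomic<bool> ready{false};     // hand-off flag

        void producer_thread()
        {
            payload = 42;                                   // plain store
            ready.store(true, std::memory_order_release);   // payload before flag
        }

        void consumer_thread()
        {
            while (!ready.load(std::memory_order_acquire))  // flag before payload
                ;                                           // spin (fine for a sketch)
            // payload is guaranteed to be 42 here.
        }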

    WRT the draining for synchronization, I'm not sure what you think
    the difference is between making it appear to be done and actually
    doing so. Sure there may be some optimizations possible, but the net
    effect on the externally observable cache state must be the same.

    As I wrote, you don't have to actually do the synchronization dance if
    it's not needed. You can speculate that there won't be a conflict,
    detect any memory-ordering violation, and roll back and replay.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Nov 24 18:32:34 2023
    Chris M. Thomasson wrote:

    On 11/23/2023 10:52 AM, EricP wrote:
    MitchAlsup wrote:

    There are two kinds of barriers/fences (I don't know if there are
    official terms for them), which are local bypass barriers and global
    completion barriers.

    Bypass barriers restrict which younger ops in the local load-store queue
    may bypass and start execution before older ops have made a value locally
    visible.

    Completion barriers block younger ops from starting execution before
    older ops have completed and read or written globally visible values.

    You appear to be referring to bypass barriers whereas I'm referring to
    completion barriers which require globally visible results.


    Basically, how to map the various membar ops into an arch that can be
    RMO. Assume the programmers have no problem with it... ;^o SPARC did it,
    but, is it worth it now? Is my knowledge of dealing with relaxed
    systems, threads/processes and membars obsoleted? shit man... ;^o

    If the page being mapped is properly identified in the PTE, then there is
    no reason to need any MemBars.

    Also Note:: MemBars are the WRONG abstraction--a MemBar is like a wall
    whereas what programmers want is a bridge. As long as you are on the
    bridge (inside an ATOMIC event) you want one memory model and when you
    leave the bridge you are free to use a more performant model. MemBars
    only demark the edges of the bridge, they don't cover the whole bridge.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Nov 24 18:42:25 2023
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say word's
    is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap where it
    works with two contiguous words. This is different that a DCAS that can
    work with two non-contiguous words.


    The latter is generally known as DCADS: Double Compare And Double Swap.
    I did see some academic literature a decade ago wanting TCADS,
    triple compare and double swap.

    It is for stuff like this that I invented esm, so SW writers can
    program up any number of compares and any width of swapping. This
    means ISA does not have to change in the future wrt synchronization
    capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is that the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW,
    while ATOMIC events ending with xCAxS only provide the notion that
    one is in an ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address, while
    xCAxS can only perform the interference detection based on the data
    at that address.....
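
    To illustrate the difference in C++ terms, here is a sketch of the
    classic lock-free stack pop (the node type and its reuse are invented
    for this example). An address-based scheme (LL/SC, or an esm event)
    would abort if the location was written at all in the meantime; a
    data-based compare-exchange only checks that the value still matches,
    so a pop/push of the same node in between (ABA) slips through:

        #include <atomic>

        struct node { node* next; };

        std::atomic<node*> top{nullptr};

        node* pop()
        {
            node* old_top = top.load(std::memory_order_acquire);
            while (old_top)
            {
                node* next = old_top->next;
                // Between the load above and the CAS below, another thread
                // may pop old_top, pop/push other nodes, and push old_top
                // back.  The CAS still succeeds because the *data* (the
                // pointer value) matches, even though 'next' may be stale --
                // the ABA problem.  LL(top) ... SC(top) would have failed,
                // because the *address* was written in the meantime.
                if (top.compare_exchange_weak(old_top, next,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire))
                    return old_top;
                // on failure old_top was reloaded; retry
            }
            return nullptr;
        }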

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to aph@littlepinkcloud.invalid on Fri Nov 24 16:02:05 2023
    On 11/24/2023 2:12 AM, aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    aph@littlepinkcloud.invalid wrote:
    Barriers only control the ordering between accesses, not when they
    become visible, and here there's only one access. If there are at
    least two, and you really need to see one before the other, then
    you need a barrier.
    The barriers also ensure the various local buffers, pipelines and
    inbound and outbound comms command and reply message queues are
    drained.

    I'm surprised you say that. All they have to do is make it appear as
    if this has been done. How it actually happens is up to the hardware
    designer. In modern GBOOO CPUs, most barriers don't require everything
    to be pushed to cache. (As I understand it, GBOOO designs can treat
    the set of pending accesses as a sort of memory transaction, detect
    conflicts, and roll back to the last point of coherence and replay
    in-order. But I am not a hardware designer.)

    I say that because that is what the Intel manual says for MFENCE and
    SFENCE, "The processor ensures that every store prior to SFENCE is
    globally visible before any store after SFENCE becomes globally
    visible"

    Well, hold on. The strongest barrier anyone needs for synchronization
    up to sequential consistency on Intel, at least in user mode, is a
    LOCK XCHG.

    Side note, XCHG all by itself automatically implies a LOCK prefix... :^)



    and Arm64 for DSB "ensures that memory accesses that occur before
    the DSB instruction have completed before the completion of the DSB
    instruction". Perhaps you are thinking of Intel LFENCE and Arm64 DMB
    which are weaker and do not require the prior operations to complete
    but just that they are performed or observed. These appear to only
    look at the local LSQ.

    Of course. DMB is what programmers actually use, at least in user
    space. DSB is used rarely. You do need a DSB to flush the cache to the
    point of coherence, which you might need for persistent memory, and
    you need it to flush dcache->icache on some Arm designs. But you don't
    need anything more than DMB for lock-free algorithms.

    WRT the draining for synchronization, I'm not sure what you think
    the difference is between making it appear to be done and actually
    doing so. Sure there may be some optimizations possible, but the net
    effect on the externally observable cache state must be the same.

    As I wrote, you don't have to actually do the synchronization dance if
    it's not needed. You can speculate that there won't be a conflict,
    detect any memory-ordering violation, and roll back and replay.

    Adrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 00:02:15 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say word's
    is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap where it
    works with two contiguous words. This is different that a DCAS that can
    work with two non-contiguous words.


    The later is generally known as DCADS Double compare and Double swap.
    I did see some academic literature a decade ago in wanting TCADS
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization >capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is the the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large processor counts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 25 01:07:38 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say word's
    is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap where it
    works with two contiguous words. This is different that a DCAS that can
    work with two non-contiguous words.


    The later is generally known as DCADS Double compare and Double swap.
    I did see some academic literature a decade ago in wanting TCADS
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization >>capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is the the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container
    is used !! And it does not matter a whole lot if the underpinnings are LL/SC
    or 1CA1S.

    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Scott Lurndal on Fri Nov 24 16:19:40 2023
    On 11/24/2023 4:02 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say word's
    is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap where it
    works with two contiguous words. This is different that a DCAS that can
    work with two non-contiguous words.


    The later is generally known as DCADS Double compare and Double swap.
    I did see some academic literature a decade ago in wanting TCADS
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization
    capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is the the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large processor counts.

    It's rather touchy, and optimistic. LL/SC needs to work with isolated
    memory locations. The reservation granule has to be honored; false
    sharing on it is really BAD!!! Also, it helps if LL/SC can work with
    L2-cache-aligned memory; false sharing between unrelated cache lines
    is bad in and of itself, and straddling a cache line is hyper bad...
    ___________________________
    struct l2_cache_line_node
    {
        std::atomic<l2_cache_line_node*> m_next;
    };

    struct reservation_granule_node
    {
        std::atomic<reservation_granule_node*> m_next;

        // the head of a lock-free stack...
        std::atomic<l2_cache_line_node*> m_head;
    };
    ___________________________


    Rules:
    ___________________________
    l2_cache_line_node's are always aligned on L2 cache line boundaries,
    and padded up to the size of an L2 cache line.

    reservation_granule_node's are always aligned on reservation granule
    boundaries, and padded up to the size of a reservation granule.

    Avoid false sharing!
    ___________________________

    That helps LL/SC scale better, so to speak.
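
    One hedged way to spell those rules out in C++17 (the granule size is
    a made-up constant here; std::hardware_destructive_interference_size
    stands in for the L2 line size, and real code should use the target's
    documented values):
    ___________________________
    #include <atomic>
    #include <cstddef>
    #include <new>   // std::hardware_destructive_interference_size

    // Assumed sizes for this sketch only -- check the target's manuals.
    constexpr std::size_t kLineSize    = std::hardware_destructive_interference_size;
    constexpr std::size_t kGranuleSize = 128;  // reservation granule (assumption)

    // alignas on the struct both aligns instances and rounds sizeof() up,
    // so each node owns its whole line / granule (no false sharing).
    struct alignas(kLineSize) l2_cache_line_node
    {
        std::atomic<l2_cache_line_node*> m_next{nullptr};
    };

    struct alignas(kGranuleSize) reservation_granule_node
    {
        std::atomic<reservation_granule_node*> m_next{nullptr};

        // the head of a lock-free stack...
        std::atomic<l2_cache_line_node*> m_head{nullptr};
    };

    static_assert(sizeof(l2_cache_line_node)       % kLineSize    == 0, "padded to a line");
    static_assert(sizeof(reservation_granule_node) % kGranuleSize == 0, "padded to a granule");
    ___________________________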

    However, imvvvho, I prefer pessimistic RMW's wrt LOCK CMPXCHG on Intel
    vs emulating CAS using LL/SC on a PPC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Chris M. Thomasson on Sat Nov 25 10:44:44 2023
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    Side note, XCHG all by itself automatically implies a LOCK prefix... :^)

    Yeah, I remember reading that once, but I still see LOCK prefixes
    being used. I guess a LOCK doesn't hurt, and people want to be on the
    safe side, or something. Ah, programmers... :-)

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to aph@littlepinkcloud.invalid on Sat Nov 25 12:00:07 2023
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    Side note, XCHG all by itself automatically implies a LOCK prefix... :^)

    Yeah, I remember reading that once, but I still see LOCK prefxies
    being used. I guess a LOCK doesn't hurt, and people want to be on the
    safe side, or something. Ah, programmers... :-)


    It is only the XCHG reg,[mem] forms which imply a LOCK!

    XCHG reg,reg has no bus transfers and does not assert LOCK.

    NOP uses the opcode for XCHG AX,AX and that is how it was implemented on
    the original 8086, before pipelined/superscalar implementations needed a
    way to avoid false dependencies on (E)AX.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Sat Nov 25 11:54:13 2023
    MitchAlsup wrote:
    Scott Lurndal wrote:
    The fatal problem with ll/sc is scaling.  They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container is used !! And it does not matter a whole lot if the underpinnings are
    LL/SC
    or 1CA1S.

    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    And this is why XADD is my favorite building block when writing
    multi-threaded code.

    I used this as the only primitive in the NTPD server version I optimized
    for Meinberg's beefiest server, a 1U 4-core box with multiple ethernet
    ports, a timing gps receiver and a very stable local oscillator.

    The model I used was so simple that it was easy to show that it would be correct, have close to zero contention, and guarantee forward progress.

    This was made much easier because I was also forced to handle all
    incoming packets/requests in a single thread, so I decided to use N-1
    single writer/single reader queues in a round robin fashion, but XADD
    tends to work acceptably well even in scenarios with much more potential
    for contention.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 17:05:07 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in >>>> an interesting way where there is a word(s) being updated. I say word's >>>> is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap where it
    works with two contiguous words. This is different that a DCAS that can >>>> work with two non-contiguous words.


    The later is generally known as DCADS Double compare and Double swap.
    I did see some academic literature a decade ago in wanting TCADS
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization >>>capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is the the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container >is used !! And it does not matter a whole lot if the underpinnings are LL/SC >or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided.  There still seems to be a dearth
    of programmers who understand multithreaded programming.


    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    Atomics are often supported directly by the shared LLC as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 25 19:49:53 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in >>>>> an interesting way where there is a word(s) being updated. I say word's >>>>> is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit >>>>> system. Basically a DWCAS, or double-word compare-and-swap where it
    works with two contiguous words. This is different that a DCAS that can >>>>> work with two non-contiguous words.


    The later is generally known as DCADS Double compare and Double swap.
    I did see some academic literature a decade ago in wanting TCADS
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization >>>>capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is the the LL denotes the beginning of an ATOMIC >>>>event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container >>is used !! And it does not matter a whole lot if the underpinnings are LL/SC >>or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided. There still seem to be a dearth
    of programmers that understand multithreaded programming.


    One can also take the PoV that all currently known CPU driven ATOMICs >>suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    On typical interconnects, it is only cubically not exponentially:
    Each processor agrees that one of them made forward progress after
    a quadratic amount of memory requests and a cubic amount of memory
    responses (including SNOOP requests).

    Atomics are often supported directly by the shared LLC as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 20:09:26 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in >>>>>> an interesting way where there is a word(s) being updated. I say word's >>>>>> is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit >>>>>> system. Basically a DWCAS, or double-word compare-and-swap where it >>>>>> works with two contiguous words. This is different that a DCAS that can >>>>>> work with two non-contiguous words.


    The later is generally known as DCADS Double compare and Double swap. >>>>>I did see some academic literature a decade ago in wanting TCADS >>>>>triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can >>>>>program up any number of compares and and width of swapping. This >>>>>means ISA does not have to change in the future wrt synchronization >>>>>capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is the the LL denotes the beginning of an ATOMIC >>>>>event and SC denotes the end. LL/SC provide a bridge model to SW >>>>>while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while; >>>>>xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large >>>> processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container >>>is used !! And it does not matter a whole lot if the underpinnings are LL/SC >>>or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided. There still seem to be a dearth
    of programmers that understand multithreaded programming.


    One can also take the PoV that all currently known CPU driven ATOMICs >>>suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    On typical interconnects, it is only cubically not exponentially:

    Isn't 'cubically' 'exponentially' when the exponent is 3?

    Then one must also factor in NUMA latencies (we had a NUMA system
    where worst case round trip latency between nodes was
    800ns over InfiniBand). Broadcast probes (HT) between NUMA nodes
    were contraindicated, so a directory was added.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 25 20:47:45 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:


    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    On typical interconnects, it is only cubically not exponentially:

    Isn't 'cubically' 'exponentially' when the exponent is 3?

    No:: All functions of a variable raised to a constant exponent fall
    into the category of Polynomials, whereas the very first constant
    raised to a variable exponent falls into the Exponential category.
    They are <numerically> a different species of function (and a pet peeve).

    This is numerical analysis 101

    A function which is Exponentially hard is vastly harder than a function
    which is polynomial hard. In order to represent an Exponential function
    with a Polynomial, one needs an infinite set of terms (or domain limits).
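
    In symbols (standard asymptotics, nothing specific to this thread):

        \text{polynomial: } f(n) = n^k,\ k \text{ constant}
        \qquad
        \text{exponential: } g(n) = c^n,\ c > 1 \text{ constant}

        \lim_{n \to \infty} \frac{n^k}{c^n} = 0
        \quad \text{for every fixed } k \ge 0,\ c > 1

    so no polynomial (n^3 included) ever catches an exponential such as 3^n.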

    Then one must also factor in NUMA latencies (we had a NUMA system
    where worst case round trip latency between nodes was
    800ns over Inifiniband). Broadcast probes (HT) between NUMA nodes
    was counterindicated, so a directory was added.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Terje Mathisen on Sat Nov 25 15:13:01 2023
    On 11/25/2023 3:00 AM, Terje Mathisen wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    Side note, XCHG all by itself automatically implies a LOCK prefix... :^)

    Yeah, I remember reading that once, but I still see LOCK prefxies
    being used. I guess a LOCK doesn't hurt, and people want to be on the
    safe side, or something. Ah, programmers...  :-)


    It is only the XCHG reg,[mem] forms which imply a LOCK!

    XCHG reg,reg has no bus transfers and does not assert LOCK.

    Thanks for the correction Terje! I must have missed that aspect in the
    docs. I need to reread it.



    NOP uses the opcode for XCHG AX,AX and that is how it was implemented on
    the original 8086, before pipelined/superscalar implementations needed a
    way to avoid false dependencies on (E)AX.

    It is interesting that LOCK is still implied in certain cases of XCHG. I
    wonder what might happen if they changed that. How much software might
    break... A lot?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Terje Mathisen on Sat Nov 25 15:19:34 2023
    On 11/25/2023 2:54 AM, Terje Mathisen wrote:
    MitchAlsup wrote:
    Scott Lurndal wrote:
    The fatal problem with ll/sc is scaling.  They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory
    container
    is used !! And it does not matter a whole lot if the underpinnings are
    LL/SC
    or 1CA1S.

    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    And this is why XADD is my favorite building block when writing multi-threaded code.

    XADD is absolutely wonderful wrt bakery algorithms, and other things,
    queues, etc... Fwiw Terje, here is one of my read/write locks that uses
    XADD (aka, fetch-and-add, ;^) for all of its fast-paths. Basically, it's
    all accounting:

    https://vorbrodt.blog/2019/02/14/read-write-mutex




    I used this as the only primitive in the NTPD server version I optimized
    for Meinberg's beefiest server, a 1U 4-core box with multiple ethernet
    ports, a timing gps receiver and a very stable local oscillator.

    The model I used was so simple that it was easy to show that it would be correct, have close to zero contention, and guarantee forward progress.

    This was made much easier because I was also forced to handle all
    incoming packets/requests in a single thread, so I decided to use N-1
    single writer/single reader queues in a round robin fashion, but XADD
    tends to work acceptably well even in scenarios with much more potential
    for contention.

    Nice! I remember reading one of your lock-free
    multi-producer/multi-consumer queues some years ago on this very group.
    Fwiw, check this out:

    https://groups.google.com/g/lock-free/c/acjQ3-89abE/m/a6-Di0GZsyEJ


    <pseudo code, membars aside for a moment>
    ______________________________________________
    struct cell { uint32_t ver; double state; };

    uint32_t head = 0;
    uint32_t tail = 0;
    cell cells[N]; // N must be a power of 2

    void init() {
        for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;
    }

    void producer(double state) {
        uint32_t ver = XADD(&head, 1);
        cell& c = cells[ver & (N - 1)];
        while (LOAD(&c.ver) != ver) backoff();
        c.state = state;
        STORE(&c.ver, ver + 1);
    }

    double consumer() {
        uint32_t ver = XADD(&tail, 1);
        cell& c = cells[ver & (N - 1)];
        while (LOAD(&c.ver) != ver + 1) backoff();
        double state = c.state;
        STORE(&c.ver, ver + N);
        return state;
    }
    ______________________________________________
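
    One plausible mapping of that pseudo code onto C++11 atomics (the
    acquire/release choices below are my reading of "the membars put
    aside for a moment", and the backoff is just a yield):
    ______________________________________________
    #include <atomic>
    #include <cstdint>
    #include <thread>

    constexpr uint32_t N = 128;                    // must be a power of 2

    struct cell { std::atomic<uint32_t> ver; double state; };

    std::atomic<uint32_t> head{0};
    std::atomic<uint32_t> tail{0};
    cell cells[N];

    void init() {
        for (uint32_t i = 0; i < N; ++i)
            cells[i].ver.store(i, std::memory_order_relaxed);
    }

    void producer(double state) {
        uint32_t ver = head.fetch_add(1, std::memory_order_relaxed);  // XADD
        cell& c = cells[ver & (N - 1)];
        while (c.ver.load(std::memory_order_acquire) != ver)          // wait for my turn
            std::this_thread::yield();                                // backoff()
        c.state = state;
        c.ver.store(ver + 1, std::memory_order_release);              // publish to consumer
    }

    double consumer() {
        uint32_t ver = tail.fetch_add(1, std::memory_order_relaxed);  // XADD
        cell& c = cells[ver & (N - 1)];
        while (c.ver.load(std::memory_order_acquire) != ver + 1)      // wait for producer
            std::this_thread::yield();
        double state = c.state;
        c.ver.store(ver + N, std::memory_order_release);              // recycle the cell
        return state;
    }
    ______________________________________________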


    This older experimental FIFO bakery algorithm is interesting to me. Btw,
    there is a way to get rid of the spin/backoff.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Sat Nov 25 15:24:10 2023
    On 11/25/2023 3:19 PM, Chris M. Thomasson wrote:
    On 11/25/2023 2:54 AM, Terje Mathisen wrote:
    MitchAlsup wrote:
    Scott Lurndal wrote:
    The fatal problem with ll/sc is scaling.  They scale poorly with large >>>> processor counts.


    One can take the PoV that the fatal problem is that only 1 memory
    container
    is used !! And it does not matter a whole lot if the underpinnings
    are LL/SC
    or 1CA1S.

    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    And this is why XADD is my favorite building block when writing
    multi-threaded code.

    XADD is absolutely wonderful wrt bakery algorithms, and other things,
    queues, ect... Fwiw Terje, here is one of my read/write locks that uses
    XADD (aka, fetch-and-add, ;^) for all of its fast-paths. Basically, its
    all accounting:

    https://vorbrodt.blog/2019/02/14/read-write-mutex

    [...]

    Take notice of the acq_rel barrier in the following function
    (read_unlock) on line 14:
    ____________________________
    void read_unlock()
    {
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
                m_wrwset.post();
    }
    ____________________________

    This level of memory ordering is required here.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Scott Lurndal on Sat Nov 25 15:33:20 2023
    On 11/25/2023 9:05 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in >>>>> an interesting way where there is a word(s) being updated. I say word's >>>>> is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit >>>>> system. Basically a DWCAS, or double-word compare-and-swap where it
    works with two contiguous words. This is different that a DCAS that can >>>>> work with two non-contiguous words.


    The later is generally known as DCADS Double compare and Double swap.
    I did see some academic literature a decade ago in wanting TCADS
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization
    capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is the the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container >> is used !! And it does not matter a whole lot if the underpinnings are LL/SC >> or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided. There still seem to be a dearth
    of programmers that understand multithreaded programming.

    1000 processors require a distributed setup wrt NUMA. For instance, an
    amortized multi-queue that knows about locality tends to work rather
    well. The queue is distributed out so that it's unlikely that two
    processors might contend on the same thing in the queue logic.





    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    Atomics are often supported directly by the shared LLC as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Sat Nov 25 15:34:41 2023
    On 11/25/2023 11:49 AM, MitchAlsup wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar
    but in an interesting way where there is a word(s) being updated.
    I say word's is because of cmpxchg8b on a 32 bit system. Or
    cmpxchg16b on a 64 bit system. Basically a DWCAS, or double-word
    compare-and-swap where it works with two contiguous words. This is
    different than a DCAS that can work with two non-contiguous words.


    The latter is generally known as DCADS: Double Compare And Double Swap.
    I did see some academic literature a decade ago wanting TCADS,
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization
    capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is that the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that
    address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory
    container
    is used !! And it does not matter a whole lot if the underpinnings
    are LL/SC
    or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided.   There still seem to be a dearth
    of programmers that understand multithreaded programming.


    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    On typical interconnects, it is only cubically not exponentially: Each
    processor agrees that one of them made forward progress after a
    quadratic amount of memory requests and a cubic amount of memory
    responses (including SNOOP requests).

    Maintaining forward progress is very important. You do want some periods
    of live lock to show up when the system is under stress, high load, ect...



    Atomics are often supported directly by the shared LLC as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Scott Lurndal on Sat Nov 25 15:37:49 2023
    On 11/25/2023 12:09 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say word(s)
    because of cmpxchg8b on a 32 bit system, or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap, where it
    works with two contiguous words. This is different than a DCAS that can
    work with two non-contiguous words.


    The latter is generally known as DCADS: Double Compare And Double Swap.
    I did see some academic literature a decade ago wanting TCADS,
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization >>>>>> capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is that the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container
    is used !! And it does not matter a whole lot if the underpinnings are LL/SC
    or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided. There still seem to be a dearth
    of programmers that understand multithreaded programming.


    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    On typical interconnects, it is only cubically not exponentially:

    Isn't 'cubically' 'exponentially' when the exponent is 3?

    Then one must also factor in NUMA latencies (we had a NUMA system
    where worst case round trip latency between nodes was
    800ns over InfiniBand). Broadcast probes (HT) between NUMA nodes
    were contraindicated, so a directory was added.

    Fwiw, one time I had an idea of using fractals to create locality in
    NUMA systems.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Scott Lurndal on Sat Nov 25 15:42:32 2023
    On 11/25/2023 12:09 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say word(s)
    because of cmpxchg8b on a 32 bit system, or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap, where it
    works with two contiguous words. This is different than a DCAS that can
    work with two non-contiguous words.


    The latter is generally known as DCADS: Double Compare And Double Swap.
    I did see some academic literature a decade ago wanting TCADS,
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization >>>>>> capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is that the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory container
    is used !! And it does not matter a whole lot if the underpinnings are LL/SC
    or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided. There still seem to be a dearth
    of programmers that understand multithreaded programming.


    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    On typical interconnects, it is only cubically not exponentially:

    Isn't 'cubically' 'exponentially' when the exponent is 3?

    cubic wrt the 4 roots of a complex number in the form of z^4.

    z^4 has four unique roots, such that any of those roots raised to
    the fourth power equals z.

    Fwiw, check out this fractal system of mine that uses roots of complex
    numbers to create some rather interesting things. Paul, one of my
    friends, was kind enough to do a little write-up of my method:

    https://paulbourke.net/fractals/multijulia


    Then one must also factor in NUMA latencies (we had a NUMA system
    where worst case round trip latency between nodes was
    800ns over InfiniBand). Broadcast probes (HT) between NUMA nodes
    were contraindicated, so a directory was added.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Sat Nov 25 15:45:06 2023
    On 11/25/2023 3:34 PM, Chris M. Thomasson wrote:
    On 11/25/2023 11:49 AM, MitchAlsup wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar but in
    an interesting way where there is a word(s) being updated. I say word(s)
    because of cmpxchg8b on a 32 bit system, or cmpxchg16b on a 64 bit
    system. Basically a DWCAS, or double-word compare-and-swap, where it
    works with two contiguous words. This is different than a DCAS that can
    work with two non-contiguous words.


    The latter is generally known as DCADS: Double Compare And Double Swap.
    I did see some academic literature a decade ago wanting TCADS,
    triple compare and double swap.

    It is for stuff like this that I invented esm so SW writers can
    program up any number of compares and and width of swapping. This
    means ISA does not have to change in the future wrt synchronization >>>>>> capabilities.

    It ends up that one of the most important properties is also found
    in LL/SC--and that is that the LL denotes the beginning of an ATOMIC
    event and SC denotes the end. LL/SC provide a bridge model to SW
    while ATOMIC events ending with xCAxS only provide notion one is
    in a ATOMIC event at the last instruction of the event.

    LL/SC can perform interference detection based on address while;
    xCAxS can only perform the interference based on the data at that
    address.....

    The fatal problem with ll/sc is scaling. They scale poorly with large
    processor counts.


    One can take the PoV that the fatal problem is that only 1 memory
    container
    is used !! And it does not matter a whole lot if the underpinnings
    are LL/SC
    or 1CA1S.

    Clearly 1000 processors beating on the same cache line is
    something to be avoided.   There still seem to be a dearth
    of programmers that understand multithreaded programming.


    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    On typical interconnects, it is only cubically not exponentially: Each
    processor agrees that one of them made forward progress after a
    quadratic amount of memory requests and a cubic amount of memory
    responses (including SNOOP requests).

    Maintaining forward progress is very important. You do typo(_NOT_) want some periods
    of live lock to show up when the system is under stress, high load, ect...
    [...]

    Damn it:
    ____________
    Maintaining forward progress is very important. You do _NOT_ want some
    periods of live lock to show up when the system is under stress, high
    load, ect...
    ____________

    Damn typos! Sorry everybody. ;^o

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to Scott Lurndal on Sat Nov 25 21:45:28 2023
    On 11/25/23 12:05 PM, Scott Lurndal wrote:
    [snip]
    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    An ISA could guarantee forward progress or forward progress for
    certain more limited cases (conceptually similar to ZArch
    constrained transactions). Even if the ISA does not make that
    guarantee, a "profile" of that ISA could make that guarantee.

    It should also be noted that a three instruction sequence
    beginning with LL and ending with SC could be subject to idiom
    recognition and translated into the equivalent of an atomic
    instruction. The only difference would be in the size of the
    "instruction" (and that the "instruction" could cross a cache
    block and/or page boundary).

    With 4-instruction idiom recognition, one could even translate to
    a fire-and-forget by destroying the value after SC and before any
    further use.

    One could also provide a special LL that inserts an implied SC after one
    computation instruction. With the "clever" interface I once
    proposed of having a register that zeroed after value consumption,
    one could use that "zero" register to indicate that the implied SC
    clears the value, i.e., a fire-and-forget; however, the special LL
    could include the information that the result is not written to a
    register. Such a special LL seems to reduce the code bloat
    compared to idiom recognition and make explicit that the atomic
    operation is specifically simple.

    With such a special LL, all single computation instruction atomics
    could be encoded and performance guarantees/guidelines could be
    extended at leisure. If the special LL provided encoding space for
    the instruction count before the SC is inserted — likely with
    larger values being illegal in the first implementation — one
    extension path would be easier to provide.

    The LL/SC interface could also be naturally extended to a limited
    transactional memory where internal loads and stores are allowed
    and are included in the atomic operation. Extending forward
    progress guarantees to such would be difficult, except when all
    memory operations are within a (larger-than-word) reservation
    granule.

    (One might be able to provide better guarantees than for general
    small transactions when the write set is singular with a
    sparser/larger read set. Ordering members of a collection of
    one-write-granule/multiple-read-granule atomics seems likely to be much
    less challenging than ordering for larger write sets. If such less
    complex atomics were given priority, perhaps their forward
    progress could be guaranteed. I am not certain this is practical and
    doubtful that it would be especially useful, but finding special
    cases that can be handled better is a standard optimization
    method.)

    Generating complexity is easy; producing elegance, not so much.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Sun Nov 26 16:12:36 2023
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 11/25/23 12:05 PM, Scott Lurndal wrote:
    [snip]
    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    An ISA could guarantee forward progress or forward progress for
    certain more limited cases (conceptually similar to ZArch
    constrained transactions). Even if the ISA does not make that
    guarantee, a "profile" of that ISA could make that guarantee.

    It should also be noted that a three instruction sequence
    beginning with LL and ending with SC could be subject to idiom
    recognition and translated into the equivalent of an atomic
    instruction. The only difference would be in the size of the
    "instruction" (and that the "instruction" could cross a cache
    block and/or page boundary).

    With 4-instruction idiom recognition, one could even translate to
    a fire-and-forget by destroying the value after SC and before any
    further use.

    Providing a special LL that inserted an implied SC after one
    computation instruction. With the "clever" interface I once
    proposed of having a register that zeroed after value consumption,
    one could use that "zero" register to indicate that the implied SC
    clears the value, i.e., a fire-and-forget; however, the special LL
    could include the information that the result is not written to a
    register. Such a special LL seems to reduce the code bloat
    compared to idiom recognition and make explicit that the atomic
    operation is specifically simple.

    With such a special LL, all single computation instruction atomics
    could be encoded and performance guarantees/guidelines could be
    extended at leisure. If the special LL provided encoding space for
    the instruction count before the SC is inserted — likely with
    larger values being illegal in the first implementation — one
    extension path would be easier to provide.

    The LL/SC interface could also be naturally extended to a limited
    transactional memory where internal loads and stores are allowed
    and are included in the atomic operation. Extending forward
    progress guarantees to such would be difficult, except when all
    memory operations are within a (larger-than-word) reservation
    granule.

    (One might be able to provide better guarantees than for general
    small transactions when the write set is singular with a
    sparser/larger read set. Ordering members of a collection of
    one-write-granule/multiple-read-granule atomics seems likely to be much
    less challenging than ordering for larger write sets. If such less
    complex atomics were given priority, perhaps their forward
    progress could be guaranteed. I am not certain this is practical and
    doubtful that it would be especially useful, but finding special
    cases that can be handled better is a standard optimization
    method.)

    Generating complexity is easy; producing elegance, not so much.

    Seems like far more work and complexity than just adding a set
    of atomic instructions to the ISA (like ARM did with LSE in V8.1).

    The CPU can perform the atomic immediately if it already has
    the cache line, or it can send it to LLC if it doesn't. It
    can also send the atomic to a PCI device or coprocessor. LL/SC is far
    less efficient and adding complexity to the implementation
    to fix a broken interface doesn't make much sense to me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun Nov 26 16:21:10 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    Isn't 'cubically' 'exponentially' when the exponent is 3?

    To complement the answer by Mitch Alsup:

    No. Cubic is an instance of polynomial.

    For a given c1, c2 and a growing n:

    Polynomial: n^c1 (simplified: there may be a multiplier and there may
    be terms with a smaller exponent, but for large n, the largest
    exponent is the one that dominates).

    Exponential: c2^n

    Even with a very large c1 and a small c2>1, there is an n for which
    n^c1<c2^n. That's why exponential is a different complexity class
    from polynomial (in complexity theory).
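
    As a concrete illustration (constants chosen arbitrarily, not from any
    measurement in this thread), a few lines of C locate the crossover for
    c1=3 and c2=1.1; the point is only that such a crossover always exists:

        #include <stdio.h>
        #include <math.h>

        int main(void)
        {
          double c1 = 3.0, c2 = 1.1;
          unsigned last = 0;
          for (unsigned n = 1; n <= 1000; n++)
            if (pow(n, c1) >= pow(c2, n))
              last = n;         /* last n where the polynomial still wins */
          printf("n^%.0f >= %.1f^n only up to n = %u\n", c1, c2, last);
          return 0;
        }

    (It reports a crossover around n = 160; beyond that the exponential
    dominates forever.)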

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Sun Nov 26 12:06:26 2023
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:

    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    Atomics are often supported directly by the shared LLC as well.

    An implementation of LL/SC can guarantee forward progress by pinning
    the cache line for a period of time. The cache pin flag is reset when
    the timer times out or an interrupt occurs. Store Conditional just checks
    if the cache pin flag is still set and, if it is, performs the store.

    One cheap way to pin a cache line is to simply shut off processing
    inbound coherence command msgs so it doesn't see any invalidates.
    Or it could continue processing inbound coherence msgs until it sees
    an invalidate for the pinned line then stop processing the inbound msgs.
    Or it can go full asynchronous and remember the pinned line invalidate
    and perform it after the pin flag is reset by timeout or SC.

    The main difference between LL/SC and atomic RMW like Atomic Fetch Add
    is their effect on coherence message latency,
    which in turn can affect the whole system performance.

    LL/SC must be implemented by moving the value from cache into the core
    for general instructions to operate on it, so the duration the line is
    pinned is lengthened, and so is the latency of coherence message replies.

    While AFADD may also be implemented by pinning a line and moving the data
    through the core, the atomic RMW operations ADD, AND, OR, XOR are simple
    enough that they can be moved out to the cache controller where they can be
    done in 1 clock, so the line doesn't need to be pinned at all and
    coherence msg processing is not delayed.
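
    A rough C11 sketch of the software-visible difference (my example, using
    only standard <stdatomic.h>): the fetch-add is a single RMW that can be
    shipped to wherever the line lives and always completes, while the
    compare-exchange loop plays the same retry game as LL/SC and can fail
    repeatedly under interference:

        #include <stdatomic.h>

        atomic_long counter;

        void add_fetchadd(long v)   /* single RMW; eligible for execution at the cache/LLC */
        {
          atomic_fetch_add_explicit(&counter, v, memory_order_relaxed);
        }

        void add_casloop(long v)    /* retry loop; each failure is analogous to an SC failure */
        {
          long old = atomic_load_explicit(&counter, memory_order_relaxed);
          while (!atomic_compare_exchange_weak_explicit(&counter, &old, old + v,
                                                        memory_order_relaxed,
                                                        memory_order_relaxed))
            ;                       /* 'old' is refreshed by the failed exchange */
        }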

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to aph@littlepinkcloud.invalid on Sun Nov 26 11:58:57 2023
    aph@littlepinkcloud.invalid wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    WRT the draining for synchronization, I'm not sure what you think
    the difference is between making it appear to be done and actually
    doing so. Sure there may be some optimizations possible, but the net
    effect on the externally observable cache state must be the same.

    As I wrote, you don't have to actually do the synchronization dance if
    it's not needed. You can speculate that there won't be a conflict,
    detect any memory-ordering violation, and roll back and replay.

    Adrew.

    Yeah, you are correct. I needed to think about it overnight
    to see how the whole thing could work.

    I still don't see (much) advantage to the cache transparently performing
    the stores out of order as they are already retired so disposing of a
    store value sooner won't increase processor concurrency and IPC.
    The advantage, if there is one, would be if the core was resource
    constrained, like it had run out of LSQ entries or miss buffers,
    and it releases resources for reuse sooner.

    But cache responding to the loads sooner would increase the concurrency
    and IPC by having more instructions ready to retire sooner.
    This could be relatively easy to do provided the core has the ability
    to replay instructions from an affected load onwards.

    There can be multiple store misses in flight at once, but as long as
    the actual line store changes are performed in order then the cache
    controller doesn't need a mechanism to detect and correct visible order
    violations.

    This greatly simplifies the problem because it means a younger load
    will only bypass an older load or store if the older op missed the cache, because if it hit it would have been performed immediately.

    When a load op reaches the end of the cache access pipeline and detects
    a hit, it copies the value and also sets a miss-in-flight flag on the
    result if there are any miss buffers allocated. That miss-in-flight flag
    means this result just bypassed an older load or store.

    The load-store queue has a CAM for detecting same-cache-line operations. Assuming stores stay in the LSQ until they are complete and the line updated,
    a store that misses cache will prevent younger loads from being removed
    from the LSQ, which will keep their CAM entries present.
    Those CAM entries can be used to detect ordering violations
    and trigger a replay from the load onward.

    When the cache replies to a load op with the miss-in-flight flag set
    the load marks its CAM entry as a bypasser. When invalidate messages
    arrive at the cache controller the line number would also be passed
    through the LSQ same-line CAM, and if it hits and was a bypasser then
    we trigger a replay trap, with the LSQ load entry providing the map
    back to the instruction queue point to trigger the replay.

    When the oldest load or store completes it is removed from the LSQ,
    allowing the younger bypasser load to be removed from the LSQ,
    which removes its same-line CAM entry and the order violation detector.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sun Nov 26 20:59:02 2023
    EricP wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:

    One can also take the PoV that all currently known CPU driven ATOMICs
    suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
    at least guarantee forward progress::: ADDtoMemory( addr, value )
    without scaling poorly.....

    The difference between an atomic and 'll' is that the atomic is
    guaranteed to make forward progress, while the 'll' generally
    isn't and the chances of a SC failure increase exponentially with
    the processor count.

    Atomics are often supported directly by the shared LLC as well.

    An implementation of LL/SC can guarantee forward progress by pinning
    the cache line for a period of time. The cache pin flag is reset when
    the timer times out or an interrupt occurs. Store Conditional just checks
    if the cache pin flag is still set and if it is performs the store.

    One cheap way to pin a cache line is to simply shut off processing
    inbound coherence command msgs so it doesn't see any invalidates.
    Or it could continue processing inbound coherence msgs until it sees
    an invalidate for the pinned line then stop processing the inbound msgs.
    Or it can go full asynchronous and remember the pinned line invalidate
    and perform it after the pin flag is reset by timeout or SC.

    The main difference between LL/SC and atomic RMW like Atomic Fetch Add
    is their effect on coherence message latency,
    which in turn can affect the whole system performance.

    LL/SC must be implemented by moving the value from cache into the core
    for general instructions to operate on it, so the duration the line is
    pinned is lengthened and the latency for coherence message replies.

    While AFADD may also be implemented by pinning a line and moving the data through the core, the atomic RMW operations, ADD AND OR XOR are simple
    enough that they can be moved out to the cache controller where it can be done in 1 clock, so the line doesn't need to be pinned at all and
    coherence msg processing is not delayed.

    Compared to the Signal strength buffers and various multiplexers,
    the integer ALU is very inexpensive. IIRC, a 32-bit integer ALU
    is less than ¼th the size of a 2KB SRAM Macro. Given an L2 will
    have at least 256 of these macro blocks, the Op to Memory
    functionality is very inexpensive.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Sun Nov 26 13:55:42 2023
    On 11/3/2023 3:29 PM, Chris M. Thomasson wrote:
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures.  You can find it
    along with a description and results on
    <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]



    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.
    ^^^^^^^^^^^^^^^^^^^^^^^

    God damn it! Another typo:

    Let me correct that, argh!
    ________________________
    Is the only arch out there that _does_ __require__ an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.
    ________________________

    Thank you for not flaming me to death on that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Anton Ertl on Sun Nov 26 14:00:15 2023
    On 11/4/2023 10:40 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on
    <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.

    I don't know any architecture that requires memory barriers for single-threaded programs that access just memory, not even Alpha.

    I am referring to a program that uses multiple threads.


    You may be thinking of the memory consistency model of Alpha, which is
    even weaker than everything else I know of. This is not surprising,
    given that a prominent advocacy paper for weak consistency [adve&gharachorloo95] came out of DEC.
    [...]

    Yup. A DEC alpha requires a memory barrier even for RCU iteration on the
    read side of the algorithm. Afaict, std::memory_order_consume fits the
    bill. It's a nop on every arch, except a DEC alpha. Unless I am missing
    an arch that is as weak or weaker than a DEC alpha.

    https://en.cppreference.com/w/cpp/atomic/memory_order
    _______________
    A load operation with this memory order performs a consume operation on
    the affected memory location: no reads or writes in the current thread dependent on the value currently loaded can be reordered before this
    load. Writes to data-dependent variables in other threads that release
    the same atomic variable are visible in the current thread. On most
    platforms, this affects compiler optimizations only (see Release-Consume ordering below).
    _______________
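
    A minimal C11 sketch of such a reader (names are mine, not from the
    thread); note that current compilers typically promote
    memory_order_consume to memory_order_acquire rather than tracking
    dependencies:

        #include <stdatomic.h>
        #include <stddef.h>

        struct node { int value; };

        static _Atomic(struct node *) shared_node;

        int rcu_style_read(void)
        {
          struct node *p = atomic_load_explicit(&shared_node, memory_order_consume);
          /* The dereference below is data-dependent on p, so no further
             barrier is needed on anything except DEC Alpha. */
          return p ? p->value : -1;
        }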

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sun Nov 26 23:32:53 2023
    Chris M. Thomasson wrote:

    On 11/4/2023 10:40 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/3/2023 2:15 AM, Anton Ertl wrote:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it >>>> along with a description and results on
    <http://www.complang.tuwien.ac.at/anton/memdep/>.
    [...]

    Is the only arch out there that does not require an explicit memory
    barrier for data-dependent loads a DEC Alpha? I think so.

    I don't know any architecture that requires memory barriers for
    single-threaded programs that access just memory, not even Alpha.

    CDC 6600 (via its stunt box) could calculate addresses in program order
    and access memory in reverse order due to a conflict and the timing of
    the stunt box loop versus the timing of the memory bank busy checks. Here
    is a single processor that sees memory order different than processor
    order.......

    I am referring to a program that uses multiple threads.


    You may be thinking of the memory consistency model of Alpha, which is
    even weaker than everything else I know of. This is not surprising,
    given that a prominent advocacy paper for weak consistency
    [adve&gharachorloo95] came out of DEC.
    [...]

    Yup. A DEC alpha requires a memory barrier even for RCU iteration on the
    read side of the algorithm. Afaict, std::memory_order_consume fits the
    bill. It's a nop on every arch, except a DEC alpha. Unless I am missing
    an arch that is as weak or weaker than a DEC alpha.

    https://en.cppreference.com/w/cpp/atomic/memory_order
    _______________
    A load operation with this memory order performs a consume operation on
    the affected memory location: no reads or writes in the current thread dependent on the value currently loaded can be reordered before this
    load. Writes to data-dependent variables in other threads that release
    the same atomic variable are visible in the current thread. On most platforms, this affects compiler optimizations only (see Release-Consume ordering below).
    _______________

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to Scott Lurndal on Mon Nov 27 09:33:59 2023
    On 11/26/23 11:12 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    [snip paths for extending LL/SC guarantees]
    Generating complexity is easy; producing elegance, not so much.

    Seems like far more work and complexity than just adding a set
    of atomic instructions to the ISA (like ARM did with LSE in V8.1).

    Composing primitives provides more flexibility at a higher cost
    for hardware management and communication to the programmer/
    compiler of which special cases are optimized by hardware (and
    code density).

    For a special LL — or perhaps better a short prefix to modify
    ordinary load instructions — hardware's work seems not to be that
    much more complex. (Note: I do not design hardware!) This is
    effectively a form of instruction fusion. With a special LL, the
    decoder would know early that this is a case for fusion/special
    translation.

    Instruction fusion is more complex than instruction cracking, but
    atomic operations are already complex.

    This also shifts the communication of performance guarantees from
    "instruction is valid" to "processor implements profile that
    provides that guarantee". However, even with atomic instructions
    one is likely to have different performance aspects for different implementations. All might guarantee local forward progress as
    part of the instruction definition, but the performance of the
    operation would vary especially under contention. The performance
    might also vary depending on the operation (e.g., adding a small
    positive value might be faster/more parallel via sharing the
    higher bit carry-out)

    Atomic instructions are denser, but are subject to instruction
    type expansion. Software written using a generic interface could
    be more nearly best-performance for future implementations that
    improve, e.g., an atomic floating-point addition operation.

    Fitting atomic instructions into a fixed-size 32-bit instruction
    also means that addressing modes are highly constrained. With such
    a constraint, one could conceive of a prefix that makes a
    following compute operation into an atomic operation (where a
    particular source register is actually an indirect operand and
    external memory destination). (Non-commutative operations would
    require the ability to specify any operand as being in memory.)

    If one need not consider binary compatibility (e.g., a translation
    layer between the distribution format and the machine format),
    the encoding choices would be quite different.

    The CPU can perform the atomic immediately if it already has
    the cache line, or it can send it to LLC if it doesn't. It
    can also send the atomic to a PCI device or coprocessor. LL/SC is far
    less efficient and adding complexity to the implementation
    to fix a broken interface doesn't make much sense to me.

    I do not consider LL/SC a "broken interface", except perhaps in
    facilitating inappropriate implementations. More general
    interfaces tend to move more of the implementation behind the
    abstraction, so performance and other "abstraction leakage" must
    be communicated/contracted outside of the interface itself.
    Without such communication/guarantees, implementations can
    choose the minimum-effort design that barely meets the interface
    definition or even be able to get away with an implementation
    that fails to meet the specification (if that failure is
    sufficiently obscure by rarity and other possible failure causes).☹

    More general interfaces will be less efficient when the important
    or common use cases are few (over all users of the interface) and
    known beforehand.

    Accumulating special cases, especially without foresight, can
    easily lead to inelegance. (I like the density and semantic
    clarity of atomic instructions — and they are not inconsistent
    with also having a more general atomic interface.)

    The "provide primitives not solutions" guideline for interfaces
    (specifically ISAs) is a *guideline* reacting against a bias
    toward localized solutions. An opposite bias also exists, to
    incorporate every possible use case into a general interface,
    typically leading to excess complexity.

    LL/SC itself is a "solution" to optimize atomicity, which would be
    physically possible otherwise but at greater overhead.

    My 66000's exotic synchronization mechanism has substantial
    generality and requires both an initiating instruction (a special
    load) and a terminating instruction (which could be just a control
    flow instruction leaving the basic block) as well as any
    computation and predication. While it is described as handling
    failures with replay, "as if" can be applied, especially for
    simple atomic operations. Three instruction fusion might be
    challenging (and I suspect providing hints/directives that the
    operation is of that kind might be helpful). I do not know how
    Mitch Alsup implemented ESM for his scalar core or how he
    envisions implementing it in a "maximum-effort" core.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Nov 27 18:51:58 2023
    Paul A. Clayton wrote:

    On 11/26/23 11:12 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    [snip paths for extending LL/SC guarantees]
    Generating complexity is easy; producing elegance, not so much.

    Seems like far more work and complexity than just adding a set
    of atomic instructions to the ISA (like ARM did with LSE in V8.1).

    Composing primitives provides more flexibility at a higher cost
    for hardware management and communication to the programmer/
    compiler of which special cases are optimized by hardware (and
    code density).

    For a special LL — or perhaps better a short prefix to modify
    ordinary load instructions — hardware's work seems not to be that
    much more complex. (Note: I do not design hardware!) This is
    effectively a form of instruction fusion. With a special LL, the
    decoder would know early that this is a case for fusion/special
    translation.

    Instruction fusion is more complex than instruction cracking, but
    atomic operations are already complex.

    This also shifts the communication of performance guarantees from "instruction is valid" to "processor implements profile that
    provides that guarantee". However, even with atomic instructions
    one is likely to have different performance aspects for different implementations. All might guarantee local forward progress as
    part of the instruction definition, but the performance of the
    operation would vary especially under contention. The performance
    might also vary depending on the operation (e.g., adding a small
    positive value might be faster/more parallel via sharing the
    higher bit carry-out)

    Atomic instructions are denser, but are subject to instruction
    type expansion. Software written using a generic interface could
    be more nearly best-performance for future implementations that
    improve, e.g., an atomic floating-point addition operation.

    Fitting atomic instructions into a fixed-size 32-bit instruction
    also means that addressing modes are highly constrained. With such
    a constraint, one could conceive of a prefix that makes a
    following compute operation into an atomic operation (where a
    particular source register is actually an indirect operand and
    external memory destination). (Non-commutative operations would
    require the ability to specify any operand as being in memory.)

    If one need not consider binary compatibility (e.g., a translation
    layer between the distribution format and the machine format),
    the encoding choices would be quite different.

    The CPU can perform the atomic immediately if it already has
    the cache line, or it can send it to LLC if it doesn't. It

    There comes a time when you want and need ATOMICs with more than
    one addressed location. Things like DCADS.....

    can also send the atomic to a PCI device or coprocessor. LL/SC is far
    less efficient and adding complexity to the implementation
    to fix a broken interface doesn't make much sense to me.

    I do not consider LL/SC a "broken interface", except perhaps in
    facilitating inappropriate implementations. More general
    interfaces tend to move more of the implementation behind the
    abstraction, so performance and other "abstraction leakage" must
    be communicated/contracted outside of the interface itself.
    Without such communication/guarantees, implementations can
    choose the minimum-effort design that barely meets the interface
    definition or even be able to get away with an implementation
    that fails to meet the specification (if that failure is
    sufficiently obscure by rarity and other possible failure causes).☹

    More general interfaces will be less efficient when the important
    or common use cases are few (over all users of the interface) and
    known beforehand.

    Accumulating special cases, especially without foresight, can
    easily lead to inelegance. (I like the density and semantic
    clarity of atomic instructions — and they are not inconsistent
    with also having a more general atomic interface.)

    The "provide primitives not solutions" guideline for interfaces
    (specifically ISAs) is a *guideline* reacting against a bias
    toward localized solutions. An opposite bias also exists, to
    incorporate every possible use case into a general interface,
    typically leading to excess complexity.

    LL/SC itself is a "solution" to optimize atomicity, which would be
    physically possible otherwise but at greater overhead.

    My 66000's exotic synchronization mechanism has substantial
    generality and requires both a initiating instruction (a special
    load) and a terminating instruction (which could be just a control
    flow instruction leaving the basic block) as well as any
    computation and predication. While it is described as handling
    failures with replay, "as if" can be applied, especially for
    simple atomic operations.

    a) the terminator must be an outbound memory reference (not a branch,
    because I specifically want to perform flow control within the event).

    b) When the initiator is decoded, the recovery point is the initiator.
    When the interrogator is decoded, the recovery point changes to
    the branch target.
    Three instruction fusion might be
    challenging (and I suspect providing hints/directives that the
    operation is of that kind might be helpful). I do not know how
    Mitch Alsup implemented ESM for his scalar core or how he
    envisions implementing it in a "maximum-effort" core.

    I use the miss buffer to watch for SNOOPs to participating cache
    lines, and I have a branch instruction which interrogates the
    miss buffer to see if any interference has transpired. Using
    something that is already present and already doing pretty
    much what is desired saves complexity.

    In the maximal effort design, I use the Memory Dependence Matrix
    to control memory order, each memory reference has a 2-bit state
    which determines its allowed memory order {no order, causal,
    sequential consistency, strongly ordered} and this controls
    the sequencing of the memory events.

    Any interrupt, exception, SysCall, will cause the ATOMIC event
    to fail to the recovery point. So, you cannot single step through
    events.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Paul A. Clayton on Tue Nov 28 10:11:24 2023
    Paul A. Clayton <paaronclayton@gmail.com> wrote:
    On 11/26/23 11:12?AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    [snip paths for extending LL/SC guarantees]
    Generating complexity is easy; producing elegance, not so much.

    Seems like far more work and complexity than just adding a set
    of atomic instructions to the ISA (like ARM did with LSE in V8.1).

    Composing primitives provides more flexibility at a higher cost
    for hardware management and communication to the programmer/
    compiler of which special cases are optimized by hardware (and
    code density).

    Mmm, but the point of LSE, as the name suggests, is to facilitate
    things like fire-and-forget counters on remote nodes without ping-
    ponging cache lines.
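
    For example (my sketch, plain C11): a statistics counter whose fetch-add
    result is discarded is exactly the fire-and-forget case; compilers
    targeting ARMv8.1-A with LSE can emit it as a single store-only atomic
    add (STADD) that the interconnect may execute at the line's home node
    instead of pulling the line to the requesting core:

        #include <stdatomic.h>

        atomic_ulong stats_counter;

        void stats_bump(void)
        {
          /* result unused => eligible for a store-only atomic add (e.g. STADD) */
          atomic_fetch_add_explicit(&stats_counter, 1, memory_order_relaxed);
        }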

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to aph@littlepinkcloud.invalid on Tue Nov 28 02:22:31 2023
    On 11/28/2023 2:11 AM, aph@littlepinkcloud.invalid wrote:
    Paul A. Clayton <paaronclayton@gmail.com> wrote:
    On 11/26/23 11:12?AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    [snip paths for extending LL/SC guarantees]
    Generating complexity is easy; producing elegance, not so much.

    Seems like far more work and complexity than just adding a set
    of atomic instructions to the ISA (like ARM did with LSE in V8.1).

    Composing primitives provides more flexibility at a higher cost
    for hardware management and communication to the programmer/
    compiler of which special cases are optimized by hardware (and
    code density).

    Mmm, but the point of LSE, as the name suggests, is to facilitate
    things like fire-and-forget counters on remote nodes without ping-
    ponging cache lines.

    Keep split counters in mind as well...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Chris M. Thomasson on Tue Nov 28 02:45:33 2023
    On 11/23/2023 11:41 PM, Chris M. Thomasson wrote:
    On 11/23/2023 11:38 PM, Chris M. Thomasson wrote:
    On 11/23/2023 8:33 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/19/2023 2:23 PM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    So, while it does not eliminate live/dead-lock situations, it allows SW
    to be constructed to avoid live/dead-lock situations:: Why is a value
    which is provided when an ATOMIC event fails. 0 means success, negative
    values are spurious (buffer overflows, ...) while positives represent
    the number of competing threads, so the following case skips elements
    on a linked list to decrease future interference.

    Element* getElement( unsigned Key )
    {
         int count = 0;
         for( p = structure.head; p ; p = p->next )
         {
              if( p->Key == Key )
              {
                   if( count-- < 0 )
                   {
                        esmLOCK( p );
                        prev = p->prev;
                        esmLOCK( prev );
                        next = p->next;
                        esmLOCK( next );
                        if( !esmINTERFERENCE() )
                        {
                             prev->next = next;
                             next->prev = prev;
                             p->prev = NULL;
                             esmLOCK( p->next = NULL );
                             return p;
                        }
                        else
                        {
                             count = esmWHY();
                             p = structure.head;
                        }
                   }
              }
         }
         return NULL;
    }

    Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
    that happens when a timer goes off and n threads all want access to the
    work queue, down to BigO( 3 ) yes=constant, but in practice it is reduced
    to BigO( ln( n ) ) when requesters arrive in random order at random time.

    I remember hearing from my friend Joe Seigh, who worked at IBM,
    that they had some sort of logic that would prevent live lock in a
    compare and swap wrt their free pool manipulation logic. Iirc, it
    was somewhat related to ABA, hard to remember right now, sorry. I
    need to find that old thread back in comp.programming.threads.

    Depending on system size: there can be several system function
    units that grant "order" for ATOMIC events. These are useful for
    64+node systems and unnecessary for less than 8-node systems.
    Disjoint memory spaces can use independent ATOMIC arbiters, and
    whether they are in use or not is invisible to SW.


    ?

    Check this out the old thread:

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ

    Humm, your arch seems pretty neat/interesting to me. I need to learn
    more about it. Can it be abused by a rogue thread that keeps
    altering a cacheline(s) that is participating in the atomic block,
    so to speak? Anyway, I am busy with family time. Will get back to you.

    While possible, it is a lot less likely than on a similar architecture
    without any of the bells and whistles.

    Iirc, Joe Seigh mentioned how CS on IBM systems would prevent live
    lock by locking a bus or asserting a signal, which would ensure that a
    compare-and-swap would never get into a death spiral of always failing.
    Iirc, Microsoft has something like this in its lock-free stack SList
    or something. Cannot remember exactly right now. Sorry.


    Wrt Microsoft's SList, I think it goes into the kernel to handle memory reclamation issues, aba, and such...

    SList is basically identical to IBM's free pool manipulation, iirc.




    Joe Seigh mentioned internal docs, candy striped.


    Fwiw, here is some of my work:

    https://youtu.be/HwIkk9zENcg

    Octopi in a box playing ball.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Chris M. Thomasson on Tue Nov 28 17:01:28 2023
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/28/2023 2:11 AM, aph@littlepinkcloud.invalid wrote:
    Paul A. Clayton <paaronclayton@gmail.com> wrote:
    Composing primitives provides more flexibility at a higher cost
    for hardware management and communication to the programmer/
    compiler of which special cases are optimized by hardware (and
    code density).

    Mmm, but the point of LSE, as the name suggests, is to facilitate
    things like fire-and-forget counters on remote nodes without ping-
    ponging cache lines.

    Keep split counters in mind as well...

    Sure, I know (we all know?) that there are algorithms to do
    distributed counting, but if hardware can do something easily and
    cheaply, why not? Sure, it won't scale as well up to very large node
    counts.
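
    A minimal sketch of the split-counter idea mentioned above (C11; the
    names and sizes are illustrative): each thread bumps its own padded slot
    and a reader sums the slots, trading read cost for uncontended writes:

        #include <stdatomic.h>

        enum { NSLOTS = 64 };

        struct slot {
          _Alignas(64) atomic_long n;       /* one counter per cache line */
        };
        static struct slot counters[NSLOTS];

        void count_inc(unsigned self)       /* 'self' = this thread's slot index */
        {
          atomic_fetch_add_explicit(&counters[self % NSLOTS].n, 1,
                                    memory_order_relaxed);
        }

        long count_read(void)               /* approximate total */
        {
          long sum = 0;
          for (int i = 0; i < NSLOTS; i++)
            sum += atomic_load_explicit(&counters[i].n, memory_order_relaxed);
          return sum;
        }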

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to MitchAlsup on Tue Nov 28 16:20:44 2023
    On 11/24/2023 10:32 AM, MitchAlsup wrote:
    Chris M. Thomasson wrote:

    On 11/23/2023 10:52 AM, EricP wrote:
    MitchAlsup wrote:

    There are two kinds of barriers/fences (I don't know if there are
    official terms for them), which are local bypass barriers, and global
    completion barriers.

    Bypass barriers restrict which younger ops in the local load-store queue
    may bypass and start execution before older ops have made a value
    locally visible.

    Completion barriers block younger ops from starting execution before
    older ops have completed and read or written globally visible values.

    You appear to be referring to bypass barriers whereas I'm referring to
    completion barriers which require globally visible results.


    Basically, how to map the various membar ops into an arch that can be
    RMO. Assume the programmers have no problem with it... ;^o SPARC did
    it, but, is it worth it now? Is my knowledge of dealing with relaxed
    systems, threads/processes and membars obsoleted? shit man... ;^o

    If the page being mapped is properly identified in the PTE, then there is
    no reason to need any MemBars.
    Also Note:: MemBars are the WRONG abstraction--a MemBar is like a wall whereas what programmers want is a bridge. As long as you are on the
    bridge (inside an ATOMIC event) you want one memory model and when you
    leave the bridge you are free to use a more performant model. MemBars
    only demark the edges of the bridge, they don't cover the whole bridge.

    Well, there are different types of membars...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to MitchAlsup on Tue Nov 28 21:48:12 2023
    On 11/27/23 1:51 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]
    My 66000's exotic synchronization mechanism has substantial
    generality and requires both a initiating instruction (a special
    load) and a terminating instruction (which could be just a control
    flow instruction leaving the basic block) as well as any
    computation and predication. While it is described as handling
    failures with replay, "as if" can be applied, especially for
    simple atomic operations.

    a) the terminator must be an outbound memory reference (not a branch
    because I specifically want to perform flow control within the event.

    I am sorry for the misinformation. I should have just left off
    that paragraph since I hurriedly looked at the My 66000
    documentation (specifically the assembly for a simple lock
    setting function). I thought that the store was not annotated
    (even though the comment said "unlocked") and assumed that a
    control flow statement (which I vaguely remembered was not
    allowed inside such atomics, being restricted to predication —
    assuming my memory for that is correct!) indicated an end.

    Thank you for the quick correction.

    b) When the initiator is decoded, the recovery point is the
    initiator.
      When the interrogator is decoded, the recovery point changes to
      the branch target.
                              Three instruction fusion might be
    challenging (and I suspect providing hints/directives that the
    operation is of that kind might be helpful). I do not know how
    Mitch Alsup implemented ESM for his scalar core or how he
    envisions implementing it in a "maximum-effort" core.

    I use the miss buffer to watch for SNOOPs to participating cache
    lines, and I have a branch instruction which interrogates the miss
    buffer to see if any interference has transpired. Using
    something that is already present and already doing pretty
    much what is desired saves complexity.

    In the maximal effort design, I use the Memory Dependence Matrix
    to control memory order, each memory reference has a 2-bit state
    which determines its allowed memory order {no order, causal,
    sequential consistency, strongly ordered} and this controls the
    sequencing of the memory events.

    Any interrupt, exception, SysCall, will cause the ATOMIC event
    to fail to the recovery point. So, you cannot single step through
    events.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Chris M. Thomasson on Mon Dec 4 15:34:03 2023
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:

    For some damn reason, the more and more I look at your work wrt the
    atomics in here, the more I think about a special deadlock-free locking
    mechanism of mine called the multex. It would sort addresses and remove
    duplicates, then lock them using a mutex locking table via hash. Removal
    of duplicates removed the need for recursive locking, and the sorting
    allowed for deadlock-free locking... Therefore the hash lock table was
    100% separated from the user logic. Is there any relation to this, or am
    I off base here? Thanks.

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
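    A minimal sketch of the multex as described above -- sort, de-duplicate,
    then take mutexes from a fixed hash-indexed table in that order (names and
    details are hypothetical, not Chris's original code; sorting and
    de-duplication are done here on the hashed slot index so that two addresses
    colliding in the table cannot self-deadlock either):

        #include <algorithm>
        #include <cstddef>
        #include <functional>
        #include <mutex>
        #include <vector>

        class multex {
            static const std::size_t N = 256;   // fixed lock table, separate from user data
            std::mutex table_[N];

            static std::size_t slot(const void* p) {
                return std::hash<const void*>{}(p) % N;
            }

        public:
            // Lock the objects addressed by ptrs, deadlock-free.
            std::vector<std::size_t> lock(const std::vector<const void*>& ptrs) {
                std::vector<std::size_t> slots;
                for (const void* p : ptrs) slots.push_back(slot(p));
                std::sort(slots.begin(), slots.end());              // global order => no deadlock
                slots.erase(std::unique(slots.begin(), slots.end()),
                            slots.end());                           // dedupe => no recursive locking
                for (std::size_t s : slots) table_[s].lock();
                return slots;                                       // pass back to unlock()
            }

            void unlock(const std::vector<std::size_t>& slots) {
                for (auto it = slots.rbegin(); it != slots.rend(); ++it)
                    table_[*it].unlock();
            }
        };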

    This isn't a million miles from TL2, http://people.csail.mit.edu/shanir/publications/Transactional_Locking.pdf.
    It may not be immediately obvious, but it's essentially the same
    solution.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Dec 19 22:36:17 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I have written a microbenchmark for measuring how memory dependencies
    affect the performance of various microarchitectures. You can find it
    along with a description and results on <http://www.complang.tuwien.ac.at/anton/memdep/>.

    We now have a new toy, a NUC with a Core i3-1315U (which runs its cores
    700MHz slower than the documented max turbo, but that does not hurt my
    measurements). I have run memdep with that, which needs some
    interesting twists:

    taskset -c 2 make CYCLES=cpu_core/cycles:u/ #for Golden Cove results
    taskset -c 5 make CYCLES=cpu_atom/cycles:u/ #for Gracemont results

    The results are also interesting; I show here each of the Alder Lake
    cores and its predecessor, and also Firestorm (Apple M1 performance
    core) for good measure:

    X    Y    Z    A    B    B1   B2   C    D    E    F    G    H
    0.55 0.55 0.55 0.52 0.84 0.84 0.61 1.68 3.32 3.29 5.00 4.94 6.36 Firestorm
    1.21 1.00 1.00 1.00 1.53 1.55 1.19 1.49 3.00 3.00 4.50 4.50 6.11 Tremont
    1.00 1.00 1.00 0.75 0.75 0.70 0.70 0.75 0.70 1.00 0.75 0.75 1.00 Gracemont
    1.00 1.00 1.00 0.81 0.81 0.81 0.81 0.75 0.81 0.84 0.82 0.86 1.00 Tiger Lake
    1.00 1.00 1.00 0.55 0.75 0.65 0.65 0.78 0.68 0.67 0.66 0.65 0.65 Golden Cove

    So Gracemont (the small core) includes zero-cycle store-to-load
    forwarding, a feature that even Firestorm does not have.

    Golden Cove is even more interesting. Apparently it can not only
    perform zero-cycle store-to-load forwarding, it can even perform adds
    that take at most 0.65 cycles (Column H). My guess is that it can
    perform two back-to-back adds in one cycle, and the store-to-load
    optimization is so good that it does not prevent this combination of
    two adds.

    But I have not read anything about Golden Cove performing back-to-back
    adds in one cycle. Did I miss that, or are my measurements (or the computations) wrong? It should be easy enough to test the
    back-to-back loads, but I am too tired to do it now.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Thu Dec 21 17:21:44 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Golden Cove is even more interesting. Apparently it can not only
    perform zero-cycle store-to-load forwarding, it can even perform adds
    that take at most 0.65 cycles (Column H). My guess is that it can
    perform two back-to-back adds in one cycle, and the store-to-load optimization is so good that it does not prevent this combination of
    two adds.

    It cannot do back-to-back adds in general, but it can perform 0-cycle
    (or close to 0) constant adds, as I found out with the microbenchmark
    at <http://www.complang.tuwien.ac.at/anton/additions/>.

    You can find the text portion of this page below; there are a few
    links on the page that are not produced below.

    If you have information on any of the things I don't know, please let
    me know about it.

    - anton

    Dependent Additions Microbenchmark

    Most CPU cores have a latency of one cycle for an integer addition,
    so a sequence like

    addq $1, %rax
    addq $1, %rax
    addq $1, %rax
    addq $1, %rax
    addq $1, %rax

    usually has 5 cycles of total latency. The P-Core of the Core
    i3-1315U (a Raptor Cove with 1.25MB L2 cache (or does this mean it is
    a Golden Cove?)), however, executes this sequence much faster. This microbenchmark finds out performance characteristics about that.
    There are 14 loops, most containing sequences like the one shown
    above, but there are others:

    a                 i                  j                  n                 k-m
    addq %r8,%rax     leaq 1(%rax),%r8   addq $1,%rax       imul %rsi,%rax    imul %rsi,%rax
    addq %rcx,%rax    leaq 1(%r8),%rcx   addq %rax,%r8      addq %rsi,%rax    addq $1,%rax
    addq %rdx,%rax    leaq 1(%rcx),%rdx  addq $1,%rax                         addq $1,%rax
    addq %rsi,%rax    leaq 1(%rdx),%rsi  addq %rax,%rcx                       ...
    addq %rdi,%rax    leaq 1(%rsi),%rax  addq $1,%rax
    cmpq %rax,%r9     cmpq %rax,%rdi     addq %rax,%rdx
    jg 2b             jg 2b              addq $1,%rax
                                         addq %rax,%rsi
                                         addq $1,%rax
                                         addq %rax,%r9
                                         addq $1,%rax
                                         cmpq %rax,%rdi
                                         jg 2b

    * Loop a tests if the speedup also holds if both operands to the
    addition are registers.
    * Loop i tests if it also holds if we use leaq instead of addq, and
    if we use more than one register.
    * Loop j tests if it also holds if every intermediate result is
    actually being used.
    * Loop n measures the latency of a multiply with a register (that
    contains 1), and a register-register add, in preparation for:
    * Loops k-m measure the pure latency of the add $1, %eax sequences
    by masking the ALU resource contention with the latency of an
    imul instruction. I.e., these resources can be consumed for free
    during the 3-cycle latency of imul.

    The results we see are:

    loop cyc/it Mops dep.adds remarks
    a 5.00 6 5 addq reg, %eax
    b 1.01 6 5 addq $1, %rax
    f 1.23 7 6 addq $1, %rax
    c 1.98 11 10 addq $1, %rax
    d 2.00 12 11 addq $1, %rax
    e 2.21 13 12 addq $1, %rax
    g 3.01 18 17 addq $1, %rax
    h 3.25 19 18 addq $1, %rax
    i 1.00 6 5 leaq 1(reg1),reg2
    j 2.00 12 6 addq $1, %rax; addq %rax, reg
    n 4.00 3 1 imul %rsi, %rax; addq %rsi, %rax
    k 3.00 3 1 imul %rsi, %rax; addq $1, %rax
    l 3.01 12 10 imul %rsi, %rax; addq $1, %rax
    m 3.20 17 15 imul %rsi, %rax; addq $1, %rax

    The Mops column counts the addq, cmpq+jg, leaq, and imul as
    Macro-ops.

    * We see from loop a that a general add has a latency of one cycle.
    * But adds with a constant that depend on each other run at 6
    macro-ops per cycle (where cmpq+jg is a macro-op), at least if
    the loop matches the constraints perfectly (loops b,d,g).
    * For slight mismatches (loops f,c,e,h), there is a penalty beyond
    just needing a resource for a cycle.
    * Using leaq (loop i) works just as fast as using addq, and using
    several registers does not slow execution down, either.
    * If every intermediate result is used (loop j), the dependent adds
    are not slowed down to 1 per cycle, but of course the additional
    adds consume resources and that slows the whole loop down to 2
    cycles/iteration.
    * We see a latency of 4 cycles (3 for the imul, 1 for the addq) in
    loop n.
    * Interestingly, if we replace the addq %rsi, %rax with addq $1,
    %rax, the latency goes down to 3 cycles (k). Additional adds
    don't increase the cycles for a loop iteration up to and
    including a sequence of 10 addq $1, %eax (l). After that it rises
    slowly (m).

    What is happening here? I can only guess:

    I believe that the hardware optimizes sequences of constant adds at
    the renaming stage of the microarchitecture. Not later, because
    according to Chips and Cheese Golden Cove (and, by extension, Raptor
    Cove) has only 5 ALUs, and we see >5 addqs per cycle. Not earlier,
    because it works across the zero-cycle store-to-load-forwarding
    (which probably happens not earlier than the renaming stage). The
    renamer already performs move elimination, the constant-add
    elimination could be a generalized form of that.
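    One way to picture that guess: a rename table that carries a small constant
    offset next to each physical-register mapping, so an addq $imm,%reg only
    bumps the offset instead of occupying an ALU. This is just a toy model of
    the guessed mechanism, not a description of Intel's actual renamer:

        #include <array>
        #include <cstdint>

        struct Mapping {
            int     phys   = 0;   // physical register holding the base value
            int64_t offset = 0;   // constant accumulated at rename time
        };

        struct ToyRenamer {
            std::array<Mapping, 16> rat{};  // one entry per architectural register
            int next_phys = 16;

            // addq $imm, %reg: folded into the mapping; no ALU uop is issued.
            void rename_add_imm(int reg, int64_t imm) {
                rat[reg].offset += imm;
            }

            // A consumer that cannot absorb the pending offset forces one real
            // add ("reification"); afterwards the offset is zero again.
            int materialize(int reg) {
                if (rat[reg].offset != 0) {
                    // issue one ALU uop: new_phys = old_phys + offset
                    rat[reg].phys = next_phys++;
                    rat[reg].offset = 0;
                }
                return rat[reg].phys;
            }
        };

    In such a model a chain of addq $1,%rax costs no ALU uops until something
    forces reification, which would at least be consistent with loop l above,
    while the register-register add in loop n cannot be folded and keeps its
    cycle of latency.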

    I don't have a good explanation for the apparent latency of 0 for the
    adds following the imul in loops k and l. I would think that the
    additions accumulated in some state in the renamer have to be reified
    in a physical register at some point, and I would expect that to cost
    one cycle of latency. One theory I considered was that the
    microarchitecture internally has a multiply-add unit, but one would
    expect that to work also for loop n, but there we see an additional
    cycle of latency from the addition.

    How to run this yourself

    Download this directory and do make or possibly something like
    taskset -c 2 make on a Linux system with perf. The calls are scaled
    for 1G iterations (which you can see in the number of branches shown
    in the perf output), so dividing the number of cycles by 1G gives the cycles/iteration.

    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 21 18:12:10 2023
    Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Golden Cove is even more interesting. Apparently it can not only
    perform zero-cycle store-to-load forwarding, it can even perform adds
    that take at most 0.65 cycles (Column H). My guess is that it can
    perform two back-to-back adds in one cycle, and the store-to-load optimization is so good that it does not prevent this combination of
    two adds.

    It cannot do back-to-back adds in general, but it can perform 0-cycle
    (or close to 0) constant adds, as I found out with the microbenchmark
    at <http://www.complang.tuwien.ac.at/anton/additions/>.

    You can find the text portion of this page below; there are a few
    links on the page that are not produced below.

    If you have information on any of the things I don't know, please let
    me know about it.

    - anton

    Dependent Additions Microbenchmark

    Most CPU cores have a latency of one cycle for an integer addition,
    so a sequence like

    addq $1, %rax
    addq $1, %rax
    addq $1, %rax
    addq $1, %rax
    addq $1, %rax

    K9, my P4 lookalike, would perform peephole optimizations (mainly due to
    stack pushes and pops) as it converted x86-64 code into µOps. The above
    would end up as::

    addq $5,%rax

    in µOp cache (which we called a trace cache). We did this particular optimization
    so that::

    push Rax
    push Rbx
    push Rcx

    would get changed from 3 serially dependent stores into 3 parallel stores

    ST Rax,[SP]
    ST Rbx,[SP+4]
    ST Rcx,[SP+8]
    ADD SP,SP,12
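    A toy sketch of that kind of decode-time peephole -- fold a run of
    add-immediates to the same register into one µOp on the way into the trace
    cache (the µOp structure and names here are hypothetical, not K9's actual
    format):

        #include <cstdint>
        #include <vector>

        struct Uop {
            enum Kind { ADD_IMM, OTHER } kind;
            int     reg;   // destination register (only meaningful for ADD_IMM)
            int64_t imm;   // immediate (only meaningful for ADD_IMM)
        };

        // Collapse "addq $a,%r ; addq $b,%r ; ..." into one "addq $(a+b+...),%r".
        std::vector<Uop> peephole(const std::vector<Uop>& in) {
            std::vector<Uop> out;
            for (const Uop& u : in) {
                if (u.kind == Uop::ADD_IMM && !out.empty() &&
                    out.back().kind == Uop::ADD_IMM && out.back().reg == u.reg) {
                    out.back().imm += u.imm;   // merge into the previous µOp
                } else {
                    out.push_back(u);
                }
            }
            return out;
        }

    The push sequence works the same way, with the running SP adjustment folded
    into each store's displacement and emitted once at the end.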

    usually has 5 cycles of total latency. The P-Core of the Core
    i3-1315U (a Raptor Cove with 1.25MB L2 cache (or does this mean it is
    a Golden Cove?)), however, executes this sequence much faster. This microbenchmark finds out performance characteristics about that.
    There are 14 loops, most containing sequences like the one shown
    above, but there are others:

    a                 i                  j                  n                 k-m
    addq %r8,%rax     leaq 1(%rax),%r8   addq $1,%rax       imul %rsi,%rax    imul %rsi,%rax
    addq %rcx,%rax    leaq 1(%r8),%rcx   addq %rax,%r8      addq %rsi,%rax    addq $1,%rax
    addq %rdx,%rax    leaq 1(%rcx),%rdx  addq $1,%rax                         addq $1,%rax
    addq %rsi,%rax    leaq 1(%rdx),%rsi  addq %rax,%rcx                       ...
    addq %rdi,%rax    leaq 1(%rsi),%rax  addq $1,%rax
    cmpq %rax,%r9     cmpq %rax,%rdi     addq %rax,%rdx
    jg 2b             jg 2b              addq $1,%rax
                                         addq %rax,%rsi
                                         addq $1,%rax
                                         addq %rax,%r9
                                         addq $1,%rax
                                         cmpq %rax,%rdi
                                         jg 2b

    * Loop a tests if the speedup also holds if both operands to the
    addition are registers.
    * Loop i tests if it also holds if we use leaq instead of addq, and
    if we use more than one register.
    * Loop j tests if it also holds if every intermediate result is
    actually being used.
    * Loop n measures the latency of a multiply with a register (that
    contains 1), and a register-register add, in preparation for:
    * Loops k-m measure the pure latency of the add $1, %eax sequences
    by masking the ALU resource contention with the latency of an
    imul instruction. I.e., these resources can be consumed for free
    during the 3-cycle latency of imul.

    The results we see are:

    loop cyc/it Mops dep.adds remarks
    a 5.00 6 5 addq reg, %eax
    b 1.01 6 5 addq $1, %rax
    f 1.23 7 6 addq $1, %rax
    c 1.98 11 10 addq $1, %rax
    d 2.00 12 11 addq $1, %rax
    e 2.21 13 12 addq $1, %rax
    g 3.01 18 17 addq $1, %rax
    h 3.25 19 18 addq $1, %rax
    i 1.00 6 5 leaq 1(reg1),reg2
    j 2.00 12 6 addq $1, %rax; addq %rax, reg
    n 4.00 3 1 imul %rsi, %rax; addq %rsi, %rax
    k 3.00 3 1 imul %rsi, %rax; addq $1, %rax
    l 3.01 12 10 imul %rsi, %rax; addq $1, %rax
    m 3.20 17 15 imul %rsi, %rax; addq $1, %rax

    The Mops column counts the addq, cmpq+jg, leaq, and imul as
    Macro-ops.

    * We see from loop a that a general add has a latency of one cycle.
    * But adds with a constant that depend on each other run at 6
    macro-ops per cycle (where cmpq+jg is a macro-op), at least if
    the loop matches the constraints perfectly (loops b,d,g).
    * For slight mismatches (loops f,c,e,h), there is a penalty beyond
    just needing a resource for a cycle.
    * Using leaq (loop i) works just as fast as using addq, and using
    several registers does not slow execution down, either.
    * If every intermediate result is used (loop j), the dependent adds
    are not slowed down to 1 per cycle, but of course the additional
    adds consume resources and that slows the whole loop down to 2
    cycles/iteration.
    * We see a latency of 4 cycles (3 for the imul, 1 for the addq) in
    loop n.
    * Interestingly, if we replace the addq %rsi, %rax with addq $1,
    %rax, the latency goes down to 3 cycles (k). Additional adds
    don't increase the cycles for a loop iteration up to and
    including a sequence of 10 addq $1, %eax (l). After that it rises
    slowly (m).

    What is happening here? I can only guess:

    Peephole optimizations as µOps are combined into groups.

    I believe that the hardware optimizes sequences of constant adds at
    the renaming stage of the microarchitecture. Not later, because
    according to Chips and Cheese Golden Cove (and, by extension, Raptor
    Cove) has only 5 ALUs, and we see >5 addqs per cycle. Not earlier,
    because it works across the zero-cycle store-to-load-forwarding
    (which probably happens not earlier than the renaming stage). The
    renamer already performs move elimination, the constant-add
    elimination could be a generalized form of that.

    I don't have a good explanation for the apparent latency of 0 for the
    adds following the imul in loops k and l. I would think that the
    additions accumulated in some state in the renamer have to be reified
    in a physical register at some point, and I would expect that to cost
    one cycle of latency. One theory I considered was that the
    microarchitecture internally has a multiply-add unit, but one would
    expect that to work also for loop n, but there we see an additional
    cycle of latency from the addition.

    How to run this yourself

    Download this directory and do make or possibly something like
    taskset -c 2 make on a Linux system with perf. The calls are scaled
    for 1G iterations (which you can see in the number of branches shown
    in the perf output), so dividing the number of cycles by 1G gives the cycles/iteration.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Dec 23 10:24:05 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    [Golden/Raptor Cove]
    It cannot do back-to-back adds in general, but it can perform 0-cycle
    (or close to 0) constant adds, as I found out with the microbenchmark
    at <http://www.complang.tuwien.ac.at/anton/additions/>.

    I have now also tested the limits of this hardware optimization. The intermediate sum of the increments must stay within the 13-bit range -4096..4095, or you will see a slowdown.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Chris M. Thomasson on Wed Dec 27 09:53:04 2023
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/17/2023 1:03 AM, aph@littlepinkcloud.invalid wrote:
    Kent Dickey <kegs@provalid.com> wrote:
    What I'm arguing is: the CPU should behave as if
    memory_order_seq_cst is set on all accesses with no special
    trickery.

    What I'm saying is: that isn't sufficient if you are using an
    optimizing compiler. And if you are programming for an optimizing
    compiler you have to follow the rules anyway. And the optimizing
    compiler can reorder stores and loads as much, if not more, than the
    hardware does.

    This acquire/release nonsense is all weakly ordered brain
    damage. The problem is on weakly ordered CPUs, performance
    definitely does matter in terms of getting this stuff right, but
    that's their problem. Being weakly ordered makes them slower when
    they have to execute barriers for correctness, but it's the barriers
    themselves that are the slow down, not ordering the requests
    properly.

    How is that? All the barriers do is enforce the ordering.

    I feel the need to make the point that compiler barriers are different
    than memory barriers...

    Why do you feel the need to do that? Do you believe there is one
    single person reading this who doesn't realize?

    From the point of view of someone writing code in a high-level
    language, for example (Standard) C, there is no difference. Even if
    your machine was fully sequentially consistent, you'd still have to
    insert barriers for the compiler in the same places.
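    A small illustration of that point (sketch only): even on a hypothetical
    sequentially consistent machine, the plain-int spin below is broken because
    the compiler may hoist the load out of the loop; the fix is the same atomic
    annotation you would write for a weakly ordered machine.

        #include <atomic>

        int ready_plain = 0;               // plain int: compiler may cache/hoist the load
        std::atomic<int> ready_atomic{0};  // atomic: the load is redone each iteration

        // may be "optimized" into an endless loop
        void spin_plain()  { while (ready_plain == 0) { } }
        // correct on any machine, SC or weakly ordered
        void spin_atomic() { while (ready_atomic.load(std::memory_order_acquire) == 0) { } }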

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Wed Dec 27 17:21:34 2023
    Chris M. Thomasson wrote:

    On 11/25/2023 3:33 PM, Chris M. Thomasson wrote:
    On 11/25/2023 9:05 AM, Scott Lurndal wrote:


    Clearly 1000 processors beating on the same cache line is
    something to be avoided.   There still seems to be a dearth
    of programmers that understand multithreaded programming.

    1000 processors requires a distributed setup wrt NUMA. For instance, an
    amortized multi-queue that knows about locality tends to work rather
    well. The queue is distributed out where it's unlikely that two
    processors might contend on the same thing in the queue logic.

    If a processor obtains some work from the "queue" API (FIFO is not
    guaranteed here.... ;^o), it's highly likely that it is fairly local
    work, close to "it" wrt the calling processor, wrt the physical layout
    of the arch...
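    A rough sketch of that shape of queue -- one shard per NUMA node, take work
    from the local shard first and only then probe remote ones (class and names
    are hypothetical placeholders, not Chris's actual code):

        #include <cstddef>
        #include <deque>
        #include <mutex>
        #include <optional>
        #include <utility>
        #include <vector>

        struct Work { /* ... */ };

        class NumaMultiQueue {
            struct alignas(64) Shard {   // one shard per node, on its own cache line
                std::mutex m;
                std::deque<Work> q;
            };
            std::vector<Shard> shards_;

        public:
            explicit NumaMultiQueue(std::size_t nodes) : shards_(nodes) {}

            void push(std::size_t node, Work w) {
                std::lock_guard<std::mutex> g(shards_[node].m);
                shards_[node].q.push_back(std::move(w));
            }

            // Prefer work from the caller's own node; only then look remotely.
            // FIFO across the whole structure is deliberately not guaranteed.
            std::optional<Work> pop(std::size_t my_node) {
                for (std::size_t i = 0; i < shards_.size(); ++i) {
                    Shard& s = shards_[(my_node + i) % shards_.size()];
                    std::lock_guard<std::mutex> g(s.m);
                    if (!s.q.empty()) {
                        Work w = std::move(s.q.front());
                        s.q.pop_front();
                        return w;
                    }
                }
                return std::nullopt;
            }
        };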

    When you have 1000 processors, and the number of cycles it takes
    to perform a synchronization is 100 cycles, you need the synchronization
    to hand off enough work so that the winner has enough
    cycles of computation before another synchronization to allow
    all other processors to obtain work.
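    (A rough worked number for that: if a hand-off costs 100 cycles and the
    hand-offs serialize, the 1000th processor waits on the order of
    1000 * 100 = 100,000 cycles, so each grab has to carry at least that many
    cycles of computation for nobody to sit idle.)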

    Hummm, I need to start another thread on something related.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to aph@littlepinkcloud.invalid on Wed Dec 27 17:23:31 2023
    aph@littlepinkcloud.invalid wrote:

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/17/2023 1:03 AM, aph@littlepinkcloud.invalid wrote:
    Kent Dickey <kegs@provalid.com> wrote:
    What I'm arguing is: the CPU should behave as if
    memory_order_seq_cst is set on all accesses with no special
    trickery.

    What I'm saying is: that isn't sufficient if you are using an
    optimizing compiler. And if you are programming for an optimizing
    compiler you have to follow the rules anyway. And the optimizing
    compiler can reorder stores and loads as much, if not more, than the
    hardware does.

    This acquire/release nonsense is all weakly ordered brain
    damage. The problem is on weakly ordered CPUs, performance
    definitely does matter in terms of getting this stuff right, but
    that's their problem. Being weakly ordered makes them slower when
    they have to execute barriers for correctness, but it's the barriers
    themselves that are the slow down, not ordering the requests
    properly.

    How is that? All the barriers do is enforce the ordering.

    I feel the need to make the point that compiler barriers are different
    than memory barriers...

    Why do you feel the need to do that? Do you believe there is one
    single person reading this who doesn't realize?

    There are often academicians who wander by now and then who are not
    up to these fine points.

    From the point of view of someone writing code in a high-level
    language, for example (Standard) C, there is no difference. Even if
    your machine was fully sequentially consistent, you'd still have to
    insert barriers for the compiler in the same places.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Wed Dec 27 17:50:59 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    On 11/25/2023 3:33 PM, Chris M. Thomasson wrote:
    On 11/25/2023 9:05 AM, Scott Lurndal wrote:


    Clearly 1000 processors beating on the same cache line is
    something to be avoided.   There still seems to be a dearth
    of programmers that understand multithreaded programming.

    1000 processors requires a distributed setup wrt NUMA. For instance, an
    amortized multi-queue that knows about locality tends to work rather
    well. The queue is distributed out where it's unlikely that two
    processors might contend on the same thing in the queue logic.

    If a processor obtains some work from the "queue" API (FIFO is not
    guaranteed here.... ;^o), it's highly likely that it is fairly local
    work, close to "it" wrt the calling processor, wrt the physical layout
    of the arch...

    When you have 1000 processors, and the number of cycles it takes

    If you have 1000 processors, you craft your workload to avoid
    synchronization contention to the greatest extent possible. Any
    workload with 1000 threads accessing a single resource is
    fundamentally broken.

    As for 'transactions', the problem with them isn't really hardware
    support, or the size or number of discrete elements in the transaction,
    it's how to leverage that capability from the standard languages like C,
    C++, Rust or Carbon. Nobody programs in assembler any more, even
    on microprocessors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)