• Microarch Club

    From George Musk@21:1/5 to All on Thu Mar 21 19:34:50 2024
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB-Alt on Mon Mar 25 22:17:03 2024
    BGB-Alt wrote:

    On 3/21/2024 2:34 PM, George Musk wrote:
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    At least sort of interesting...

    I guess one of the guys on there did a manycore VLIW architecture with
    the memory local to each of the cores. Seems like an interesting
    approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first
    started working on this stuff, but it had drifted more towards a
    slightly more conventional design.


    But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
    L2, seemed like a better option than cores with a very large shared L1
    cache.

    You appear to be "starting to get it"; congratulations.

    I am not sure that abandoning a global address space is such a great
    idea, as a lot of the "merits" can be gained instead by using weak
    coherence models (possibly with a shared 256K or 512K or so for each
    group of 4 cores, at which point it goes out to a higher latency global
    bus). In this case, the division into independent memory regions could
    be done in software.

    Most of the last 50 years has been towards a single global address space.

    It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into
    the common superset of both my original design and RV64G, doesn't really
    help matters here).

    If going for a more minimal core optimized for perf/area, some stuff
    might be dropped. Would likely drop integer and floating-point divide

    I think this is pound foolish even if penny wise.

    again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the
    lack of a zero register (though, encodings are comparably cheap, as the

    I got an effective zero register without having to waste a register
    name to "get it". My 66000 gives you 32 registers of 64-bits each and
    you can put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally available.

    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their support--however this is better than addressing modes which require
    µCode.

    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

    predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}

    and large-immediate encodings,

    Nothing else is so poorly served in typical ISAs.

    help enough with performance (relative to cost)

    +40%

    to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

  • From MitchAlsup1@21:1/5 to BGB on Tue Mar 26 19:16:07 2024
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).

    If the boat anchor is a required unit of functionality, and I believe
    IDIV and FPDIV is, it should be defined in ISA and if you can't afford
    it find some way to trap rapidly so you can fix it up without excessive overhead. Like a MIPS TLB reload. If you can't get trap and emulate at sufficient performance, then add the HW to perform the instruction.

    again. Might also make sense to add an architectural zero register,
    and eliminate some number of encodings which exist merely because of
    the lack of a zero register (though, encodings are comparably cheap,
    as the

    I got an effective zero register without having to waste a register name
    to "get it". My 66000 gives you 32 registers of 64-bits each and you can
    put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
    available.


    I guess offloading this to the compiler can also make sense.

    Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and
    if a program needs to do a NEG or similar, it can load 0 into a register itself.

    In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to
    zero a register...

    MOV Rd,#imm16

    Cost 1 instruction of 32-bits in size and can be performed in 0 cycles

    Say:
    XOR R14, R14, R14 //Designate R14 as pseudo-zero...
    ...
    ADD R14, 0x123, R8 //Load 0x123 into R8

    Though, likely still makes sense in this case to provide some
    "convenience" instructions.


    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more
    related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their
    support--however this is better than addressing modes which require
    µCode.


    Yeah.

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.

    Allows, say:
    ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
    LUI X8, 0x12345
    ADD X8, X8, 0x678
    ADD X12, X10, X8
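
As an aside on the LUI/ADD sequence above: the split is slightly less trivial than it looks when bit 11 of the constant is set, because RISC-V sign-extends the 12-bit add immediate. A minimal Python sketch of the split follows. The function names are illustrative and not taken from any toolchain, and the arithmetic is modelled modulo 2**32 only.

```python
def split_hi_lo(value):
    """Split a 32-bit constant into (hi20, lo12) for a LUI + ADDI pair.

    lo12 is returned as a signed value in [-2048, 2047] because the add
    immediate is sign-extended; hi20 is adjusted upward to compensate.
    """
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:              # would be sign-extended to a negative number
        lo12 -= 0x1000
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def rebuild(hi20, lo12):
    """Recombine the pair the way LUI followed by ADDI would (mod 2**32)."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF
```

For 0x12345678 this yields the (0x12345, 0x678) pair used in the example; for 0x12345FFF the upper part becomes 0x12346 and the low part -1.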

    This strategy completely fails when the constant contains more than 32-bits

    FDIV R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you rarely
    need a register containing 0.

    Though, for jumbo-prefixes, did end up adding a special case in the
    compiler where it will try to figure out if a constant will be used
    multiple times in a basic-block and, if so, will load it into a register
    rather than use a jumbo-prefix form.

    This is a delicate balance:: while each use of the constant takes a
    unit or 2 of the instruction stream, each use costs 0 more instructions.
    The breakeven point in My 66000 is about 4 uses in a small area (loop)
    means that it should be hoisted into a register.
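
The hoisting decision discussed above can be sketched as a simple use count per basic block. This is only an illustration: the function name is hypothetical, the threshold of 4 is the My 66000 estimate quoted in the post, and the actual criterion used by BGBCC is not stated in the thread.

```python
from collections import Counter


def constants_to_hoist(block_constants, threshold=4):
    """Return the constants worth loading into a register once.

    block_constants lists every constant operand appearing in one
    basic block (repeats included). A constant is hoisted when it is
    used at least `threshold` times; otherwise it stays inline.
    """
    counts = Counter(block_constants)
    return {c for c, n in counts.items() if n >= threshold}
```

For example, a block that uses 0x12345678 four times and 7 twice would hoist only 0x12345678 under the default threshold.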

    It could maybe make sense to have function-scale static-assigned
    constants, but have not done so yet.

    Though, it appears as if one of the "top contenders" here would be 0,
    mostly because things like:
    foo->x=0;
    And:
    bar[i]=0;

    I see no need for a zero register:: the following are 1 instruction !

    ST #0,[Rfoo,offset(x)]

    ST #0,[Rbar,Ri]

    Are semi-common, and as-is end up needing to load 0 into a register each
    time they appear.

    Had already ended up with a similar sort of special case to optimize
    "return 0;" and similar, mostly because this was common enough that it
    made more sense to have a special case:
    BRA .lbl_ret //if function does not end with "return 0;"
    .lbl_ret_zero:
    MOV 0, R2
    .lbl_ret:
    ... epilog ...

    For many functions, which allowed "return 0;" to be emitted as:
    BRA .lbl_ret_zero
    Rather than:
    MOV 0, R2
    BRA .lbl_ret
    Which on average ended up as a net-win when there are more than around 3
    of them per function.
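
A rough static instruction count for the two "return 0;" schemes above, as a Python sketch. The model is an assumption made for illustration: it counts emitted instructions only and ignores fall-through, branch cost, and code alignment, so it will not reproduce the measured break-even exactly.

```python
def return_zero_cost(n_returns, ends_with_return_zero):
    """Return (separate, shared) instruction counts for n 'return 0;' sites.

    separate: each site emits MOV 0,R2 and BRA .lbl_ret.
    shared:   each site emits BRA .lbl_ret_zero, plus one MOV 0,R2 at the
              shared label, plus one BRA .lbl_ret to skip that MOV when
              the function does not end with 'return 0;'.
    """
    separate = 2 * n_returns
    shared = n_returns + 1
    if not ends_with_return_zero:
        shared += 1
    return separate, shared
```

Under this simplified model the two schemes tie at two sites and the shared tail is ahead from three sites on, which is in the same neighbourhood as the "around 3" figure reported above.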

    Special defined tails......


    Though, another possibility could be to allow constants to be included
    in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).


    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
    CMPxx ...
    OP?T //if-true
    OP?F //if-false
    Or:
    OP?T | OP?F //both in parallel, subject to encoding and ISA rules

    CMP Rt,Ra,#whatever
    PLE Rt,TTTTTEEE
    // This begins the then-clause 5Ts -> 5 instructions
    OP1
    OP2
    OP3
    OP4
    OP5
    // this begins the else-clause 3Es -> 3 instructions
    OP6
    OP7
    OP8
    // we are now back at the join point.

    Notice no internal flow control instructions.
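
The then/else shadow described above can be modelled in a few lines of Python. This is a behavioural sketch only, assuming the mask semantics as described in the post ('T' entries execute when the condition holds, 'E' entries when it does not); the function name is illustrative and nothing here reflects the actual encoding.

```python
def executed_ops(condition, mask, ops):
    """Return the ops that execute under a then/else shadow mask.

    mask is a string such as "TTTTTEEE", one character per following
    instruction. Ops themselves carry no predicate information.
    """
    if len(mask) != len(ops):
        raise ValueError("mask must cover exactly the predicated ops")
    wanted = "T" if condition else "E"
    return [op for flag, op in zip(mask, ops) if flag == wanted]
```

With the mask TTTTTEEE and OP1 through OP8, a true condition selects OP1..OP5 and a false one selects OP6..OP8.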

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems to
    be small in its current form (it is handled along with register fetch,
    mostly turning non-executed instructions into NOPs during the EX stages).

    The effect is that one uses Predication whenever you will have already
    fetched instructions at the join point by the time you have determined
    the predicate value {then, else} clauses. The PARSE and DECODE do the
    flow control without bothering FETCH.

    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??

    More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a
    stack-machine, but these ideas were mostly dead on arrival).

    Also note: the instructions in the then and else clauses know NOTHING
    about being under a predicate mask (or not). Thus, they waste no bits
    while retaining the ability to run under predication.


                          and large-immediate encodings,

    Nothing else is so poorly served in typical ISAs.


    Probably true.


                                                         help enough with
    performance (relative to cost)

    +40%


    I am mostly seeing around 30% or so, for Doom and similar.
    A few other programs still being closer to break-even at present.


    Things are a bit more contentious in terms of code density:
    With size-minimizing options to GCC:
    ".text" is slightly larger with BGBCC vs GCC (around 11%);
    However, the GCC output has significantly more ".rodata".

    A lot of this .rodata becomes constants in .text with universal constants.

    A reasonable chunk of the code-size difference could be attributed to
    jumbo prefixes making the average instruction size slightly bigger.

    Size is one thing and it primarily diddles in cache footprint statistics.
    Instruction count is another and primarily diddles in pipeline cycles
    to execute statistics.
    Fewer instructions wins almost all the time.

    More could be possible with more compiler optimization effort.
    Currently, a few recent optimization cases are disabled as they seem to
    be causing bugs that I haven't figured out yet.


                                   to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

    Yeah.

  • From Michael S@21:1/5 to BGB-Alt on Wed Mar 27 00:27:15 2024
    On Tue, 26 Mar 2024 16:59:57 -0500
    BGB-Alt <bohannonindustriesllc@gmail.com> wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an
    ideal situation (unless to be a placeholder for if/when it is not
    a boat anchor).

    If the boat anchor is a required unit of functionality, and I
    believe IDIV and FPDIV is, it should be defined in ISA and if you
    can't afford it find some way to trap rapidly so you can fix it up
    without excessive overhead. Like a MIPS TLB reload. If you can't
    get trap and emulate at sufficient performance, then add the HW to
    perform the instruction.

    Though, 32-bit ARM managed OK without integer divide.


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Mar 27 00:02:05 2024
    BGB-Alt wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly
    orthogonal, but mostly works.

    Allows, say:
       ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit
    encoding, which is better than, say:
       LUI X8, 0x12345
       ADD X8, X8, 0x678
       ADD X12, X10, X8

    This strategy completely fails when the constant contains more than 32-bits

        FDIV   R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you rarely
    need a register containing 0.


    The jumbo prefixes at least allow for a 64-bit constant load, but as-is
    not for 64-bit immediate values to 3RI ops. The latter could be done,
    but would require 128-bit fetch and decode, which doesn't seem worth it.

    There is the limbo feature of allowing for 57-bit immediate values, but
    this is optional.


    OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
    Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

    Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
    and a LD to get the value from data memory within ±2GB of IP. This takes
    3 instructions and 2 words in memory, whereas universal constants do this
    in 1 instruction and 2 words in the code stream.

    Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.

    This is typical when the underlying architecture is not very extensible
    to 64-bit virtual address spaces; they have to waste a portion of the
    32-bit space to get access to all the 64-bit space. Universal constants
    makes this problem vanish.

    Even something like a "LI Xd, Imm17s" instruction, would notably reduce
    the number of constants loaded from memory (as GCC seemingly prefers to
    use a LHU or LW or similar rather than encode it using LUI+ADD).

    Reduce when compared to RISC-V but increased when compared to My 66000.
    My 66000 (at the 99% level) uses no instructions to fetch or create
    constants, nor does it waste any register (or registers) to hold
    use-once constants.

    I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
    S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
    them enabled in the CPU core (they involved the non-zero cost of
    repacking them into Binary16 in ID1 and then throwing a
    Binary16->Binary64 converter into the ID2 stage).

    Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
    can leverage a more generic Binary16->Binary64 converter path).

    Sometimes I see a::

    CVTSD R2,#5

    Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in
    register R2 so it can be accessed as an argument in the subroutine call
    to happen in a few instructions.

    Mostly, a floating point immediate is available from a 32-bit constant
    container. When accessed in a float calculation it is used as IEEE32;
    when accessed by a double calculation, IEEE32->IEEE64 promotion is
    performed in the constant delivery path. So, one can use almost any
    floating point constant that is representable in float as a double
    without eating cycles and while saving code footprint.

    For FPU compare with zero, can almost leverage the integer compare ops,
    apart from the annoying edge cases of -0.0 and NaN leading to "not
    strictly equivalent" behavior (though, an ASM programmer could more
    easily get away with this). But, not common enough to justify adding FPU specific ops for this.
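
The -0.0 and NaN edge cases mentioned above are easy to see by reinterpreting a double's bits as a signed 64-bit integer, which is roughly what reusing the integer compare ops amounts to. A small Python sketch follows; the helper names are illustrative, and the NaN used is one particular quiet NaN bit pattern chosen for the example.

```python
import struct

# One positive quiet NaN bit pattern, built explicitly so the sign is known.
QNAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]


def as_int64(x):
    """Reinterpret the bits of an IEEE 754 double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def int_style_is_zero(x):
    """Compare-with-zero as an integer compare op would see it."""
    return as_int64(x) == 0


def int_style_is_negative(x):
    """Less-than-zero as an integer compare op would see it."""
    return as_int64(x) < 0
```

Here -0.0 compares equal to 0.0 as a float but is non-zero and negative in the integer view, and the NaN looks like a large positive integer even though every ordered FP comparison against it is false.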

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
    b) once you are checking denorms for zero, infinities become free !!

    Having structured a Compare-to-zero circuit based on the fields in double;
    You can compose the terms to do all signed and unsigned integers and get
    a gate count, then the number of gates you add to cover all 10 cases of
    floating point is 12% gate count over the simple integer version. Also
    note:: this circuit is about 10% of the gate count of an integer adder.

    -----------------------

    Seems that generally 0 still isn't quite common enough to justify having
    one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.


    Though, would likely still make a few decisions differently from
    those in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
       CMPxx ...
       OP?T  //if-true
       OP?F  //if-false
    Or:
       OP?T | OP?F  //both in parallel, subject to encoding and ISA rules

        CMP  Rt,Ra,#whatever
        PLE  Rt,TTTTTEEE
        // This begins the then-clause 5Ts -> 5 instructions
        OP1
        OP2
        OP3
        OP4
        OP5
        // this begins the else-clause 3Es -> 3 instructions
        OP6
        OP7
        OP8
        // we are now back at the join point.

    Notice no internal flow control instructions.


    It can be similar in my case, with the ?T / ?F encoding scheme.

    Except you eat that/those bits in OpCode encoding.

    While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy
    into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where a
    64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right
    this saves ~40% of the power of the caches avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing a line boundary and you access TLB only after crossing a page boundary.
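
To make the line/page-crossing argument concrete, here is a Python sketch that counts lookups for a contiguous transfer done two ways: one tag and TLB lookup per access, versus one tag lookup per cache line touched and one TLB lookup per page touched. The access, line, and page sizes are assumed defaults chosen for illustration, accesses are assumed aligned, and the resulting ratios are not meant to reproduce the percentages quoted above.

```python
def lookup_counts(start, nbytes, access=8, line=64, page=4096):
    """Count (tag, TLB) lookups for a contiguous transfer.

    Returns a pair of tuples:
      per_access -- one tag and one TLB lookup for every access
      per_block  -- one tag lookup per line touched, one TLB lookup
                    per page touched
    All sizes are in bytes.
    """
    if nbytes <= 0:
        return (0, 0), (0, 0)
    accesses = -(-nbytes // access)            # ceiling division
    last = start + nbytes - 1
    lines = last // line - start // line + 1
    pages = last // page - start // page + 1
    return (accesses, accesses), (lines, pages)
```

For a 256-byte transfer starting at address 0 this gives 32 tag and 32 TLB lookups per access, against 4 tag lookups and 1 TLB lookup when only boundary crossings trigger a lookup.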

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems
    to be small in its current form (it is handled along with register
    fetch, mostly turning non-executed instructions into NOPs during the
    EX stages).

    The effect is that one uses Predication whenever you will have already
    fetched instructions at the join point by the time you have determined
    the predicate value {then, else} clauses. The PARSE and DECODE do the
    flow control without bothering FETCH.


    Yeah, though in my pipeline, it is still a tradeoff of the relative cost
    of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.


    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??


    Originally, it didn't.
    Now I added some 3R and 3RI CMPxx encodings.

    This allows, say:
    CMPGT R8, R10, R4
    CMPGT R8, R11, R5
    TST R4, R5
    ....


    All I had to do was to make the second predication overwrite the first predication's mask, and the compiler did the rest.

  • From MitchAlsup1@21:1/5 to BGB on Wed Mar 27 21:03:52 2024
    BGB wrote:

    On 3/26/2024 7:02 PM, MitchAlsup1 wrote:

    Sometimes I see a::

        CVTSD     R2,#5

    Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
    in register R2 so it can be accessed as an argument in the subroutine call
    to happen in a few instructions.


    I had looked into, say:
    FADD Rm, Imm5fp, Rn
    Where, despite Imm5fp being severely limited, it had an OK hit rate.

    Unpacking imm5fp to Binary16 being, essentially:
    aee.fff -> 0.aAAee.fff0000000

    realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
    used fp constants--but does include 0, 1, 2, and 10. Also, realistically
    the missing cases are the 0.5s.

    OTOH, can note that a majority of typical floating point constants can
    be represented exactly in Binary16 (well, excluding "0.1" or similar),
    so it works OK as an immediate format.

    This allows a single 32-bit op to be used for constant loads (nevermind
    if one needs a 96 bit encoding for 0.1, or PI, or ...).
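
Whether a given constant is exactly representable in Binary16 can be checked with a round trip through Python's struct module, whose "e" format is IEEE 754 half precision. This is a sketch for experimenting with the claim above, not a description of how any compiler makes the decision; the function name is illustrative.

```python
import struct


def fits_binary16(x):
    """True if x survives a round trip through IEEE 754 binary16 unchanged."""
    try:
        packed = struct.pack("<e", x)
    except OverflowError:          # magnitude too large for binary16
        return False
    return struct.unpack("<e", packed)[0] == x
```

Values such as 0.5, 1.0, 10.0 and 31.0 pass the check, while 0.1 and pi do not.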


    Mostly, a floating point immediate is available from a 32-bit constant
    container. When accessed in a float calculation it is used as IEEE32;
    when accessed by a double calculation, IEEE32->IEEE64 promotion is
    performed in the constant delivery path. So, one can use almost any
    floating point constant that is representable in float as a double
    without eating cycles and while saving code footprint.


    Don't currently have the encoding space for this.

    Could in theory pull off truncated Binary32 as an Imm29s form, but not
    likely worth it. Would also require putting a converter in the ID2
    stage, so not free.

    In this case, the issue is more one of LUT cost to support these cases.


    For FPU compare with zero, can almost leverage the integer compare
    ops, apart from the annoying edge cases of -0.0 and NaN leading to
    "not strictly equivalent" behavior (though, an ASM programmer could
    more easily get away with this). But, not common enough to justify
    adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
    b) once you are checking denorms for zero, infinities become free !!

    Having structured a Compare-to-zero circuit based on the fields in double;
    You can compose the terms to do all signed and unsigned integers and get
    a gate count, then the number of gates you add to cover all 10 cases of
    floating point is 12% gate count over the simple integer version. Also
    note:: this circuit is about 10% of the gate count of an integer adder.


    I could add them, but, is it worth it?...

    Whether to add them or not is on you.
    I found things like this to be more straw on the camel's back
    {where the camel collapses to a unified register file model.}

    In this case, it is more a question of encoding space than logic cost.

    It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at
    the integer ops not quite working correctly due to edge cases).

    My model requires about ½ the instruction count when processing FP
    comparisons compared to RISC-V {big, no; around 5% in FP code and 0
    elsewhere.} Where it wins big is compare against a non-zero FP constant.
    My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC,
    LD, FCMP, BC}


    -----------------------

    Seems that generally 0 still isn't quite common enough to justify
    having one register fewer for variables though (or to have a
    designated zero register), but otherwise it seems there is not much to
    justify trying to exclude the "implicit zero" ops from the ISA listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.


    I think the main use case for a zero register is mostly that it allows
    using it as a special case for pseudo-ops. I guess, not quite the same
    if it is a normal GPR that just so happens to be 0.

    Recently ended up fixing a bug where:
    y=-x;
    Was misbehaving with "unsigned int":
    "NEG" produces a value which falls outside of UInt range;
    But, "NEG; EXTU.L" is a 2-op sequence.
    It had the EXT for SB/UB/SW/UW, but not for UL.
    For SL, bare NEG almost works, apart from ((-1)<<31).
    Could encode it as:
    SUBU.L Zero, Rs, Rn
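
The unsigned-negate problem described above can be reproduced with plain integer arithmetic. The Python sketch below models a 64-bit register and a zero-extension of the low 32 bits (the role EXTU.L plays in the post); the helper names are illustrative and this is not BGBCC code.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def neg64(x):
    """Two's-complement negate as performed in a 64-bit register."""
    return (-x) & MASK64


def extu_l(x):
    """Zero-extend the low 32 bits of a register value."""
    return x & MASK32


def neg_u32(x):
    """Negate an unsigned 32-bit value and bring it back into range."""
    return extu_l(neg64(x))
```

Negating 1 in the full register gives 0xFFFFFFFFFFFFFFFF, which is outside the unsigned 32-bit range until the zero-extension is applied. The signed corner case is similar: negating the sign-extended minimum 32-bit integer gives 0x80000000, which no longer fits in a signed 32-bit value.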

    ADD Rn,#0,-Rs

    But notice::
    y = -x;
    a = b + y;
    can be performed as if it had been written::
    y = -x;
    a = b + (-x);
    Which is encoded as::

    ADD Ry,#0,-Rx
    ADD Ra,Rb,-Rx

    But, without a zero register,

    #0 is not a register, but its value is 0x0000000000000000 anyway.

    You missed the point entirely, if you can get easy access to #0
    then you no longer need a register to hold this simple bit pattern.
    In fact a large portion of My 66000 ISA over RISC-V comes from this
    mechanism.

    the compiler needs to special-case

    The compiler needs easy access to #0 and the compiler needs to know
    that #0 exists, but the compiler does not need to know if some register contains that same bit pattern.

    provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
    seem common enough to justify this).

    ....

    It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

    I burned 0 per instruction, but you can claim I burned 1 instruction PRED and 6.4 bits of that instruction are used to create masks that project upon up to
    8 following instructions.

    Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
    Per-instruction predication;
    WEX / Bundle encoding;
    Jumbo prefixes;
    ...

    But, maybe otherwise could have justified slightly bigger immediate
    fields, dunno.


    While poking at it, did go and add a check to exclude large
    struct-copy operations from predication, as it is slower to turn a
    large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where
    a 64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right
    this saves ~40% of the power of the caches avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing a
    line boundary and you access TLB only after crossing a page boundary.

    OK.

    In my case, it was more a case of noting that sliding over, say, 1kB
    worth of memory loads/stores, is slower than branching around it.

    This is why My 66000 predication has use limits. Once you can get where
    you want faster with a branch, then a branch is what you should use.
    I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
    cycle and that the minimum DECODE time is 2 cycles, that Predication
    wins when the number of instructions <= FWidth × Dcycles = 8.
    Use predication and save cycles by not disrupting the front end.
    Use branching and save cycles by disrupting the front end.
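
The rule of thumb above reduces to a single comparison. A Python sketch, using the post's figures (4 words fetched per cycle, 2 decode cycles) as defaults; the function and parameter names are illustrative.

```python
def prefer_predication(n_instructions, fetch_width=4, decode_cycles=2):
    """True when predication is expected to beat a branch.

    The limit is FWidth x Dcycles: the number of instructions already
    fetched by the time the predicate value is known.
    """
    return n_instructions <= fetch_width * decode_cycles
```

With the defaults the limit is 8 instructions across the then and else clauses, matching the 8-entry mask in the PLE example earlier in the thread.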


    All I had to do was to make the second predication overwrite the first
    predication's mask, and the compiler did the rest.

    Not so simple in my case, but the hardware is simpler, since it just
    cares about the state of 1 bit (which is explicitly saved/restored along
    with the rest of the status-register if an interrupt occurs).

    Simpler than 8-flip flops used as a shift right register ??

  • From MitchAlsup1@21:1/5 to BGB on Wed Mar 27 21:14:01 2024
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).
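
Cycle counts in the 36 to 40 and 68 to 72 range are what one would expect from a divider that produces one quotient bit per iteration, presumably with a few cycles of overhead. A minimal Python sketch of restoring (shift-subtract) division follows; it is a textbook formulation for illustration and does not model any particular core.

```python
def shift_subtract_divide(n, d, bits=32):
    """Unsigned restoring division, one quotient bit per iteration.

    Returns (quotient, remainder, iterations).
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    q, r = 0, 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # shift in the next dividend bit
        q <<= 1
        if r >= d:                      # trial subtraction succeeds
            r -= d
            q |= 1
    return q, r, bits
```

The iteration count is fixed by the operand width (32 or 64), independent of the values involved, which is the linear behaviour the faster schemes discussed next try to avoid.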

    What I don't get is the reluctance for using the FP multiplier as a fast
    divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

    NOTE well:: The size of the FMUL (or FMAC) unit does increase, but its
    increase is less than that of an SRT divisor unit.

    On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
    This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).

    This is because you are using a linear iteration, try using a quadratic convergent iteration instead. OH but you CAN'T because your multiplier
    tree does not give accurate lower order bits.
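
To illustrate what "quadratic convergent" buys: a Newton-Raphson reciprocal refinement squares the relative error on every step, so the number of correct bits roughly doubles each time. The Python sketch below uses Newton-Raphson as a stand-in; it is not claimed to be the scheme used in any of the machines mentioned, and it runs in double precision rather than on a multiplier tree.

```python
def newton_reciprocal(d, x0, steps):
    """Refine an estimate x0 of 1/d using x <- x * (2 - d * x).

    Returns (x, errors), where errors[i] is |1 - d*x| after step i+1.
    """
    x = x0
    errors = []
    for _ in range(steps):
        x = x * (2.0 - d * x)
        errors.append(abs(1.0 - d * x))
    return x, errors
```

Starting from 0.3 as an estimate of 1/3 (relative error 0.1), the error after successive steps is about 1e-2, 1e-4 and 1e-8. Each multiply must be accurate to the full working width for this to hold, which is the point made above about the low-order bits of the multiplier.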

    A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case.
    But, this is a LUT cost tradeoff.
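
The divide-by-reciprocal idea for small divisors can be sketched in Python using the standard round-up multiplier construction for unsigned division by a constant. This is a generic technique shown for illustration; the table contents and widths used in the actual core are not given in the post, and the function names are hypothetical.

```python
def reciprocal_for(d, bits=32):
    """Return (multiplier, shift) such that (n * multiplier) >> shift
    equals n // d for every 0 <= n < 2**bits."""
    if d <= 0:
        raise ValueError("divisor must be positive")
    l = (d - 1).bit_length()              # ceil(log2(d))
    shift = bits + l
    multiplier = -(-(1 << shift) // d)    # ceil(2**shift / d)
    return multiplier, shift


def divide_by_reciprocal(n, d, bits=32):
    """Unsigned divide implemented as one multiply and one shift."""
    m, s = reciprocal_for(d, bits)
    return (n * m) >> s
```

The multiplier can need one bit more than the operand width, which is part of the hardware cost of keeping a table of such values for each supported divisor.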

    It could be possible in theory to support a general 3-cycle integer
    divide, albeit if one can accept inexact results (would be faster than
    the software-based lookup table strategy).


    But, it is debatable. Pure minimalism would likely favor leaving out
    divide (and a bunch of other stuff). Usual rationale being, say, to try
    to fit the entire ISA listing on a single page of paper or similar (vs
    having a listing with several hundred defined encodings).


    Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
    listings that are considerably larger (thousands of encodings).

    ....

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 27 21:53:36 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer
    divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast
    divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html
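
    A small model of the empirical formula quoted above, under two loudly
    stated assumptions: "significant digits" is read as significant bits of
    the quotient, and "[s/2]" is read as rounding down. Neither reading is
    confirmed by the post, and the function name is made up for this sketch.

```python
def cm7_udiv_cycles_estimate(n, d):
    """Estimated cycle count 3 + [s/2] for an unsigned divide, where s is
    taken to be the bit length of the quotient (assumption) and [.] is
    taken to be floor (assumption)."""
    if d == 0:
        raise ZeroDivisionError("the timing model does not cover d == 0")
    s = (n // d).bit_length()
    return 3 + s // 2

print(cm7_udiv_cycles_estimate(0xFFFFFFFF, 1))   # largest 32-bit quotient
print(cm7_udiv_cycles_estimate(5, 7))            # quotient of zero
```

    Under this reading the latency depends on the quotient value, so it
    varies with the operands.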


    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Mar 28 01:06:05 2024
    On Wed, 27 Mar 2024 21:14:01 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    What I don't get is the reluctance for using the FP multiplier as a
    fast divisor (IBM 360/91). AMD Opteron used this means to achieve
    17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under
    20-cycles ?? and with special casing of leading 1s and 0s average
    around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent
    screwy forms of even faster division fall by the wayside {accurate or
    not}.


    All modern performance-oriented cores from Intel, AMD, ARM and Apple
    have fast integer dividers and typically even faster FP dividers.
    The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake
    (launched in 2015) and AMD Zen2 (launched in 2019), but the latter is
    slow only in the worst case, the best case is o.k.
    I'd guess, when Skylake was designed nobody at Intel could imagine that
    it and its variations would be manufactured in huge volumes up to 2021.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 27 23:11:34 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer
    divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast
    divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    I submit that a 5+2×ln8(s) is faster still.
    32-bits = 15 cycles <not so much faster>
    64-bits = 17 cycles <A lot faster>

    {Log base 8, where one uses Newton-Raphson or Goldschmidt to get 8
    significant digits (9.2 bits are correct) and double the significant bits
    each iteration (2-cycles). }

    5 comes from looking at numerator and denominator to find the first bit of
    significance, and then shifting numerator and denominator so that the FDIV
    algorithm can work.
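
    The steps described above (find the leading bit, normalize, seed a
    reciprocal, refine with a doubling iteration) can be sketched in software
    as follows. This is a hedged illustration only: the seed table, the
    fixed-point widths, the conservative iteration count and the final
    correction loops are assumptions made so that the sketch is exact, and it
    does not model the cycle counts quoted above.

```python
# Seed table indexed by the top 8 bits of the normalized divisor (128..255).
# Each entry approximates 1/(idx/256) with 8 fractional bits.
SEED = [(1 << 16) // idx for idx in range(128, 256)]

def udiv_newton(n, d, bits=64):
    """Unsigned n // d via normalization plus Newton-Raphson refinement."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    if bits < 8 or not (0 <= n < (1 << bits)) or not (0 < d < (1 << bits)):
        raise ValueError("operands out of range for this sketch")

    # Normalize: shift the divisor so its top bit is set, i.e.
    # dn / 2**bits lies in [1/2, 1).
    shift = bits - d.bit_length()
    dn = d << shift

    # Seed: x ~ 2**bits / dn, held with `frac` fractional bits.
    frac = 2 * bits
    x = SEED[(dn >> (bits - 8)) - 128] << (frac - 8)

    # Refine: x <- x * (2 - dn*x); correct bits roughly double per pass.
    # The seed is counted as 6 good bits to stay on the safe side.
    two = 2 << (bits + frac)
    precision = 6
    while precision < bits + 2:
        x = (x * (two - dn * x)) >> (bits + frac)
        precision *= 2

    # Quotient estimate, then an exact fix-up against the remainder.
    q = (n * x) >> (frac + bits - shift)
    r = n - q * d
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q

print(udiv_newton(10**18, 12345))
```

    The fix-up loops make the result exact regardless of how good the
    estimate is; hardware would size the seed and iteration count so that at
    most a small fixed correction is needed.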

    https://www.quinapalus.com/cm7cycles.html


    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

  • From Terje Mathisen@21:1/5 to Scott Lurndal on Thu Mar 28 09:31:11 2024
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer
    divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast
    divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operation (including
    hashes?) where you use modulo operations, said modulus _must_ be a known
    constant, otherwise information about it will leak from the timings, right?


    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

    I agree.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Mar 28 14:03:18 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer
    divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast
    divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???
    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operation (including
    hashes?) where you use modulo operations, said modulus _must_ be a known
    constant, otherwise information about it will leak from the timings, right?

    Perhaps, yet I suspect that the m7 isn't generally used for crypto,
    nor used in an SMP implementation where an observer can monitor
    a shared cache.

  • From Michael S@21:1/5 to Terje Mathisen on Thu Mar 28 22:38:41 2024
    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less than 20 years ARM managed OK without integer
    divide. Then in 2004 they added integer divide instruction in
    ARMv7 (including ARMv7-M variant intended for small
    microcontroller cores like Cortex-M3) and for the following 20
    years instead of merely OK they are doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast"
    integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40
    cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
    on-par with what I can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division
    (including SRT variants).

    What I don't get is the reluctance for using the FP multiplier as
    a fast divisor (IBM 360/91). AMD Opteron used this means to
    achieve 17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV
    not be under 20-cycles ?? and with special casing of leading 1s
    and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
    cycles (where s is the number of significant digits in the
    quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operation (including
    hashes?) where you use modulo operations, said modulus _must_ be a
    known constant, otherwise information about it will leak from the
    timings, right?

    Are you aware of any professional crypto algorithm, including hashes,
    that uses modulo operations by modulo that is neither power-of-two nor
    at least 192-bit wide?

  • From Terje Mathisen@21:1/5 to Michael S on Fri Mar 29 13:38:55 2024
    Michael S wrote:
    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less than 20 years ARM managed OK without integer
    divide. Then in 2004 they added integer divide instruction in
    ARMv7 (including ARMv7-M variant intended for small
    microcontroller cores like Cortex-M3) and for the following 20
    years instead of merely OK they are doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast"
    integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40
    cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
    on-par with what I can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division
    (including SRT variants).

    What I don't get is the reluctance for using the FP multiplier as
    a fast divisor (IBM 360/91). AMD Opteron used this means to
    achieve 17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV
    not be under 20-cycles ?? and with special casing of leading 1s
    and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
    cycles (where s is the number of significant digits in the
    quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operation (including
    hashes?) where you use modulo operations, said modulus _must_ be a
    known constant, otherwise information about it will leak from the
    timings, right?

    Are you aware of any professional crypto algorithm, including hashes,
    that uses modulo operations by modulo that is neither power-of-two nor
    at least 192-bit wide?

    I was involved with the optimization of DFC, the AES candidate from CERN:

    It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
    and that resulted in a very simple reciprocal, i.e. no need for a DIV
    opcode.
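
    To illustrate why a fixed modulus just above 2^64 needs no DIV opcode,
    here is a minimal sketch that reduces a value below 2^128 modulo
    2^64+13 using only shifts, masks, adds and multiplies by small
    constants. It relies on the identity 2^64 = -13 (mod 2^64+13) and uses
    folding rather than a reciprocal multiply; it is not the DFC reference
    code, and the function name is made up for this example.

```python
P = (1 << 64) + 13          # the modulus as recalled in the post above
MASK64 = (1 << 64) - 1

def mod_2_64_plus_13(x):
    """Return x mod (2**64 + 13) for 0 <= x < 2**128 without dividing."""
    if not 0 <= x < (1 << 128):
        raise ValueError("input out of range for this sketch")
    # First fold: hi*2**64 + lo is congruent to lo - 13*hi.
    r = (x & MASK64) - 13 * (x >> 64)
    # Adding 13*P keeps the residue and makes the value positive.
    r += 13 * P
    # Second fold: the high part is now at most 14.
    r = (r & MASK64) - 13 * (r >> 64)
    # r is in (-183, 2**64); a single conditional add finishes the job.
    if r < 0:
        r += P
    return r

print(mod_2_64_plus_13((1 << 128) - 1))
```

    The final branch would be replaced by a mask-and-add in code that has to
    run in constant time.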

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Fri Mar 29 16:38:58 2024
    On Fri, 29 Mar 2024 13:38:55 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:


    Are you aware of any professional crypto algorithm, including
    hashes, that uses modulo operations by modulo that is neither
    power-of-two nor at least 192-bit wide?

    I was involved with the optimization of DFC, the AES candidate from
    CERN:

    It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
    and that resulted in a very simple reciprocal, i.e. no need for a DIV
    opcode.

    Terje


    Since DFC lost, I suppose that even ignoring reciprocal optimization
    the answer to my question is 'No'.

  • From MitchAlsup1@21:1/5 to BGB-Alt on Fri Apr 5 21:43:08 2024
    BGB-Alt wrote:



    I have yet to decide on the default bit-depth for UI widgets, mostly
    torn between 16-color and 256-color. Probably don't need high bit-depths
    for "text and widgets" windows, but may need more than 16-color. Also

    When I write documents I have a favorite set of pastel colors I use to
    shade boxes, arrows, and text. I draw the figures first and then fill in
    the text later. To give the eye an easy means to go from reading the text
    to looking at a figure and finding the discussed item, I place a box around
    the text and shade the box with the same R-G-B color of the pre.jpg box in
    the figure. All figures are *.jpg. So, even a word processor needs 24-bit
    color.

    {{I have also found that Word R-G-B color values are not exactly the same
    as the R-G-B values in the *.jpg figure, too; but they are close enough
    to avoid "figuring it out and trying to fix it to perfection".}}

    TBD which color palette to use in the case if 256 color is used (the

    256 is OK enough for DOOM games, but not for professional documentation.


