Forum: >>> Magnum BBS <<<

Microarch Club

From George Musk@21:1/5 to All on Thu Mar 21 19:34:50 2024

Thought this may be interesting:
https://microarch.club/
https://www.youtube.com/@MicroarchClub/videos

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB-Alt on Mon Mar 25 22:17:03 2024

BGB-Alt wrote:

On 3/21/2024 2:34 PM, George Musk wrote:

Thought this may be interesting:
https://microarch.club/
https://www.youtube.com/@MicroarchClub/videos

At least sort of interesting...

I guess one of the guys on there did a manycore VLIW architecture with
the memory local to each of the cores. Seems like an interesting
approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first
started working on this stuff, but it had drifted more towards a
slightly more conventional design.

But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
L2, seemed like a better option than cores with a very large shared L1
cache.

You appear to be "starting to get it"; congratulations.

I am not sure that abandoning a global address space is such a great
idea, as a lot of the "merits" can be gained instead by using weak
coherence models (possibly with a shared 256K or 512K or so for each
group of 4 cores, at which point it goes out to a higher latency global
bus). In this case, the division into independent memory regions could
be done in software.

Most of the last 50 years has been towards a single global address space.

It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into
the common superset of both my original design and RV64G, doesn't really
help matters here).

If going for a more minimal core optimized for perf/area, some stuff
might be dropped. Would likely drop integer and floating-point divide

I think this is pound foolish even if penny wise.

again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the
lack of a zero register (though, encodings are comparably cheap, as the

I got an effective zero register without having to waste a register
name to "get it". My 66000 gives you 32 registers of 64-bits each and
you can put any bit pattern in any register and treat it as you like.
Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally available.

internal uArch has a zero register, and effectively treats immediate
values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.

The problem is universal constants. RISCs being notably poor in their support--however this is better than addressing modes which require
µCode.

Though, would likely still make a few decisions differently from those
in RISC-V. Things like indexed load/store,

Absolutely

predicated ops (with a
designated flag bit),

Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}

and large-immediate encodings,

Nothing else is so poorly served in typical ISAs.

help enough with performance (relative to cost)

+40%

to be worth keeping (though, mostly
because the alternatives are not so good in terms of performance).

Damage to pipeline ability less than -5%.


From MitchAlsup1@21:1/5 to BGB on Tue Mar 26 19:16:07 2024

BGB wrote:

On 3/25/2024 5:17 PM, MitchAlsup1 wrote:

BGB-Alt wrote:

Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).

If the boat anchor is a required unit of functionality, and I believe
IDIV and FPDIV are, it should be defined in ISA and if you can't afford
it find some way to trap rapidly so you can fix it up without excessive
overhead. Like a MIPS TLB reload. If you can't get trap and emulate at
sufficient performance, then add the HW to perform the instruction.

again. Might also make sense to add an architectural zero register,
and eliminate some number of encodings which exist merely because of
the lack of a zero register (though, encodings are comparably cheap,
as the

I got an effective zero register without having to waste a register name
to "get it". My 66000 gives you 32 registers of 64-bits each and you can
put any bit pattern in any register and treat it as you like.
Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
available.

I guess offloading this to the compiler can also make sense.

Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and
if a program needs to do a NEG or similar, it can load 0 into a register itself.

In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to
zero a register...

MOV Rd,#imm16

Cost 1 instruction of 32-bits in size and can be performed in 0 cycles

Say:
XOR R14, R14, R14 //Designate R14 as pseudo-zero...
...
ADD R14, 0x123, R8 //Load 0x123 into R8

Though, likely still makes sense in this case to provide some
"convenience" instructions.

internal uArch has a zero register, and effectively treats immediate
values as a special register as well, ...). Some of the debate is more
related to the logic cost of dealing with some things in the decoder.

The problem is universal constants. RISCs being notably poor in their
support--however this is better than addressing modes which require
µCode.

Yeah.

I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.

Allows, say:
ADD R4, 0x12345678, R6

To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
LUI X8, 0x12345
ADD X8, X8, 0x678
ADD X12, X10, X8

This strategy completely fails when the constant contains more than 32-bits

FDIV R9,#3.141592653589793,R17

When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.
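As a rough illustration of the LUI-plus-add sequence quoted above, the sketch below splits a 32-bit constant into a 20-bit upper part and a sign-extended 12-bit lower part, the way a RISC-V style LUI/ADDI pair needs it. The function names are hypothetical, and the model only covers 32-bit values (it ignores the sign extension to 64 bits that LUI performs on RV64).

```python
def split_hi20_lo12(value):
    """Split a 32-bit constant into (hi20, lo12) for a LUI + add-immediate pair.

    The 12-bit immediate is sign-extended by the add, so when bit 11 of the
    constant is set the upper part must be bumped by one to compensate.
    """
    assert 0 <= value < (1 << 32)
    lo12 = value & 0xFFF
    if lo12 >= 0x800:          # the add would sign-extend this to a negative value
        lo12 -= 0x1000
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def rebuild(hi20, lo12):
    """Recombine the two parts the way LUI followed by an add would (mod 2^32)."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF


hi, lo = split_hi20_lo12(0x12345678)
print(hex(hi), hex(lo))        # the 0x12345 / 0x678 pair from the quoted example
```

Anything wider than 32 bits does not fit this two-part split, which is the failure case pointed out above.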

Though, for jumbo-prefixes, did end up adding a special case in the
compile where it will try to figure out if a constant will be used
multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.

This is a delicate balance:: while each use of the constant takes a
unit or 2 of the instruction stream, each use costs 0 more instructions.
The breakeven point in My 66000 is about 4 uses in a small area (loop),
at which point the constant should be hoisted into a register.

It could maybe make sense to have function-scale static-assigned
constants, but have not done so yet.

Though, it appears as if one of the "top contenders" here would be 0,
mostly because things like:
foo->x=0;
And:
bar[i]=0;

I see no need for a zero register:: the following are 1 instruction !

ST #0,[Rfoo,offset(x)]

ST #0,[Rbar,Ri]

Are semi-common, and as-is end up needing to load 0 into a register each
time they appear.

Had already ended up with a similar sort of special case to optimize
"return 0;" and similar, mostly because this was common enough that it
made more sense to have a special case:
BRA .lbl_ret //if function does not end with "return 0;"
.lbl_ret_zero:
MOV 0, R2
.lbl_ret:
... epilog ...

For many functions, which allowed "return 0;" to be emitted as:
BRA .lbl_ret_zero
Rather than:
MOV 0, R2
BRA .lbl_ret
Which on average ended up as a net-win when there are more than around 3
of them per function.

Special defined tails......

Though, another possibility could be to allow constants to be included
in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).

Though, would likely still make a few decisions differently from those
in RISC-V. Things like indexed load/store,

Absolutely

                                           predicated ops (with a
designated flag bit),

Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}

I have per instruction predication:
CMPxx ...
OP?T //if-true
OP?F //if-false
Or:
OP?T | OP?F //both in parallel, subject to encoding and ISA rules

CMP Rt,Ra,#whatever
PLE Rt,TTTTTEEE
// This begins the then-clause 5Ts -> 5 instructions
OP1
OP2
OP3
OP4
OP5
// this begins the else-clause 3Es -> 3 instructions
OP6
OP7
OP8
// we are now back at the join point.

Notice no internal flow control instructions.
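A minimal sketch of how such a then/else shadow mask could be modeled in software. This is only an illustration of the idea, not the My 66000 definition: the function name is hypothetical, the comparison itself is reduced to a boolean, and "executes" just means the instruction name is collected.

```python
def run_predicated(condition, mask, ops):
    """Return the ops that execute under a then/else predicate mask.

    mask is a string of 'T' (then-clause) and 'E' (else-clause) slots, one
    per following instruction. Slots whose clause does not match the
    condition are skipped; no branch is involved.
    """
    assert len(mask) == len(ops)
    assert set(mask) <= {"T", "E"}
    executed = []
    for slot, op in zip(mask, ops):
        if (slot == "T") == condition:
            executed.append(op)
    return executed


ops = ["OP1", "OP2", "OP3", "OP4", "OP5", "OP6", "OP7", "OP8"]
print(run_predicated(True, "TTTTTEEE", ops))   # then-clause only
print(run_predicated(False, "TTTTTEEE", ops))  # else-clause only
```

Note that the ops themselves carry no predicate information in this model; only the mask does.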

Performance gains are modest, but still noticeable (part of why
predication ended up as a core ISA feature). Effect on pipeline seems to
be small in its current form (it is handled along with register fetch,
mostly turning non-executed instructions into NOPs during the EX stages).

The effect is that one uses Predication whenever you will have already
fetched instructions at the join point by the time you have determined
the predicate value {then, else} clauses. The PARSE and DECODE do the
flow control without bothering FETCH.

For the most part, 1-bit seems sufficient.

How do you do && and || predication with 1 bit ??

More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a
stack-machine, but these ideas were mostly dead on arrival).

Also note: the instructions in the then and else clauses know NOTHING
about being under a predicate mask (or not) Thus, they waste no bit
while retaining the ability to run under predication.

                      and large-immediate encodings,

Nothing else is so poorly served in typical ISAs.

Probably true.

                                                     help enough with
performance (relative to cost)

+40%

I am mostly seeing around 30% or so, for Doom and similar.
A few other programs still being closer to break-even at present.

Things are a bit more contentious in terms of code density:
With size-minimizing options to GCC:
".text" is slightly larger with BGBCC vs GCC (around 11%);
However, the GCC output has significantly more ".rodata".

A lot of this .rodata becomes constants in .text with universal constants.

A reasonable chunk of the code-size difference could be attributed to
jumbo prefixes making the average instruction size slightly bigger.

Size is one thing and it primarily diddles in cache footprint statistics.
Instruction count is another and primarily diddles in pipeline cycles
to execute statistics.
Fewer instructions win almost all the time.

More could be possible with more compiler optimization effort.
Currently, a few recent optimization cases are disabled as they seem to
be causing bugs that I haven't figured out yet.

                               to be worth keeping (though, mostly
because the alternatives are not so good in terms of performance).

Damage to pipeline ability less than -5%.

Yeah.


From Michael S@21:1/5 to BGB-Alt on Wed Mar 27 00:27:15 2024

On Tue, 26 Mar 2024 16:59:57 -0500
BGB-Alt <bohannonindustriesllc@gmail.com> wrote:

On 3/26/2024 2:16 PM, MitchAlsup1 wrote:

BGB wrote:

On 3/25/2024 5:17 PM, MitchAlsup1 wrote:

BGB-Alt wrote:

Say, "we have an instruction, but it is a boat anchor" isn't an
ideal situation (unless to be a placeholder for if/when it is not
a boat anchor).

If the boat anchor is a required unit of functionality, and I
believe IDIV and FPDIV are, it should be defined in ISA and if you
can't afford it find some way to trap rapidly so you can fix it up
without excessive overhead. Like a MIPS TLB reload. If you can't
get trap and emulate at sufficient performance, then add the HW to
perform the instruction.

Though, 32-bit ARM managed OK without integer divide.

For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)


From MitchAlsup1@21:1/5 to BGB-Alt on Wed Mar 27 00:02:05 2024

BGB-Alt wrote:

On 3/26/2024 2:16 PM, MitchAlsup1 wrote:

BGB wrote:

I ended up with jumbo-prefixes. Still not perfect, and not perfectly
orthogonal, but mostly works.

Allows, say:
   ADD R4, 0x12345678, R6

To be performed in potentially 1 clock-cycle and with a 64-bit
encoding, which is better than, say:
   LUI X8, 0x12345
   ADD X8, X8, 0x678
   ADD X12, X10, X8

This strategy completely fails when the constant contains more than 32-bits
    FDIV   R9,#3.141592653589793,R17

When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.

The jumbo prefixes at least allow for a 64-bit constant load, but as-is
not for 64-bit immediate values to 3RI ops. The latter could be done,
but would require 128-bit fetch and decode, which doesn't seem worth it.

There is the limbo feature of allowing for 57-bit immediate values, but
this is optional.

OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and an LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory, whereas universal constants do this
in 1 instruction and 2 words in the code stream.

Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.

This is typical when the underlying architecture is not very extensible
to 64-bit virtual address spaces; they have to waste a portion of the
32-bit space to get access to all the 64-bit space. Universal constants
makes this problem vanish.

Even something like a "LI Xd, Imm17s" instruction, would notably reduce
the number of constants loaded from memory (as GCC seemingly prefers to
use a LHU or LW or similar rather than encode it using LUI+ADD).

Reduce when compared to RISC-V but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create
constants, nor does it waste any register (or registers) to hold
use-once constants.

I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
them enabled in the CPU core (they involved the non-zero cost of
repacking them into Binary16 in ID1 and then throwing a
Binary16->Binary64 converter into the ID2 stage).

Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
can leverage a more generic Binary16->Binary64 converter path).

Sometimes I see a::

CVTSD R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in
register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.

Mostly, a floating point immediate is available from a 32-bit constant
container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.
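The IEEE32->IEEE64 promotion mentioned above is exact for every binary32 value, which is why it can sit in a constant delivery path. A small sketch of that widening done purely on bit patterns follows; the function names are hypothetical and this is a software model, not a description of the actual hardware path.

```python
import struct


def promote_f32_bits_to_f64_bits(bits32):
    """Widen a binary32 bit pattern to the binary64 pattern of the same value."""
    sign = (bits32 >> 31) & 1
    exp = (bits32 >> 23) & 0xFF
    frac = bits32 & 0x7FFFFF
    if exp == 0xFF:                      # infinity or NaN: all-ones exponent
        exp64, frac64 = 0x7FF, frac << 29
    elif exp == 0 and frac == 0:         # signed zero
        exp64, frac64 = 0, 0
    elif exp == 0:                       # binary32 subnormal becomes a normal binary64
        shift = 24 - frac.bit_length()   # move the leading 1 up to the hidden-bit position
        exp64 = 1 - shift - 127 + 1023
        frac64 = ((frac << shift) & 0x7FFFFF) << 29
    else:                                # normal number: rebias exponent, pad fraction
        exp64, frac64 = exp - 127 + 1023, frac << 29
    return (sign << 63) | (exp64 << 52) | frac64


def f32_bits(x):
    """Bit pattern of x rounded to binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def f64_from_bits(bits64):
    return struct.unpack("<d", struct.pack("<Q", bits64))[0]


print(f64_from_bits(promote_f32_bits_to_f64_bits(f32_bits(5.0))))
```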

For FPU compare with zero, can almost leverage the integer compare ops,
apart from the annoying edge cases of -0.0 and NaN leading to "not
strictly equivalent" behavior (though, an ASM programmer could more
easily get away with this). But, not common enough to justify adding FPU specific ops for this.

Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!

Having structured a Compare-to-zero circuit based on the fields in double;
You can compose the terms to do all signed and unsigned integers and get
a gate count, then the number of gates you add to cover all 10 cases of floating point is 12% gate count over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.
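To make the -0.0 and NaN edge cases concrete, here is a small sketch that compares a binary64 bit pattern against zero two ways: by its floating-point fields, and as if it were a plain signed 64-bit integer. The function names are hypothetical, and this says nothing about gate counts; it only shows where the two views disagree.

```python
import struct


def double_bits(x):
    """Raw 64-bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def fp_compare_zero(bits):
    """Compare a binary64 pattern with zero using its sign/exponent/fraction fields."""
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp == 0x7FF and frac != 0:
        return "unordered"               # NaN compares unordered with everything
    if exp == 0 and frac == 0:
        return "eq"                      # +0.0 and -0.0 both equal zero
    return "lt" if sign else "gt"        # denormals and infinities fall out here


def int_compare_zero(bits):
    """Compare the same pattern with zero as a signed 64-bit integer."""
    if bits == 0:
        return "eq"
    return "lt" if bits >> 63 else "gt"


for x in (1.5, -2.0, 0.0, -0.0, 5e-324, float("inf"), float("nan")):
    b = double_bits(x)
    print(x, fp_compare_zero(b), int_compare_zero(b))
```

In this model the two views agree for ordinary values, denormals and infinities, and differ only for -0.0 and NaN.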

-----------------------

Seems that generally 0 still isn't quite common enough to justify having
one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.

It is common enough,
But there are lots of ways to get a zero where you want it for a return.

Though, would likely still make a few decisions differently from
those in RISC-V. Things like indexed load/store,

Absolutely

                                           predicated ops (with a
designated flag bit),

Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}

I have per instruction predication:
   CMPxx ...
   OP?T //if-true
   OP?F //if-false
Or:
   OP?T | OP?F //both in parallel, subject to encoding and ISA rules

    CMP Rt,Ra,#whatever
    PLE Rt,TTTTTEEE
    // This begins the then-clause 5Ts -> 5 instructions
    OP1
    OP2
    OP3
    OP4
    OP5
    // this begins the else-clause 3Es -> 3 instructions
    OP6
    OP7
    OP8
    // we are now back at the join point.

Notice no internal flow control instructions.

It can be similar in my case, with the ?T / ?F encoding scheme.

Except you eat that/those bits in OpCode encoding.

While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy
into NOPs than to branch over it.

Did end up leaving struct-copies where sz<=64 as allowed though (where a
64 byte copy at least has the merit of achieving full pipeline
saturation and being roughly break-even with a branch-miss, whereas a
128 byte copy would cost roughly twice as much as a branch miss).

I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a
line boundary and you access TLB only after crossing a page boundary.

Performance gains are modest, but still noticeable (part of why
predication ended up as a core ISA feature). Effect on pipeline seems
to be small in its current form (it is handled along with register
fetch, mostly turning non-executed instructions into NOPs during the
EX stages).

The effect is that one uses Predication whenever you will have already
fetched instructions at the join point by the time you have determined
the predicate value {then, else} clauses. The PARSE and DECODE do the
flow control without bothering FETCH.

Yeah, though in my pipeline, it is still a tradeoff of the relative cost
of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.

For the most part, 1-bit seems sufficient.

How do you do && and || predication with 1 bit ??

Originally, it didn't.
Now I added some 3R and 3RI CMPxx encodings.

This allows, say:
CMPGT R8, R10, R4
CMPGT R8, R11, R5
TST R4, R5
....

All I had to do was to make the second predication overwrite the first predication's mask, and the compiler did the rest.


From MitchAlsup1@21:1/5 to BGB on Wed Mar 27 21:03:52 2024

BGB wrote:

On 3/26/2024 7:02 PM, MitchAlsup1 wrote:

Sometimes I see a::

CVTSD R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
in register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.

I had looked into, say:
FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.

Unpacking imm5fp to Binary16 being, essentially:
aee.fff -> 0.aAAee.fff0000000

realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
used fp constants--but does include 0, 1, 2, and 10. Also, realistically
the missing cases are the 0.5s.

OTOH, can note that a majority of typical floating point constants can
be represented exactly in Binary16 (well, excluding "0.1" or similar),
so it works OK as an immediate format.
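A quick way to check which constants survive a round trip through Binary16 is sketched below, using Python's `struct` half-precision format code. The function name is hypothetical, and this only tests exact representability; it does not model the FLDCH encoding itself.

```python
import struct


def fits_binary16(x):
    """True if x is exactly representable as an IEEE Binary16 value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0] == x
    except OverflowError:
        return False                     # magnitude too large for Binary16


for c in (0.0, 0.5, 1.0, 1.5, 10.0, 100.0, 0.1, 3.141592653589793):
    print(c, fits_binary16(c))
```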

This allows a single 32-bit op to be used for constant loads (nevermind
if one needs a 96 bit encoding for 0.1, or PI, or ...).

Mostly, a floating point immediate is available from a 32-bit constant
container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.

Don't currently have the encoding space for this.

Could in theory pull off truncated Binary32 as an Imm29s form, but not
likely worth it. Would also require putting a converter in the ID2
stage, so not free.

In this case, the issue is more one of LUT cost to support these cases.

For FPU compare with zero, can almost leverage the integer compare
ops, apart from the annoying edge cases of -0.0 and NaN leading to
"not strictly equivalent" behavior (though, an ASM programmer could
more easily get away with this). But, not common enough to justify
adding FPU specific ops for this.

Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!

Having structured a Compare-to-zero circuit based on the fields in double;
You can compose the terms to do all signed and unsigned integers and get
a gate count, then the number of gates you add to cover all 10 cases of
floating point is 12% gate count over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.

I could add them, but, is it worth it?...

Whether to add them or not is on you.
I found things like this to be more straw on the camel's back
{where the camel collapses to a unified register file model.}

In this case, it is more a question of encoding space than logic cost.

It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at
the integer ops not quite working correctly due to edge cases).

My model requires about ½ the instruction count when processing FP
comparisons compared to RISC-V {big, no; around 5% in FP code and 0
elsewhere.} Where it wins big is compare against a non-zero FP constant.
My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC,
LD, FCMP, BC}

-----------------------

Seems that generally 0 still isn't quite common enough to justify
having one register fewer for variables though (or to have a
designated zero register), but otherwise it seems there is not much to
justify trying to exclude the "implicit zero" ops from the ISA listing.

It is common enough,
But there are lots of ways to get a zero where you want it for a return.

I think the main use case for a zero register is mostly that it allows
using it as a special case for pseudo-ops. I guess, not quite the same
if it is a normal GPR that just so happens to be 0.

Recently ended up fixing a bug where:
y=-x;
Was misbehaving with "unsigned int":
"NEG" produces a value which falls outside of UInt range;
But, "NEG; EXTU.L" is a 2-op sequence.
It had the EXT for SB/UB/SW/UW, but not for UL.
For SL, bare NEG almost works, apart from ((-1)<<31).
Could encode it as:
SUBU.L Zero, Rs, Rn

ADD Rn,#0,-Rs

But notice::
y = -x;
a = b + y;
can be performed as if it had been written::
y = -x;
a = b + (-x);
Which is encoded as::

ADD Ry,#0,-Rx
ADD Ra,Rb,-Rx

But, without a zero register,

#0 is not a register, but its value is 0x0000000000000000 anyway.

You missed the point entirely, if you can get easy access to #0
then you no longer need a register to hold this simple bit pattern.
In fact a large portion of My 66000 ISA over RISC-V comes from this
mechanism.

the compiler needs to special-case

The compiler needs easy access to #0 and the compiler needs to know
that #0 exists, but the compiler does not need to know if some register contains that same bit pattern.

provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
seem common enough to justify this).
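The unsigned-negate issue described above can be reproduced with plain integer arithmetic. The sketch below models 64-bit registers as masked Python integers; the helper names are hypothetical and merely mirror the NEG and EXTU.L steps from the post.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def neg64(x):
    """Two's-complement negate in a 64-bit register."""
    return (-x) & MASK64


def zext32(x):
    """Zero-extend the low 32 bits (the EXTU.L step)."""
    return x & MASK32


def sext32(x):
    """Sign-extend the low 32 bits into a 64-bit register pattern."""
    x &= MASK32
    return (x | ~MASK32) & MASK64 if x & 0x80000000 else x


x = 1                                    # an 'unsigned int' held zero-extended
print(hex(neg64(x)))                     # bare NEG: falls outside the 32-bit unsigned range
print(hex(zext32(neg64(x))))             # NEG followed by zero-extension: 0xffffffff
```

The signed case behaves the same way for the single value -2^31: negating its sign-extended form gives +2^31, which no longer fits in a signed 32-bit range until it is re-extended.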

....

It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

I burned 0 per instruction, but you can claim I burned 1 instruction PRED and 6.4 bits of that instruction are used to create masks that project upon up to
8 following instructions.

Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
Per-instruction predication;
WEX / Bundle encoding;
Jumbo prefixes;
...

But, maybe otherwise could have justified slightly bigger immediate
fields, dunno.

While poking at it, did go and add a check to exclude large
struct-copy operations from predication, as it is slower to turn a
large struct copy into NOPs than to branch over it.

Did end up leaving struct-copies where sz<=64 as allowed though (where
a 64 byte copy at least has the merit of achieving full pipeline
saturation and being roughly break-even with a branch-miss, whereas a
128 byte copy would cost roughly twice as much as a branch miss).

I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a
line boundary and you access TLB only after crossing a page boundary.

OK.

In my case, it was more a case of noting that sliding over, say, 1kB
worth of memory loads/stores, is slower than branching around it.

This is why My 66000 predication has use limits. Once you can get where
you want faster with a branch, then a branch is what you should use.
I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
cycle and that the minimum DECODE time is 2 cycles, that Predication
wins when the number of instructions <= FWidth × Dcycles = 8.
Use predication and save cycles by not disrupting the front end.
Use branching and save cycles by disrupting the front end.
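The rule of thumb above reduces to a one-line check. The sketch below just encodes it, assuming one 32-bit instruction per fetched word; the function and parameter names are hypothetical and the defaults are the 4-words-per-cycle and 2-cycle figures quoted in the post.

```python
def predication_limit(fetch_width_words=4, decode_cycles=2):
    """Instructions already fetched by the time the predicate is known."""
    return fetch_width_words * decode_cycles


def prefer_predication(n_instructions, fetch_width_words=4, decode_cycles=2):
    """True if the then+else clauses fit inside the already-fetched window."""
    return n_instructions <= predication_limit(fetch_width_words, decode_cycles)


print(predication_limit())               # 4 words/cycle x 2 cycles
print(prefer_predication(8), prefer_predication(9))
```

A wider fetch or a longer decode would raise the limit in this model, which matches the reasoning that the window depends on what the front end has already fetched.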

All I had to do was to make the second predication overwrite the first
predication's mask, and the compiler did the rest.

Not so simple in my case, but the hardware is simpler, since it just
cares about the state of 1 bit (which is explicitly saved/restored along
with the rest of the status-register if an interrupt occurs).

Simpler than 8-flip flops used as a shift right register ??


From MitchAlsup1@21:1/5 to BGB on Wed Mar 27 21:14:01 2024

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:

For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast" integer divide.

I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division (including
SRT variants).

What I don't get is the reluctance for using the FP multiplier as a fast
divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
and with special casing of leading 1s and 0s average around 10-cycles ???

I submit that at 10-cycles for average latency, the need to invent screwy
forms of even faster division fall by the wayside {accurate or not}.

NOTE well:: The size of the FMUL (or FMAC) unit does increase, but its
increase is less than that of an SRT divisor unit.

On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).

This is because you are using a linear iteration, try using a quadratic convergent iteration instead. OH but you CAN'T because your multiplier
tree does not give accurate lower order bits.

A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case.
But, this is a LUT cost tradeoff.
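For reference, the general divide-by-reciprocal trick for unsigned integers can be sketched as below: precompute a scaled reciprocal for the divisor, then divide with one multiply and one shift. This is the generic textbook construction with hypothetical function names, not a description of how the BJX2 core does it; in hardware the per-divisor constants would come from a lookup table, which is where the LUT cost arises.

```python
def reciprocal_constants(d, nbits=32):
    """Return (m, s) such that (n * m) >> s == n // d for all 0 <= n < 2**nbits."""
    assert d >= 1
    l = (d - 1).bit_length()             # ceil(log2(d))
    s = nbits + l
    m = -(-(1 << s) // d)                # ceil(2**s / d)
    return m, s


def div_by_reciprocal(n, d, nbits=32):
    """Unsigned divide using a multiply and a shift instead of a divider."""
    assert 0 <= n < (1 << nbits)
    m, s = reciprocal_constants(d, nbits)
    return (n * m) >> s


print(div_by_reciprocal(1000, 7), 1000 // 7)
```

With s chosen this way the rounding error of the reciprocal is small enough that the result is exact over the whole nbits range, at the price of a multiplier wider than nbits.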

It could be possible in theory to support a general 3-cycle integer
divide, albeit if one can accept inexact results (would be faster than
the software-based lookup table strategy).

But, it is debatable. Pure minimalism would likely favor leaving out
divide (and a bunch of other stuff). Usual rationale being, say, to try
to fit the entire ISA listing on a single page of paper or similar (vs
having a listing with several hundred defined encodings).

Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
listings that are considerably larger (thousands of encodings).

....


From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 27 21:53:36 2024

mitchalsup@aol.com (MitchAlsup1) writes:

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:

For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast" integer
divide.

I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division (including
SRT variants).

What I don't get is the reluctance for using the FP multiplier as a fast
divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
and with special casing of leading 1s and 0s average around 10-cycles ???

Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
(where s is the number of significant digits in the quotient).

https://www.quinapalus.com/cm7cycles.html

I submit that at 10-cycles for average latency, the need to invent screwy
forms of even faster division fall by the wayside {accurate or not}.


From Michael S@21:1/5 to mitchalsup@aol.com on Thu Mar 28 01:06:05 2024

On Wed, 27 Mar 2024 21:14:01 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

What I don't get is the reluctance for using the FP multiplier as a
fast divisor (IBM 360/91). AMD Opteron used this means to achieve
17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under
20-cycles ?? and with special casing of leading 1s and 0s average
around 10-cycles ???

I submit that at 10-cycles for average latency, the need to invent
screwy forms of even faster division fall by the wayside {accurate or
not}.

All modern performance-oriented cores from Intel, AMD, ARM and Apple
have fast integer dividers and typically even faster FP dividers.
The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake
(launched in 2015) and AMD Zen2 (launched in 2019), but the latter is
slow only in the worst case; the best case is o.k.
I'd guess, when Skylake was designed nobody at Intel could imagine that
it and its variations would be manufactured in huge volumes up to 2021.


From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 27 23:11:34 2024

Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:

For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast" integer
divide.

I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division (including
SRT variants).

What I don't get is the reluctance for using the FP multiplier as a fast
divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
and with special casing of leading 1s and 0s average around 10-cycles ???

Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
(where s is the number of significant digits in the quotient).

I submit that a 5+2×ln8(s) is faster still.
32-bits = 15 cycles <not so much faster>
64-bits = 17 cycles <A lot faster>

{Log base 8, where one uses Newton-Raphson or Goldschmidt to get 8 significant digits (9.2 bits are correct) and double the significant bits each iteration (2-cycles). }

5 comes from looking at numerator and denominator to find the first bit of significance, and then shifting numerator and denominator so that the FDIV algorithm can work.
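
To illustrate the "double the significant bits each iteration" part, here is a minimal Newton-Raphson reciprocal sketch in Python using exact fractions. The 8-bit rounded seed stands in for a lookup table, and the function and parameter names are hypothetical. It counts iterations only and makes no attempt to reproduce the cycle figures quoted above.

```python
from fractions import Fraction

def reciprocal_newton(d, seed_bits=8, target_bits=64):
    """Refine x ~ 1/d for 1 <= d < 2 via x <- x * (2 - d * x).

    Returns (x, iterations). The relative error 1 - d*x is squared on every
    iteration, so the number of correct bits roughly doubles each time.
    """
    d = Fraction(d)
    if not 1 <= d < 2:
        raise ValueError("d must be normalized into [1, 2)")
    scale = 1 << seed_bits
    x = Fraction(round(scale / d), scale)   # seed: 1/d rounded to seed_bits fractional bits
    tolerance = Fraction(1, 1 << target_bits)
    iterations = 0
    while abs(1 - d * x) > tolerance:
        x = x * (2 - d * x)                 # Newton-Raphson step for f(x) = 1/x - d
        iterations += 1
    return x, iterations

for bits in (32, 64):
    _, its = reciprocal_newton(Fraction(3, 2), target_bits=bits)
    print(bits, "bits:", its, "iterations")
```

For d = 3/2 this takes two refinements to reach 32 bits and three to reach 64, because each step squares the relative error left by the seed.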

https://www.quinapalus.com/cm7cycles.html

I submit that at 10-cycles for average latency, the need to invent screwy
forms of even faster division fall by the wayside {accurate or not}.


From Terje Mathisen@21:1/5 to Scott Lurndal on Thu Mar 28 09:31:11 2024

Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:

For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast" integer
divide.

I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division (including
SRT variants).

What I don't get is the reluctance for using the FP multiplier as a fast
divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
and with special casing of leading 1s and 0s average around 10-cycles ???

Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
(where s is the number of significant digits in the quotient).

https://www.quinapalus.com/cm7cycles.html

That looks a lot like an SRT divisor with early out?

Having variable timing DIV means that for any crypto operation (including
hashes?) where you use modulo operations, said modulus _must_ be a known
constant, otherwise information about it will leak from the timings, right?
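
One way to see why a known constant modulus helps: a compiler can replace the divide with a multiply by a precomputed reciprocal and a shift, so no variable-latency DIV is issued at all. Below is a minimal sketch of that standard multiply-shift technique for unsigned 32-bit operands. The function names are hypothetical, and whether the resulting code is actually constant-time depends on the multiplier and the surrounding code, which is assumed here rather than shown.

```python
N = 32  # operand width in bits

def magic_for(d):
    """Return (m, shift) such that x // d == (x * m) >> shift for 0 <= x < 2**N."""
    if not 0 < d < (1 << N):
        raise ValueError("d must be a non-zero N-bit value")
    l = (d - 1).bit_length()           # ceil(log2(d))
    shift = N + l
    m = ((1 << shift) + d - 1) // d    # ceil(2**shift / d); done once, ahead of time
    return m, shift

def mod_const(x, d, m, shift):
    """x % d using one multiply, one shift, one multiply-subtract; no run-time divide."""
    q = (x * m) >> shift               # exact quotient for any N-bit x
    return x - q * d

m, shift = magic_for(641)
print(mod_const(123456789, 641, m, shift), 123456789 % 641)
```

The only division left is the one in `magic_for`, which runs when the constant is known rather than when the secret data is processed.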

I submit that at 10-cycles for average latency, the need to invent screwy
forms of even faster division fall by the wayside {accurate or not}.

I agree.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Scott Lurndal@21:1/5 to Terje Mathisen on Thu Mar 28 14:03:18 2024

Terje Mathisen <terje.mathisen@tmsw.no> writes:

Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:

For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast" integer
divide.

I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division (including
SRT variants).

What I don't get is the reluctance for using the FP multiplier as a fast
divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
and with special casing of leading 1s and 0s average around 10-cycles ???

Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
(where s is the number of significant digits in the quotient).

https://www.quinapalus.com/cm7cycles.html

That looks a lot like an SRT divisor with early out?

Having variable timing DIV means that for any crypto operation (including
hashes?) where you use modulo operations, said modulus _must_ be a known
constant, otherwise information about it will leak from the timings, right?

Perhaps, yet I suspect that the m7 isn't generally used for crypto,
nor used in an SMP implementation where an observer can monitor
a shared cache.


From Michael S@21:1/5 to Terje Mathisen on Thu Mar 28 22:38:41 2024

On Thu, 28 Mar 2024 09:31:11 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:

For slightly less than 20 years ARM managed OK without integer
divide. Then in 2004 they added integer divide instruction in
ARMv7 (including ARMv7-M variant intended for small
microcontroller cores like Cortex-M3) and for the following 20
years instead of merely OK they are doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast"
integer divide.

I think for a long time, the de-facto integer divide was ~ 36-40
cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
on-par with what I can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division
(including SRT variants).

What I don't get is the reluctance for using the FP multiplier as
a fast divisor (IBM 360/91). AMD Opteron used this means to
achieve 17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV
not be under 20-cycles ?? and with special casing of leading 1s
and 0s average around 10-cycles ???

Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
cycles (where s is the number of significant digits in the
quotient).

https://www.quinapalus.com/cm7cycles.html

That looks a lot like an SRT divisor with early out?

Having variable timing DIV means that for any crypto operation (including
hashes?) where you use modulo operations, said modulus _must_ be a
known constant, otherwise information about it will leak from the
timings, right?

Are you aware of any professional crypto algorithm, including hashes,
that uses modulo operations with a modulus that is neither a power of two
nor at least 192 bits wide?


From Terje Mathisen@21:1/5 to Michael S on Fri Mar 29 13:38:55 2024

Michael S wrote:

On Thu, 28 Mar 2024 09:31:11 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:

For slightly less than 20 years ARM managed OK without integer
divide. Then in 2004 they added integer divide instruction in
ARMv7 (including ARMv7-M variant intended for small
microcontroller cores like Cortex-M3) and for the following 20
years instead of merely OK they are doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast"
integer divide.

I think for a long time, the de-facto integer divide was ~ 36-40
cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
on-par with what I can get from a shift-add unit.

While those numbers are acceptable for shift-subtract division
(including SRT variants).

What I don't get is the reluctance for using the FP multiplier as
a fast divisor (IBM 360/91). AMD Opteron used this means to
achieve 17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV
not be under 20-cycles ?? and with special casing of leading 1s
and 0s average around 10-cycles ???

Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
cycles (where s is the number of significant digits in the
quotient).

https://www.quinapalus.com/cm7cycles.html

That looks a lot like an SRT divisor with early out?

Having variable timing DIV means that for any crypto operation (including
hashes?) where you use modulo operations, said modulus _must_ be a
known constant, otherwise information about it will leak from the
timings, right?

Are you aware of any professional crypto algorithm, including hashes,
that uses modulo operations with a modulus that is neither a power of two
nor at least 192 bits wide?

I was involved with the optimization of DFC, the AES candidate from CERN:

It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
and that resulted in a very simple reciprocal, i.e. no need for a DIV
opcode.
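
A minimal sketch of why a modulus of the form 2^64 + 13 needs no DIV: since 2^64 is congruent to -13 modulo it, the high half of a 128-bit value can be folded into the low half with a multiply by 13 and a subtract. This uses the congruence directly rather than the reciprocal mentioned above, it is not taken from the DFC code, and the names are hypothetical.

```python
P = (1 << 64) + 13
MASK64 = (1 << 64) - 1

def mod_p_no_div(x):
    """Reduce 0 <= x < 2**128 modulo 2**64 + 13 without any divide."""
    if not 0 <= x < (1 << 128):
        raise ValueError("x must fit in 128 bits")
    hi, lo = x >> 64, x & MASK64
    # 2**64 == -13 (mod P), so hi * 2**64 + lo == lo - 13 * hi (mod P).
    # Adding 13 * P keeps the intermediate non-negative (and below 15 * 2**64).
    t = lo - 13 * hi + 13 * P
    hi, lo = t >> 64, t & MASK64       # hi is now at most 14
    r = lo - 13 * hi                   # lies in (-183, 2**64)
    if r < 0:
        r += P                         # single correction brings r into [0, P)
    return r

print(mod_p_no_div((1 << 128) - 1), ((1 << 128) - 1) % P)
```

Two folds and one conditional add are enough here because the constant 13 is so small compared with 2^64.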

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Michael S@21:1/5 to Terje Mathisen on Fri Mar 29 16:38:58 2024

On Fri, 29 Mar 2024 13:38:55 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

Michael S wrote:

On Thu, 28 Mar 2024 09:31:11 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

Are you aware of any professional crypto algorithm, including
hashes, that uses modulo operations with a modulus that is neither
a power of two nor at least 192 bits wide?

I was involved with the optimization of DFC, the AES candidate from
CERN:

It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
and that resulted in a very simple reciprocal, i.e. no need for a DIV
opcode.

Terje

Since DFC lost, I suppose that even ignoring reciprocal optimization
the answer to my question is 'No'.


From MitchAlsup1@21:1/5 to BGB-Alt on Fri Apr 5 21:43:08 2024

BGB-Alt wrote:

I have yet to decide on the default bit-depth for UI widgets, mostly
torn between 16-color and 256-color. Probably don't need high bit-depths
for "text and widgets" windows, but may need more than 16-color. Also

When I write documents I have a favorite set of pastel colors I use to
shade boxes, arrows, and text. I draw the figures first and then fill in
the text later. To give the eye an easy means to go from reading the text
to looking at a figure and finding the discussed item, I place a box around
the text and shade the box with the same R-G-B color of the pre.jpg box in
the figure. All figures are *.jpg. So, even a word processor needs 24-bit color.

{{I have also found that Word R-G-B color values are not exactly the same
as the R-G-B values in the *.jpg figure, too; but they are close enough
to avoid "figuring it out and trying to fix it to perfection".}}

TBD which color palette to use in the case if 256 color is used (the

256 is OK enough for DOOM games, but not for professional documentation.

