On 3/21/2024 2:34 PM, George Musk wrote:
Thought this may be interesting:
https://microarch.club/
https://www.youtube.com/@MicroarchClub/videos
At least sort of interesting...
I guess one of the guys on there did a manycore VLIW architecture with the memory local to each of the cores. Seems like an interesting approach, though I am not sure how well it would work on a general-purpose workload. This is also closer to what I had imagined when I first started working on this stuff, but it has since drifted towards a slightly more conventional design.
But, admittedly, this is for small N: 16/32K of L1 with a shared L2 seemed like a better option than cores with a very large shared L1 cache.
I am not sure that abandoning a global address space is such a great
idea, as a lot of the "merits" can be gained instead by using weak
coherence models (possibly with a shared 256K or 512K or so for each
group of 4 cores, at which point it goes out to a higher latency global
bus). In this case, the division into independent memory regions could
be done in software.
It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into the common superset of both my original design and RV64G doesn't really help matters here).
If going for a more minimal core optimized for perf/area, some stuff
might be dropped. Would likely drop integer and floating-point divide
again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the
lack of a zero register (though, encodings are comparably cheap, as the
internal uArch has a zero register, and effectively treats immediate
values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.
Though, would likely still make a few decisions differently from those
in RISC-V. Things like indexed load/store,
predicated ops (with a
designated flag bit),
and large-immediate encodings,
help enough with performance (relative to cost)
to be worth keeping (though, mostly
because the alternatives are not so good in terms of performance).
On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless it is to be a placeholder for if/when it is not a boat anchor).
again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the lack of a zero register (though, encodings are comparably cheap, ...)
I got an effective zero register without having to waste a register name
to "get it". My 66000 gives you 32 registers of 64-bits each and you can
put any bit pattern in any register and treat it as you like.
Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
available.
I guess offloading this to the compiler can also make sense.
Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as if one had a zero register), and if a program needs to do a NEG or similar, it can load 0 into a register itself.
In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to
zero a register...
Say:
XOR R14, R14, R14 //Designate R14 as pseudo-zero...
...
ADD R14, 0x123, R8 //Load 0x123 into R8
Though, likely still makes sense in this case to provide some
"convenience" instructions.
internal uArch has a zero register, and effectively treats immediate
values as a special register as well, ...). Some of the debate is more
related to the logic cost of dealing with some things in the decoder.
The problem is universal constants. RISCs being notably poor in their
support--however this is better than addressing modes which require
µCode.
Yeah.
I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.
Allows, say:
ADD R4, 0x12345678, R6
To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
LUI X8, 0x12345
ADD X8, X8, 0x678
ADD X12, X10, X8
Though, for jumbo-prefixes, did end up adding a special case in the compiler where it will try to figure out if a constant will be used multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.
It could maybe make sense to have function-scale static-assigned
constants, but have not done so yet.
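The reuse heuristic above can be sketched in a few lines of C. This is an illustrative sketch with invented names, not actual BGBCC code: scan the immediate operands seen in one basic block, and materialize a constant into a register only when it appears more than once; a single use stays as a jumbo-prefix inline immediate.

```c
#include <stdint.h>

/* Count how many times immediate k appears among the n immediate
 * operands collected from one basic block. */
static int count_uses(const int64_t *imms, int n, int64_t k) {
    int c = 0;
    for (int i = 0; i < n; i++)
        if (imms[i] == k)
            c++;
    return c;
}

/* Materialize into a register only when reuse amortizes the load;
 * otherwise leave it as a jumbo-prefix inline immediate. */
static int should_materialize(const int64_t *imms, int n, int64_t k) {
    return count_uses(imms, n, k) >= 2;
}
```

A real compiler would hang this off its basic-block IR rather than a flat array, but the decision rule is the same.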
Though, it appears as if one of the "top contenders" here would be 0,
mostly because things like:
foo->x=0;
And:
bar[i]=0;
Are semi-common, and as-is end up needing to load 0 into a register each
time they appear.
Had already ended up with a similar sort of special case to optimize
"return 0;" and similar, mostly because this was common enough that it
made more sense to have a special case:
BRA .lbl_ret //if function does not end with "return 0;"
.lbl_ret_zero:
MOV 0, R2
.lbl_ret:
... epilog ...
For many functions, which allowed "return 0;" to be emitted as:
BRA .lbl_ret_zero
Rather than:
MOV 0, R2
BRA .lbl_ret
Which on average ended up as a net-win when there are more than around 3
of them per function.
Though, another possibility could be to allow constants to be included
in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).
Though, would likely still make a few decisions differently from those
in RISC-V. Things like indexed load/store,
Absolutely
predicated ops (with a
designated flag bit),
Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}
I have per instruction predication:
CMPxx ...
OP?T //if-true
OP?F //if-false
Or:
OP?T | OP?F //both in parallel, subject to encoding and ISA rules
Performance gains are modest, but still noticeable (part of why
predication ended up as a core ISA feature). Effect on pipeline seems to
be small in its current form (it is handled along with register fetch,
mostly turning non-executed instructions into NOPs during the EX stages).
For the most part, 1-bit seems sufficient.
More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a
stack-machine, but these ideas were mostly dead on arrival).
and large-immediate encodings,
Nothing else is so poorly served in typical ISAs.
Probably true.
help enough with
performance (relative to cost)
+40%
I am mostly seeing around 30% or so, for Doom and similar.
A few other programs still being closer to break-even at present.
Things are a bit more contentious in terms of code density:
With size-minimizing options to GCC:
".text" is slightly larger with BGBCC vs GCC (around 11%);
However, the GCC output has significantly more ".rodata".
A reasonable chunk of the code-size difference could be attributed to
jumbo prefixes making the average instruction size slightly bigger.
More could be possible with more compiler optimization effort.
Currently, a few recent optimization cases are disabled as they seem to
be causing bugs that I haven't figured out yet.
to be worth keeping (though, mostly
because the alternatives are not so good in terms of performance).
Damage to pipeline ability less than -5%.
Yeah.
On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
BGB wrote:
On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless it is to be a placeholder for if/when it is not a boat anchor).
If the boat anchor is a required unit of functionality, and I
believe IDIV and FPDIV is, it should be defined in ISA and if you
can't afford it find some way to trap rapidly so you can fix it up
without excessive overhead. Like a MIPS TLB reload. If you can't
get trap and emulate at sufficient performance, then add the HW to
perform the instruction.
Though, 32-bit ARM managed OK without integer divide.
On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
BGB wrote:
I ended up with jumbo-prefixes. Still not perfect, and not perfectly
orthogonal, but mostly works.
Allows, say:
ADD R4, 0x12345678, R6
To be performed in potentially 1 clock-cycle and with a 64-bit
encoding, which is better than, say:
LUI X8, 0x12345
ADD X8, X8, 0x678
ADD X12, X10, X8
This strategy completely fails when the constant contains more than 32 bits:
FDIV R9,#3.141592653589247,R17
When you have universal constants (including 5-bit immediates), you rarely need a register containing 0.
The jumbo prefixes at least allow for a 64-bit constant load, but as-is
not for 64-bit immediate values to 3RI ops. The latter could be done,
but would require 128-bit fetch and decode, which doesn't seem worth it.
There is also a feature allowing for 57-bit immediate values, but this is optional.
OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.
Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.
Even something like a "LI Xd, Imm17s" instruction, would notably reduce
the number of constants loaded from memory (as GCC seemingly prefers to
use a LHU or LW or similar rather than encode it using LUI+ADD).
I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
them enabled in the CPU core (they involved the non-zero cost of
repacking them into Binary16 in ID1 and then throwing a
Binary16->Binary64 converter into the ID2 stage).
Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
can leverage a more generic Binary16->Binary64 converter path).
For FPU compare with zero, can almost leverage the integer compare ops,
apart from the annoying edge cases of -0.0 and NaN leading to "not
strictly equivalent" behavior (though, an ASM programmer could more
easily get away with this). But, not common enough to justify adding FPU specific ops for this.
Seems that generally 0 still isn't quite common enough to justify giving up one register for variables (or having a designated zero register); but otherwise, there seems to be not much to justify trying to exclude the "implicit zero" ops from the ISA listing.
Though, would likely still make a few decisions differently from
those in RISC-V. Things like indexed load/store,
Absolutely
predicated ops (with a
designated flag bit),
Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}
I have per instruction predication:
CMPxx ...
OP?T //if-true
OP?F //if-false
Or:
OP?T | OP?F //both in parallel, subject to encoding and ISA rules
CMP Rt,Ra,#whatever
PLE Rt,TTTTTEEE
// This begins the then-clause 5Ts -> 5 instructions
OP1
OP2
OP3
OP4
OP5
// this begins the else-clause 3Es -> 3 instructions
OP6
OP7
OP8
// we are now back at the join point.
Notice no internal flow control instructions.
It can be similar in my case, with the ?T / ?F encoding scheme.
While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy
into NOPs than to branch over it.
Did end up leaving struct-copies where sz<=64 as allowed though (where a
64 byte copy at least has the merit of achieving full pipeline
saturation and being roughly break-even with a branch-miss, whereas a
128 byte copy would cost roughly twice as much as a branch miss).
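The sz<=64 cutoff above falls out of a simple cost model. A hedged sketch, where the copy throughput and branch-miss penalty are illustrative assumptions rather than measured BJX2 numbers: sliding over a predicated-off copy costs roughly bytes/throughput cycles of NOPs, so the copy is only allowed under predication when that is no worse than an average branch miss.

```c
/* Assumed numbers for illustration only: a copy moves 16 bytes/cycle
 * through the pipeline, and an average branch miss costs 4 cycles.
 * With these values the cutoff lands exactly at 64 bytes. */
#define COPY_BYTES_PER_CYCLE 16
#define BRANCH_MISS_CYCLES    4

/* Cycles spent sliding over a predicated-off copy of sz bytes. */
static int copy_nop_cycles(int sz) {
    return (sz + COPY_BYTES_PER_CYCLE - 1) / COPY_BYTES_PER_CYCLE;
}

/* Allow if-conversion of the struct copy only when NOP-sliding it
 * is no worse than taking an average branch miss. */
static int allow_predicated_copy(int sz) {
    return copy_nop_cycles(sz) <= BRANCH_MISS_CYCLES;
}
```

Under these assumptions a 64-byte copy (4 cycles of NOPs) breaks even with a miss, and a 128-byte copy costs about twice one, matching the text.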
Performance gains are modest, but still noticeable (part of why
predication ended up as a core ISA feature). Effect on pipeline seems
to be small in its current form (it is handled along with register
fetch, mostly turning non-executed instructions into NOPs during the
EX stages).
The effect is that one uses predication whenever the instructions at the join point will already have been fetched by the time the predicate value for the {then, else} clauses has been determined. The PARSE and DECODE do the flow control without bothering FETCH.
Yeah, though in my pipeline, it is still a tradeoff of the relative cost
of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.
For the most part, 1-bit seems sufficient.
How do you do && and || predication with 1 bit ??
Originally, it didn't.
Now I added some 3R and 3RI CMPxx encodings.
This allows, say:
CMPGT R8, R10, R4
CMPGT R8, R11, R5
TST R4, R5
....
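The CMPGT/CMPGT/TST sequence above can be expressed in C terms: both compares are materialized as 0/1 values and then ANDed, so the '&&' needs no second branch. The operand order of CMPGT (Rm compared greater-than Rn) is an assumption here; the real encoding may read the other way.

```c
/* Branch-free '&&' of two compares against the same source value:
 *   r4 = (r8 > r10);   CMPGT R8, R10, R4
 *   r5 = (r8 > r11);   CMPGT R8, R11, R5
 *   r4 & r5;           TST R4, R5 -> predicate bit
 */
static int and_of_compares(int r8, int r10, int r11) {
    int r4 = (r8 > r10);
    int r5 = (r8 > r11);
    return r4 & r5;
}
```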
On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
Sometimes I see a::
CVTSD R2,#5
Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
in register R2 so it can be accessed as an argument in the subroutine call which is to happen in a few instructions.
I had looked into, say:
FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.
Unpacking imm5fp to Binary16 being, essentially:
aee.fff -> 0.aAAee.fff0000000
OTOH, can note that a majority of typical floating point constants can
be represented exactly in Binary16 (well, excluding "0.1" or similar),
so it works OK as an immediate format.
This allows a single 32-bit op to be used for constant loads (nevermind
if one needs a 96 bit encoding for 0.1, or PI, or ...).
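The expansion above can be written out as bit manipulation. A sketch, assuming (as in ARM's VFP/FMOV imm8 scheme, an assumption on my part) that 'A' stands for the complement of the top exponent bit 'a'; the input is the 6-bit a|ee|fff field shown, and the sign is fixed at 0.

```c
#include <stdint.h>

/* Expand aee.fff -> 0.aAAee.fff0000000 (Binary16: 1 sign, 5 exponent,
 * 10 fraction bits). A is assumed to be the complement of a. */
static uint16_t unpack_imm_fp(unsigned imm6) {
    unsigned a  = (imm6 >> 5) & 1;
    unsigned ee = (imm6 >> 3) & 3;
    unsigned f  =  imm6       & 7;
    unsigned A  = a ^ 1;                              /* assumed: A = !a */
    unsigned exp5 = (a << 4) | (A << 3) | (A << 2) | ee;
    return (uint16_t)((exp5 << 10) | (f << 7));       /* sign = 0 */
}
```

For example, the field 0b011000 expands to 0x3C00 (Binary16 1.0) and 0b100000 to 0x4000 (2.0) under this reading.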
Mostly, a floating point immediate is available from a 32-bit constant container. When accessed in a float calculation it is used as IEEE32; when accessed by a double calculation, IEEE32->IEEE64 promotion is performed in the constant delivery path. So, one can use almost any floating point constant that is representable in float as a double without eating cycles and while saving code footprint.
Don't currently have the encoding space for this.
Could in theory pull off truncated Binary32 as an Imm29s form, but it is not likely worth it. It would also require putting a converter in the ID2 stage, so it is not free.
In this case, the issue is more one of LUT cost to support these cases.
For FPU compare with zero, can almost leverage the integer compare
ops, apart from the annoying edge cases of -0.0 and NaN leading to
"not strictly equivalent" behavior (though, an ASM programmer could
more easily get away with this). But, not common enough to justify
adding FPU specific ops for this.
Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinites become free !!
Having structured a Compare-to-zero circuit based on the fields in double, you can compose the terms to do all signed and unsigned integers and get a gate count; then the number of gates you add to cover all 10 cases of floating point is 12% gate count over the simple integer version. Also note:: this circuit is about 10% of the gate count of an integer adder.
I could add them, but, is it worth it?...
In this case, it is more a question of encoding space than logic cost.
It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance of the integer ops not quite working correctly due to edge cases).
-----------------------
Seems that generally 0 still isn't quite common enough to justify giving up one register for variables (or having a designated zero register); but otherwise, there seems to be not much to justify trying to exclude the "implicit zero" ops from the ISA listing.
It is common enough, but there are lots of ways to get a zero where you want it for a return.
I think the main use case for a zero register is mostly that it allows
using it as a special case for pseudo-ops. I guess, not quite the same
if it is a normal GPR that just so happens to be 0.
Recently ended up fixing a bug where:
y=-x;
Was misbehaving with "unsigned int":
"NEG" produces a value which falls outside of UInt range;
But, "NEG; EXTU.L" is a 2-op sequence.
It had the EXT for SB/UB/SW/UW, but not for UL.
For SL, bare NEG almost works, apart from ((-1)<<31).
Could encode it as:
SUBU.L Zero, Rs, Rn
But, without a zero register, the compiler needs to special-case provision this (or, in theory, add a "NEGU.L" instruction, but it doesn't seem common enough to justify this).
....
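The bug described above can be shown in C terms: on a machine with 64-bit registers, a bare NEG of an unsigned int leaves a value outside the 32-bit range, so a zero-extend (the EXTU.L that was missing) must follow.

```c
#include <stdint.h>

/* Negate an unsigned int held in a 64-bit register:
 *   bare NEG alone gives, e.g., -1 -> 0xFFFFFFFFFFFFFFFF,
 *   which is outside UInt range; the masking step plays the
 *   role of EXTU.L and clamps it back to 0xFFFFFFFF. */
static uint64_t neg_uint_reg(uint32_t x) {
    uint64_t neg = (uint64_t)0 - (uint64_t)x;  /* bare NEG */
    return neg & 0xFFFFFFFFu;                  /* EXTU.L */
}
```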
It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.
Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
Per-instruction predication;
WEX / Bundle encoding;
Jumbo prefixes;
...
But, maybe otherwise could have justified slightly bigger immediate
fields, dunno.
While poking at it, did go and add a check to exclude large
struct-copy operations from predication, as it is slower to turn a
large struct copy into NOPs than to branch over it.
Did end up leaving struct-copies where sz<=64 as allowed though (where
a 64 byte copy at least has the merit of achieving full pipeline
saturation and being roughly break-even with a branch-miss, whereas a
128 byte copy would cost roughly twice as much as a branch miss).
I decided to bite the bullet and have LDM, STM and MM so the compiler does not have to do any analysis. This puts the onus on the memory unit designer to process these at least as fast as a series of LDs and STs. Done right, this saves ~40% of the power of the caches, avoiding ~70% of tag accesses and 90% of TLB accesses. You access the tag only when/after crossing a line boundary, and you access the TLB only after crossing a page boundary.
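A rough model of that claim: a block move needs a tag lookup only when it crosses into a new cache line, not once per element access. With 64-byte lines (an assumption for illustration), a 256-byte move done as 32 separate 8-byte loads would do 32 tag lookups, while a block operation checking tags only at line crossings needs 4.

```c
#include <stdint.h>

/* Number of cache lines (hence tag lookups) touched by a block
 * access of 'bytes' bytes starting at 'addr', with the given line
 * size. Per-element access would instead pay one lookup per load. */
static unsigned tags_touched(uint64_t addr, unsigned bytes, unsigned line) {
    uint64_t first = addr / line;
    uint64_t last  = (addr + bytes - 1) / line;
    return (unsigned)(last - first + 1);
}
```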
OK.
In my case, it was more a case of noting that sliding over, say, 1kB
worth of memory loads/stores, is slower than branching around it.
All I had to do was to make the second predication overwrite the first
predication's mask, and the compiler did the rest.
Not so simple in my case, but the hardware is simpler, since it just
cares about the state of 1 bit (which is explicitly saved/restored along
with the rest of the status-register if an interrupt occurs).
On 3/26/2024 5:27 PM, Michael S wrote:
For slightly less than 20 years ARM managed OK without integer divide. Then in 2004 they added an integer divide instruction in ARMv7 (including the ARMv7-M variant intended for small microcontroller cores like Cortex-M3) and for the following 20 years, instead of merely OK, they are doing great :-)
OK.
I think both modern ARM and AMD Zen went over to "actually fast" integer divide.
I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.
On my BJX2 core, it is currently similar (36 and 68 cycles for divide). This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).
A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case. But, this is a LUT cost tradeoff.
It could be possible in theory to support a general 3-cycle integer
divide, albeit if one can accept inexact results (would be faster than
the software-based lookup table strategy).
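The divide-by-reciprocal idea has a well-known software analogue: multiply by a precomputed fixed-point reciprocal and shift. A sketch using the classic exact magic constants for unsigned 32-bit division by 3 and by 10 (these are the standard textbook/compiler constants, not values taken from the BJX2 core; a hardware fast path would index a small table of such multiplier/shift pairs):

```c
#include <stdint.h>

/* Exact unsigned division by small constants via reciprocal multiply:
 *   0xAAAAAAAB = ceil(2^33 / 3),  result = (n * m) >> 33
 *   0xCCCCCCCD = ceil(2^35 / 10), result = (n * m) >> 35
 * Both are exact for all 32-bit n. */
static uint32_t udiv3(uint32_t n)  { return (uint32_t)(((uint64_t)n * 0xAAAAAAABu) >> 33); }
static uint32_t udiv10(uint32_t n) { return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35); }
```

A single 64-bit multiply plus shift is a few cycles on most pipelines, which is where the 3-cycle special case figure becomes plausible.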
But, it is debatable. Pure minimalism would likely favor leaving out
divide (and a bunch of other stuff). Usual rationale being, say, to try
to fit the entire ISA listing on a single page of paper or similar (vs
having a listing with several hundred defined encodings).
Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
listings that are considerably larger (thousands of encodings).
....
BGB wrote:
On 3/26/2024 5:27 PM, Michael S wrote:
For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)
OK.
The point is they are doing better now after adding IDIV and FDIV.
I think both modern ARM and AMD Zen went over to "actually fast" integer
divide.
I think for a long time, the de-facto integer divide was ~ 36-40 cycles
for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
can get from a shift-add unit.
While those numbers are acceptable for shift-subtract division (including SRT variants).
What I don't get is the reluctance to use the FP multiplier as a fast divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20 cycles ?? and with special casing of leading 1s and 0s average around 10 cycles ???
I submit that at 10 cycles for average latency, the need to invent screwy forms of even faster division falls by the wayside {accurate or not}.
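The FP-multiplier approach can be sketched in software as a Newton-Raphson reciprocal, where each step uses only multiplies and adds and roughly doubles the number of correct bits. The 48/17 - 32/17*m starting estimate is the classic one for m in [0.5,1); this illustrates the technique, and is not the Opteron's actual algorithm (positive, finite inputs assumed).

```c
#include <math.h>

/* Reciprocal via Newton-Raphson: r' = r * (2 - d*r).
 * Starting error is at most 1/17, so four iterations
 * (~0.059 -> 3.5e-3 -> 1.2e-5 -> 1.4e-10 -> ~2e-20)
 * reach full double precision with multiplies only. */
static double recip_nr(double d) {
    int e;
    double m = frexp(d, &e);                   /* d = m * 2^e, m in [0.5, 1) */
    double r = (48.0 / 17.0) - (32.0 / 17.0) * m;
    for (int i = 0; i < 4; i++)
        r = r * (2.0 - m * r);                 /* Newton step */
    return ldexp(r, -e);                       /* 1/d = (1/m) * 2^-e */
}
```

A divide unit built this way pipelines through the existing FP multiplier, which is Mitch's point: the iteration count, not a per-bit shift-subtract loop, sets the latency.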
mitchalsup@aol.com (MitchAlsup1) writes:
What I don't get is the reluctance to use the FP multiplier as a fast divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20 cycles ?? and with special casing of leading 1s and 0s average around 10 cycles ???
Empirically, the ARM Cortex-M7 udiv instruction requires 3+[s/2] cycles (where s is the number of significant digits in the quotient).
https://www.quinapalus.com/cm7cycles.html
I submit that at 10 cycles for average latency, the need to invent screwy forms of even faster division falls by the wayside {accurate or not}.
Scott Lurndal wrote:
Empirically, the ARM Cortex-M7 udiv instruction requires 3+[s/2] cycles (where s is the number of significant digits in the quotient).
https://www.quinapalus.com/cm7cycles.html
That looks a lot like an SRT divisor with early out?
Having variable timing DIV means that for any crypto operation (including hashes?) where you use modulo operations, said modulus _must_ be a known constant, otherwise information about it will leak from the timings, right?
On Thu, 28 Mar 2024 09:31:11 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Having variable timing DIV means that for any crypto operation (including hashes?) where you use modulo operations, said modulus _must_ be a known constant, otherwise information about it will leak from the timings, right?
Are you aware of any professional crypto algorithm, including hashes, that uses modulo operations with a modulus that is neither a power of two nor at least 192 bits wide?
Michael S wrote:
On Thu, 28 Mar 2024 09:31:11 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Are you aware of any professional crypto algorithm, including hashes, that uses modulo operations with a modulus that is neither a power of two nor at least 192 bits wide?
I was involved with the optimization of DFC, the AES candidate from CERN:
It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair), and that resulted in a very simple reciprocal, i.e. no need for a DIV opcode.
Terje
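Terje's "simple reciprocal" observation can be made concrete: since 2^64 ≡ -13 (mod 2^64+13), the high half of a 128-bit value folds down with one small multiply and a few fix-ups, no DIV instruction needed. A sketch assuming gcc/clang's __int128 extension; DFC's real reduction details may differ:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Reduce a 128-bit x modulo p = 2^64 + 13:
 * writing x = hi*2^64 + lo, and using 2^64 == -13 (mod p),
 * x == lo - 13*hi (mod p). The difference may be negative
 * (as low as about -2^68), so add p until non-negative; the
 * result is then already below p since it is below 2^64 or
 * was just brought into [0, p) by the last addition. */
static u128 mod_p(u128 x) {
    const u128 p = ((u128)1 << 64) + 13;
    __int128 r = (__int128)(uint64_t)x
               - (__int128)13 * (uint64_t)(x >> 64);
    while (r < 0)
        r += (__int128)p;   /* bounded: at most ~16 iterations */
    return (u128)r;
}
```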
I have yet to decide on the default bit-depth for UI widgets, mostly
torn between 16-color and 256-color. Probably don't need high bit-depths
for "text and widgets" windows, but may need more than 16-color. Also
TBD which color palette to use in the case if 256 color is used (the