• "Mini" tags to reduce the number of op codes

    From Stephen Fuld@21:1/5 to All on Wed Apr 3 09:43:44 2024
    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it
    has several features that are “friendly” to the idea. Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but perform
    the appropriate conversion on the second operand first, potentially
    saving an instruction
    3. Always do the operation in floating point and convert the integer operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

I suspect option 4 is the least useful choice. I am not sure which of the
others is the best option.

Given that, use the same op code for the floating-point and fixed-point
versions of the same operations. That saves eight op codes: the four
arithmetic operations, max, min, abs, and compare. Subtracting the two
new load instructions, that is a net savings of six op codes.
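A minimal sketch (in Python, purely illustrative; nothing here is My 66000) of the shared-opcode dispatch rule above. Tags are one bit per register: True means floating point, False means integer. For the mixed-tag case, option 2 is shown: source 1's tag selects the operation and source 2 is converted first.

```python
# Sketch of tag-driven opcode sharing. One ADD opcode serves both
# integer and FP; the operand tag bits select the operation.

def execute_add(val1, tag1, val2, tag2):
    """Shared ADD: same tags -> plain add with that tag; mixed tags ->
    option 2, convert operand 2 to the type of operand 1."""
    if tag1 == tag2:
        # Same tags: integer or FP add; result keeps the common tag.
        return val1 + val2, tag1
    # Mixed tags (option 2): source 1's tag wins.
    if tag1:
        return val1 + float(val2), True    # FP add, integer converted
    return val1 + int(val2), False         # integer add, FP truncated

assert execute_add(3, False, 4, False) == (7, False)       # int + int
assert execute_add(1.5, True, 2.25, True) == (3.75, True)  # fp + fp
assert execute_add(1.5, True, 2, False) == (3.5, True)     # mixed: fp wins
```

Under option 1 the mixed-tag branch would instead raise an exception; the same-tag path is identical in every variant.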

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost). The same
    mechanism works for interrupts that take control away from a running
    process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty rare.
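The repurposing trick can be sketched as follows. Both encodings are normally no-ops (OR of a register with itself, OR with zero), which is what frees them up; whether the zero is an immediate or a register is an assumption of this sketch.

```python
# Sketch: repurposing no-op OR encodings to manage tag bits.
# tags is one boolean per register; regs holds 64-bit values.

def execute_or(tags, regs, rd, rs1, src2, src2_is_imm):
    """OR rd, rs1, src2 with the repurposing rule from the post:
    OR r,r,r (register with itself) sets r's FP tag;
    OR r,r,#0 (register with zero) clears it; value is unchanged.
    Any other OR behaves normally and yields an integer-tagged result."""
    if not src2_is_imm and rd == rs1 == src2:
        tags[rd] = True              # OR r,r -> set FP tag
        return
    if src2_is_imm and rd == rs1 and src2 == 0:
        tags[rd] = False             # OR r,#0 -> clear FP tag
        return
    val2 = src2 if src2_is_imm else regs[src2]
    regs[rd] = regs[rs1] | val2
    tags[rd] = False                 # ordinary integer result

regs = [0] * 32
tags = [False] * 32
regs[5] = 0x3FF0000000000000         # bit pattern of FP 1.0
execute_or(tags, regs, 5, 5, 5, False)   # OR r5,r5,r5: mark r5 as FP
assert tags[5] is True
execute_or(tags, regs, 5, 5, 0, True)    # OR r5,r5,#0: back to integer
assert tags[5] is False
```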

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Wed Apr 3 17:24:05 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the
    disassembler). This reminds me of the work on SafeTSA [amme+01] where
    they encode only programs that are correct (according to some notion
    of correctness).

    I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost).

    That's expensive in an OoO CPU. There you want each tag to be stored
    alongside with the other 64 bits of the register, because they should
    be renamed at the same time. So the ENTER instruction would depend on
    all the registers that it saves (or maybe on all registers). And upon
EXIT the restored registers have to be reassembled (which is not that
expensive).

    I have a similar problem for the carry and overflow bits in <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    The same
    mechanism works for interrupts that take control away from a running
    process.

    For context switches one cannot get around the problem, but they are
    much rarer than calls and returns, so requiring a pipeline drain for
    them is not so bad.

    Concerning interrupts, as long as nesting is limited, one could just
    treat the physical registers of the interrupted program as taken, and
    execute the interrupt with the remaining physical registers. No need
    to save any architectural registers or their tag, carry, or overflow
    bits.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use?

In an OoO CPU, that's pretty heavy.

    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT), so you
    probably can handle the tags in the front end (decoder and renamer).
Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.
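Anton's observation can be sketched as a toy front-end model: register tags are a pure function of the decoded opcode stream, so the decoder can track them in a 32-bit vector without waiting on execution results; only EXIT, which reloads the vector from memory, forces a wait. All instruction mnemonics here are illustrative, not actual My 66000 encodings.

```python
# Sketch of decode-time tag tracking. Tags never depend on computed
# data values, so the front end can maintain them speculatively;
# EXIT is the one point that needs data from the load/store unit.

class FrontEnd:
    def __init__(self):
        self.tags = [False] * 32     # tag state kept in the decoder

    def decode(self, opcode, rd=None, memory_tags=None):
        if opcode in ("LDSF", "LDDF"):     # FP loads set the tag
            self.tags[rd] = True
        elif opcode in ("LD32", "LD64"):   # non-FP loads clear it
            self.tags[rd] = False
        elif opcode == "ENTER":
            pass  # tags are only *stored*; nothing to wait on here
        elif opcode == "EXIT":
            # The stall point: the decoder needs the 32 reloaded tag
            # bits before it can steer later shared-opcode instructions.
            self.tags = list(memory_tags)

fe = FrontEnd()
fe.decode("LDDF", rd=3)
assert fe.tags[3] is True
fe.decode("LD64", rd=3)
assert fe.tags[3] is False
fe.decode("EXIT", memory_tags=[True] * 32)
assert fe.tags[7] is True
```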

    @InProceedings{amme+01,
    author = {Wolfram Amme and Niall Dalton and Jeffery von Ronne
    and Michael Franz},
    title = {Safe{TSA}: A Type Safe and Referentially Secure
    Mobile-Code Representation Based on Static Single
    Assignment Form},
    crossref = {sigplan01},
    pages = {137--147},
    annote = {The basic ideas in this representation are:
    variables are named as the pair (distance in the
    dominator tree, assignment within basic block);
    variables are separated by type, with operations
    referring only to variables of the right type (like
    integer and FP instructions and registers in
    assemblers); memory references use types to encode
    that a null-pointer check and/or a range check has
already occurred, allowing optimizing these
    operations; the resulting code is encoded (using
    text compression methods) in a way that supports
    only correct code. These ideas are discussed mostly
    in a general way, with some Java-specifics, but the
    representation supposedly also supports Fortran95
    and Ada95. The representation supports some CSE, but
    not for address computation operations. The paper
    also gives numbers on size (usually a little smaller
    than Java bytecode), and some other static metrics,
    especially wrt. the effect of optimizations.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Stephen Fuld on Wed Apr 3 14:44:27 2024
    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it has several features that are “friendly” to the idea. Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    If you are adding a float/int data type flag you might as well
also add operand size for floats at least, though some ISAs
    have both int32 and int64 ALU operations for result compatibility.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction
    3. Always do the operation in floating point and convert the integer operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice. I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations. So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare. So far, a net
    savings of six opcodes.

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost). The same mechanism works for interrupts that take control away from a running
    process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty
    rare.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it. IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    Currently the opcode data type can tell the uArch how to route
    the operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.
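EricP's routing point can be made concrete with a toy model: today the opcode alone names the functional unit, so reservation-station steering is a decode-time decision; with register tags, a shared opcode cannot be routed until both source tags are known. The function names and FU labels are illustrative only.

```python
# Sketch contrasting opcode-based routing (decided at decode) with
# tag-dependent routing (blocked until both operand tags resolve).

def route_static(opcode):
    """Status quo: the opcode statically names the FU."""
    return {"ADD": "ALU", "FADD": "FPU", "FMUL": "FPU"}[opcode]

def route_tagged(opcode, tag1, tag2):
    """Mini-tags: a shared ADD routes on operand tags, which may be
    unknown (None) until earlier instructions deliver them."""
    if tag1 is None or tag2 is None:
        return None                      # cannot schedule yet
    return "FPU" if (tag1 and tag2) else "ALU"

assert route_static("FADD") == "FPU"
assert route_tagged("ADD", True, True) == "FPU"
assert route_tagged("ADD", False, False) == "ALU"
assert route_tagged("ADD", True, None) is None   # stalled on unknown tag
```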

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Wed Apr 3 20:02:25 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
minor opcode for arithmetic instructions. Compare with https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    What is _really_ eating up opcode space are many- (usually 16-) bit
    constants in the instructions.
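Thomas's bit budget works out as follows (numbers from his post for a typical 32-bit RISC, not any particular ISA):

```python
# Bit-budget arithmetic for a 32-bit, three-register RISC instruction.
WORD = 32
MAJOR_OPCODE = 6
REG_FIELD = 5                       # 32 registers -> 5 bits each
THREE_REGS = 3 * REG_FIELD          # 15 bits of register specifiers

minor_opcode = WORD - MAJOR_OPCODE - THREE_REGS
assert minor_opcode == 11           # 2**11 = 2048 minor opcodes per major

# By contrast, a 16-bit immediate plus two registers leaves nothing
# beyond the major opcode itself -- that is what eats the space:
remaining = WORD - 16 - 2 * REG_FIELD
assert remaining == 6
```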

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 3 21:30:02 2024
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes.  One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use.  Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register
    specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
    66000
has several features that are “friendly” to the idea.  Second, I know
Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If
    set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register.  Non-floating-point loads would clear the
    tag bit.  As I show below, I don’t think you need any special "store
    tag" instructions.

What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

3.    Always do the operation in floating point and convert the integer
operand prior to the operation.  (Or, if you prefer, change floating
    point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions.  And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts.  Given the tag bit, these could share the same
    op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if FP
and Integers share a register file the lack of prototype is much less
stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost).  The same
    mechanism works for interrupts that take control away from a running
    process.

    Yes, but we do just fine without the tag and without the stuff mentioned
    above. Neither ENTER nor EXIT care about the 64-bit pattern in the register.

    I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some other
instructions to do this, without requiring another op-code.   For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it.  These should be pretty
    rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it? 

    No.

    To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use? 

The problem is you have made DECODE dependent on dynamic pipeline information.
    I suggest you don't want to do that. Consider a change from int to FP instruction
    as a predicated instruction, so the pipeline cannot DECODE the instruction at hand until the predicate resolves. Yech.

    If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
    think the overall gains are pretty small, and probably not worth it
unless the op-code space is really tight (which, for My 66000 it isn’t).
    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    It is actually an interesting idea if you want to limit your architecture
    to 1-wide.



    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
    00: Object Reference
    Next 2 bits:
    00: Pointer (with type-tag)
    01: ?
    1z: Bounded Array
    01: Fixnum (route to ALU)
    10: Flonum (route to FPU)
    11: Other types
    00: Smaller value types
    Say: int/uint, short/ushort, ...
    ...

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    More importantly, you added a cycle AFTER register READ/Forward before
    you can start executing (more when OoO is in use).

    And finally, the compiler KNOWS what the type is at compile time.
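BGB's proposed tag layout above can be written out as a classifier. The field widths and type names follow his outline; the function name and value representation are made up for illustration.

```python
# Sketch of BGB's register tag layout: the top 2 bits of a 64-bit
# register select the major type, with 2 more bits of sub-type for
# object references and "other" values.

def classify(reg):
    """Classify a 64-bit register value by its top tag bits."""
    top2 = (reg >> 62) & 0b11
    if top2 == 0b00:                     # object reference
        next2 = (reg >> 60) & 0b11
        if next2 == 0b00:
            return "pointer"             # pointer with type-tag
        if next2 == 0b01:
            return "reserved"            # the '?' entry in the outline
        return "bounded-array"           # 1z
    if top2 == 0b01:
        return "fixnum"                  # route to ALU
    if top2 == 0b10:
        return "flonum"                  # route to FPU
    next2 = (reg >> 60) & 0b11           # 11: other types
    if next2 == 0b00:
        return "small-value"             # int/uint, short/ushort, ...
    return "other"

assert classify(0b01 << 62) == "fixnum"
assert classify(0b10 << 62) == "flonum"
assert classify(0) == "pointer"
```

This also makes Mitch's objection below concrete: every value in the example surrenders at least the top 2 bits of the register to the tag.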

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 3 21:53:26 2024
    BGB-Alt wrote:


    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
    00: Object Reference
    Next 2 bits:
    00: Pointer (with type-tag)
    01: ?
    1z: Bounded Array
    01: Fixnum (route to ALU)
    10: Flonum (route to FPU)
    11: Other types
    00: Smaller value types
    Say: int/uint, short/ushort, ...
    ...

    So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
    This solves nobody's problems; not even LISP.

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

Not good. But what if you don't know the tag until the register is delivered
from a latent FU: do you stall DECODE, or do you launch and make the
instruction queue element deal with all outcomes?

    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied up
    with the comparably expensive register-forwarding logic), but asking for
    3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues.

    Real-friggen-ely

Or, possibly, the FUs decide
    whether to accept the operation:
    ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

What if IMUL is performed in FMAC, IDIV in FDIV, ... Int<->FP routing is
based on calculation capability. (Even the CDC 6600 performed int × in the
FP × unit; that is not in Thornton's book, but came via a conversation with
a 6600 logic designer at Asilomar some time ago. All they had to do to get
FP × to perform int × was disable one gate.)

    But, a proper dynamic language allows mixing fixnum and flonum with the result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    Many other cases could get hairy, but to have any real benefit, the CPU
    would need to be able to deal with them. In cases where the compiler
    deals with everything, the type-tags become mostly moot (or potentially detrimental).

    You are arguing that the added complexity would somehow pay for itself.
    I can't see it paying for itself.

    But, then, there is another issue:
    C code expects C type semantics to be respected, say:
    Signed int overflow wraps at 32 bits (sign extending);
    maybe
    Unsigned int overflow wraps at 32 bits (zero extending);
    maybe
    Variables may not hold values out-of-range for that type;
    LLVM does this GCC does not.
    The 'long long' and 'unsigned long long' types are exactly 64-bit;
    At least 64-bit not exactly.
    ...
    ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.

    Ya think ?

    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

I give up--not worth trying to teach cosmologists why the color of the
    lipstick going on the pig is not the problem.....

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Apr 3 23:20:46 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB-Alt wrote:



    But, a proper dynamic language allows mixing fixnum and flonum with the
    result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the
    Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    The Burroughs B3500 would simply ignore the zone digit when adding
    a string to an integer, based on the address controller for the
    operand.

    ADD 1225 010000(UN) 020000(UA) 030000(UN)

    Would add the 12 unsigned numeric nibbles at address 10000
    to the 25 numeric digits of the 8-bit EBCDIC/ASCII data at address 20000
    and store the result as 25 numeric nibbles at address 30000.

    ADD 0507 010000(UN) 020000(UN) 030000(UA)

    Would add the 5 unsigned numeric nibbles at 10000 to
    the 7 unsigned numeric nibbles at 20000 and store them
    as 8-bit EBCDIC bytes at 30000 (inserting the zone digit @F@
    before each numeric nibble). A processor mode toggle selected
    whether the inserted zone digit should be @F@ (EBCDIC) or @3@ (ASCII).

    Likewise for SUB, INC, DEC, MPY, DIV and data movement instructions.

    The data movement instructions would left- or right-align the destination
    field (MVN (move numeric) would right justify and MVA (move alphanumeric) would left justify) when the destination and source field lengths differ.

    Floating point was BCD with an exponent sign digit, two exponent digits,
    a mantissa sign digit and a variable length mantissa of up
    to 100 digits in length. The integer instructions could be used
    on either the mantissa or exponent individually, as they were
    just fields in memory.
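The zone-ignoring behavior Scott describes can be modeled in a few lines. The helpers and memory layout are illustrative only; the zone digit (F for EBCDIC, 3 for ASCII) follows his description of the processor mode toggle.

```python
# Sketch of B3500-style mixed-format decimal addition: unsigned-numeric
# (UN) fields hold one digit per nibble; zoned (UA) fields hold one
# digit per byte, and the zone digit is simply ignored on input.

def digits_from_un(nibbles):
    """UN field: already one decimal digit per nibble."""
    return nibbles

def digits_from_ua(data):
    """UA field: keep the low nibble of each byte; the zone is ignored."""
    return [b & 0x0F for b in data]

def zoned(digits, ascii_mode=False):
    """Store digits as zoned bytes, inserting zone F (EBCDIC) or
    3 (ASCII) before each numeric nibble, per the mode toggle."""
    zone = 0x30 if ascii_mode else 0xF0
    return [zone | d for d in digits]

# ADD: UN 123 + zoned EBCDIC '456' -> 579
a = digits_from_un([1, 2, 3])
b = digits_from_ua([0xF4, 0xF5, 0xF6])
total = int("".join(map(str, a))) + int("".join(map(str, b)))
assert total == 579
# Stored back as zoned ASCII bytes ('5','7','9'):
assert zoned([5, 7, 9], ascii_mode=True) == [0x35, 0x37, 0x39]
```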

  • From Terje Mathisen@21:1/5 to All on Thu Apr 4 10:32:48 2024
    MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 66000 as an example “substrate” for two reasons.  First, it
    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.
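    The quoted tag scheme can be sketched as a toy register model. Everything
    here is invented for illustration (class and method names, the dispatch in
    Python); a real core would simply steer the operation to an FADD or ADD
    unit based on the tag bits.

    ```python
    # Toy model of the per-register FP tag: loads set or clear the tag,
    # and a single ADD opcode dispatches on the source tags.
    import struct

    class TaggedRegs:
        def __init__(self):
            self.val = [0] * 32          # raw 64-bit register contents
            self.fp = [False] * 32       # the 32 tag bits

        def load_int(self, rd, bits):
            self.val[rd] = bits
            self.fp[rd] = False          # non-FP loads clear the tag

        def load_fp(self, rd, bits):
            self.val[rd] = bits
            self.fp[rd] = True           # FP loads set the tag

        def add(self, rd, rs1, rs2):
            # both tags set -> FP add on the bit patterns; else integer add
            if self.fp[rs1] and self.fp[rs2]:
                a = struct.unpack("<d", self.val[rs1].to_bytes(8, "little"))[0]
                b = struct.unpack("<d", self.val[rs2].to_bytes(8, "little"))[0]
                self.val[rd] = int.from_bytes(struct.pack("<d", a + b), "little")
                self.fp[rd] = True
            else:
                self.val[rd] = (self.val[rs1] + self.val[rs2]) & (2**64 - 1)
                self.fp[rd] = False
    ```

    Note how the model also exposes Mitch's objection below: with only raw bit
    patterns and tags, reinterpreting an FP pattern as an integer needs some
    extra mechanism.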

    What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa?

    This is why, if you want to copy Mill, you have to do it properly:

    Mill does NOT care about the type of data loaded into a particular belt
    slot, only the size and if it is a scalar or a vector filling up the
    full belt slot. In either case you will also have marker bits for
    special types like None and NaR.

    So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
    last being the same as the scalar anyway).

    Only load ops and explicit widening/narrowing ops sets the size tag
    bits, from that point any op where it makes sense will do the right
    thing for either a scalar or a short vector, so you can add 16+16 8-bit
    vars with the same ADD encoding as you would use for a single 64-bit ADD.
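    The point about one ADD encoding serving both scalar and short-vector
    operands can be sketched like this. This is a model, not Mill code: the
    tag is an invented `(element_bits, count)` pair, standing in for the size
    metadata that only loads and widening/narrowing ops may set.

    ```python
    def tagged_add(a, b, tag):
        # tag carries the element size/count established at load time;
        # the ADD encoding itself says nothing about operand width
        bits, count = tag
        mask = (1 << bits) - 1
        out = 0
        for i in range(count):
            x = (a >> (i * bits)) & mask
            y = (b >> (i * bits)) & mask
            out |= ((x + y) & mask) << (i * bits)   # lanes wrap independently
        return out
    ```

    The same call does a single 64-bit add under tag `(64, 1)` and a packed
    byte-wise add under tag `(8, 8)`.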

    We do NOT make any attempt to interpret the actual bit patterns stored
    within each belt slot; that is up to the instructions. This means that
    there is no difference between loading a float or an int32_t, and it also
    means that it is perfectly legal (and supported) to use bit operations
    on an FP variable. This can be very useful, not just to fake exact
    arithmetic by splitting a double into two 26-bit mantissa parts.
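    The splitting trick mentioned here is classically done with plain FP
    arithmetic (Veltkamp splitting); a sketch, with the function name invented:

    ```python
    def veltkamp_split(a):
        # Split a double into hi + lo, where each part carries at most
        # ~26 significant bits, so hi*hi, hi*lo and lo*lo are all exact
        # in double precision (the basis of Dekker's exact products).
        c = 134217729.0 * a        # multiplier is 2**27 + 1
        hi = c - (c - a)
        lo = a - hi
        return hi, lo
    ```

    The decomposition is exact: `hi + lo` reconstructs `a` bit-for-bit.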

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Thu Apr 4 16:47:44 2024
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill project?

  • From Terje Mathisen@21:1/5 to Michael S on Thu Apr 4 21:13:21 2024
    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill project?

    I am much less active than I used to be, but I still get the weekly conf
    call invites and respond to any interesting subject on our mailing list.

    So, yes, I do consider myself to still be involved.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Thu Apr 4 22:25:30 2024
    On Thu, 4 Apr 2024 21:13:21 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?
    I am much less active than I used to be, but I still get the weekly
    conf call invites and respond to any interesting subject on our
    mailing list.

    So, yes, I do consider myself to still be involved.

    Terje


    Thank you

  • From MitchAlsup1@21:1/5 to BGB-Alt on Fri Apr 5 01:48:33 2024
    BGB-Alt wrote:

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check
    Enforce" mode, which would still need a bit of work to implement, and is
    not likely to be terribly useful.

    A while back (and maybe in the future) My 66000 had what I called the
    Foreign Access Mode. When the HoB of the pointer was set, the first
    entry in the translation table was a 4-doubleword structure: a Root
    pointer, the Lowest addressable Byte, the Highest addressable Byte,
    and a DW of access rights, permissions, ... While sort-of like a capability,
    I don't think it was close enough to actually be a capability or used as
    one.

    So, it fell out of favor, and it was not clear how it fit into the HyperVisor/SuperVisor model, either.

    The most complicated and expensive parts are that it will require
    implicit register and memory tagging (to flag capabilities). Though, the
    cheaper option is simply to not enable it, in which case things behave
    as before, with the new functionality essentially being a NOP. Much of
    the work still needed on this would be
    getting the 128-bit ABI working, and adding some new tweaks to the ABI
    to play well with the capability addressing (effectively it requires
    partly reworking how global variables are accessed).


    The type-tagging scheme used in my case is very similar to that used in
    my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
    off of an early version of the BGBScript VM, and effectively using a lax hybrid typesystem masquerading as C). Though, it has long since moved to
    a more proper C style typesystem, with dynamic types more as an optional extension.

    In general, any time one needs to change the type you waste an instruction compared to typeless registers.

  • From John Savard@21:1/5 to All on Thu Apr 4 21:13:13 2024
    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of
    computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Fri Apr 5 21:34:16 2024
    John Savard wrote:

    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    Most programs I write use bytes (mostly unsigned), a few halfwords (mostly
    signed), a useful count of integers (both signed and unsigned--mainly as
    already defined arguments/returns), and a vast majority of doublewords (invariably unsigned).

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard

  • From John Savard@21:1/5 to All on Sat Apr 6 21:30:47 2024
    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.
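    The packing arithmetic above checks out, with the four leftover bits
    serving as the per-block format header in the 36- and 28-bit cases:

    ```python
    BLOCK = 256                 # one fetch block, in bits

    assert 8 * 32 == BLOCK      # eight 32-bit instructions, no header
    assert 7 * 36 + 4 == BLOCK  # seven 36-bit slots + 4-bit header
    assert 9 * 28 + 4 == BLOCK  # nine 28-bit slots + 4-bit header
    ```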

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    John Savard

  • From Thomas Koenig@21:1/5 to John Savard on Sun Apr 7 21:01:15 2024
    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

  • From MitchAlsup1@21:1/5 to John Savard on Sun Apr 7 20:41:45 2024
    John Savard wrote:

    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    I agree with the arithmetic going into this statement. What I don't
    have sufficient data concerning is "whether these extra formats pay
    for themselves". For example, how many of the 36-bit encodings are
    irredundant with the 32-bit ones, and so on with the 28-bit ones.

    Take::

    ADD R7,R7,#1

    I suspect there is a 28-bit form, a 32-bit form, and a 36-bit form
    for this semantic step, that you pay for multiple times in decoding
    and possibly pipelining. {{There may also be other encodings for
    this; such as:: INC R7}}

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    Agreed.............

    John Savard

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun Apr 7 21:22:50 2024
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??
    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes. Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.
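    The encoding budget in the paragraph above, written out (a hypothetical
    field layout, matching the numbers in the post):

    ```python
    insn_bits = 21
    reg_field = 5                        # 5 bits addresses 32 registers
    three_operands = 3 * reg_field       # Rd, Rs1, Rs2 = 15 bits

    opcode_bits = insn_bits - three_operands
    assert opcode_bits == 6              # only 6 bits left for the opcode
    assert 2 ** opcode_bits == 64        # 64 three-register opcodes total

    # float/double x signed/unsigned quarters that space:
    assert 64 // 4 == 16                 # 16 distinct operations per variant
    ```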

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy, try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Apr 8 06:21:43 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes.

    There could have been a case for adding this (maybe just for
    a few frequent ones: "add r1,r2,r3", "add r1,r2,-r3", "add
    r1,r2,#num" and "add r1,r2,#-num"), but I did not pursue that
    further.

    I looked at load and store instructions with short offsets
    (these would then have been scaled), and short branches. But
    the 21-bit opcode space filled up really, really rapidly.

    Also, it is easy to synthesize a 3-register operation from
    a 2-register operation and a memory move. If the decoder is
    set up for 42 bits anyway, instruction fusion is also a possibility.
    This got a bit weird.

    Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.

    For somebody who does Fortran, I find the frequency of floating
    point instructions surprisingly low, even in Fortran code.

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy, try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density, one would have to have things
    like memory operands (with the associated danger that compilers
    will not put intermediate results in registers, but since they have
    been optimized for x86 for decades, they are probably better now)
    and load with update, which would then have to be cracked
    into two micro-ops. Not sure about the benefit.

  • From Anton Ertl@21:1/5 to Thomas Koenig on Mon Apr 8 07:16:08 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    But beating RISC-V is easy, try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density

    Note that in recent times Mitch Alsup is writing not about code
    density (static code size or dynamically executed bytes), but about
    instruction counts. It's unclear why instruction count would be a
    primary metric, except that he thinks that he can score points for My
    66000 with it. As VAX demonstrates, you can produce an instruction
    set with low instruction counts that is bad at the metrics that really
    count: cycles for executing the program (for a given CPU chip area in
    a given manufacturing process), and, for very small systems, static
    code size.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Savard@21:1/5 to All on Mon Apr 8 07:05:35 2024
    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    Yes, that's a problem. Presumably, I would have to do without
    immediates.

    An option would be to reserve some 16-bit codes to indicate a block
    consisting of one 28-bit instruction and seven 32-bit instructions,
    but that means a third instruction set.

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    By using 36-bit instructions instead of 28-bit instructions.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit instruction by the addresses of the first seven 32-bit positions in
    the block.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.
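    Savard's convention can be sketched as a decode helper. The names are
    invented, and the mode is assumed to have been read from the block's
    first four bits:

    ```python
    def target_to_slot(byte_addr, mode_bits):
        # Round the branch target down to its 256-bit (32-byte) block,
        # then map the offset within the block to an instruction slot.
        block = byte_addr & ~31
        offset = byte_addr - block
        if mode_bits == 36:
            # seven slots, addressed via the first seven 32-bit positions
            slot = offset // 4
        else:
            # 28-bit mode: first eight slots at the 32-bit positions,
            # the ninth addressed via the block's trailing 16 bits
            slot = 8 if offset >= 30 else offset // 4
        return block, slot
    ```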

    John Savard

  • From Thomas Koenig@21:1/5 to John Savard on Mon Apr 8 17:25:38 2024
    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?

  • From MitchAlsup1@21:1/5 to John Savard on Mon Apr 8 19:56:27 2024
    John Savard wrote:

    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit instruction by the addresses of the first seven 32-bit positions in
    the block.

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256 bits, looks at the first 4 bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.

    John Savard

  • From Thomas Koenig@21:1/5 to All on Tue Apr 9 18:24:55 2024
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.

  • From MitchAlsup1@21:1/5 to BGB on Tue Apr 9 21:05:44 2024
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with 250+
    local variables to make effective use of this, *, which probably isn't
    going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of
    GPRs AND you have good access to constants.

    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice that of the number of CPU registers. If more local variables than this, then spill/fill rate goes
    up significantly, and if less, then the registers aren't utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is instead
    that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number
    of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Where, function categories:
    Tiny Leaf:
    Everything fits in scratch registers, no stack frame, no calls.
    Leaf:
    No function calls (either explicit or implicit);
    Will have a stack frame.
    Non-Leaf:
    May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.

    There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee-save registers, so (like the "tiny leaf" category) it requires that the
    number of local variables be less than the number of available registers.

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    The apparent number of registers goes up when one does not waste a register
    to hold a use-once constant.

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 10 00:28:02 2024
    BGB-Alt wrote:

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
    density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind of
    a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for variables in basic blocks which end in a function call (in my compiler design, both function calls and branches terminate the current basic block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.

    Yes, x86 and any architecture with LD-OPs (IBM 360, S.E.L., Interdata, ...) acts as if it has 4-6 more registers than it really has. x86
    with 16 GPRs acts like a RISC with 20-24 GPRs, as does the 360. This does not really
    take the place of universal constants, but goes a long way.


    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases when
    not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single instruction.

    Did end up with an intermediate "memcpy slide", which can handle medium
    size memcpy and memset style operations by branching into a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire system sees only the before or only the after state and nothing in between. This
    means one can start (queue up) a SATA disk access without obtaining a lock
    to the device--simply because one can fill in all the data of a command in
    a single instruction which smells ATOMIC to all interested 3rd parties.

    As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.

    On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    The data back in the R2000-3000 days indicated that 32 GPRs has a 15%+ advantage over 16 GPRs; while 64 had only a 3% advantage.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with large numbers of local variables (some exceeding 100 local variables).

    Partly, though, this is because code that is highly inlined and unrolled
    and uses lots of variables tends to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
    There is no frame pointer, as BGBCC doesn't use one;

    Can't do PASCAL and other ALGOL-derived languages with block structure.

    All stack-frames are fixed size, VLA's and alloca use the heap;

    longjmp() is at a serious disadvantage here.
    destructors are sometimes hard to position on the stack.

    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
    TLS, accessed via TBR.

    Try/throw/catch:
    Mostly N/A for leaf functions.

    Any function that can "throw" is in effect no longer a leaf function. Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?

    Need for GBR save/restore effectively excludes a function from being tiny-leaf. This may happen, say, if a function accesses global variables
    and may be called as a function pointer.

    ------------------------------------------------------

    One "TODO" here would be to merge constants with the same "actual" value
    into the same register. At present, they will be duplicated if the types
    are sufficiently different (such as integer 0 vs NULL).
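    The merging being described amounts to keying the constant pool by raw bit pattern rather than by (type, value), so integer 0 and NULL share one register. A minimal sketch, with hypothetical names and limits (this is not BGBCC's actual code):

```c
#include <stdint.h>

/* Constant-pool entries keyed by raw 64-bit pattern only; the type of
 * the constant is deliberately ignored, so 0 and NULL merge. */
#define POOL_MAX 64

typedef struct {
    uint64_t bits;   /* raw value, type ignored */
    int      reg;    /* register statically assigned to it */
} ConstEntry;

static ConstEntry pool[POOL_MAX];
static int pool_count = 0;

/* Return the register holding this constant, allocating one if new.
 * 'next_reg' stands in for the register allocator. */
int const_reg(uint64_t bits, int *next_reg)
{
    for (int i = 0; i < pool_count; i++)
        if (pool[i].bits == bits)
            return pool[i].reg;          /* merged: same bits, same reg */
    pool[pool_count].bits = bits;
    pool[pool_count].reg  = (*next_reg)++;
    return pool[pool_count++].reg;
}
```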

    In practice, the upper 48 bits of an extern variable's address are completely shared, whereas the lower 16 bits are unique.

    For functions with dynamic assignment, immediate values are more likely
    to be used. If the code-generator were clever, potentially it could
    exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate. Currently,
    BGBCC is not that clever.

    And then there are languages like PL/1 and FORTRAN where the compiler
    has to figure out how big an intermediate array is, allocate it, perform
    the math, and then deallocate it.

    Or, say:
    y=x+31; //31 only being used here, and fits easily in an Imm9.
    Ideally, the compiler could realize that 31 does not need a register here.
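    The check being described could look something like the following; the Imm9 range and helper names are assumptions for illustration, not BGBCC internals:

```c
#include <stdint.h>
#include <stdbool.h>

/* Does a constant fit a signed 9-bit immediate field (the Imm9 of the
 * y = x + 31 example)? */
bool fits_imm9(int64_t v)
{
    return v >= -256 && v <= 255;
}

/* Hypothetical helper: a constant needs a register only if some using
 * instruction cannot encode it directly. 'uses_allow_imm[i]' says
 * whether use i has an immediate form available. */
bool const_needs_register(int64_t v, const bool *uses_allow_imm, int n_uses)
{
    for (int i = 0; i < n_uses; i++)
        if (!uses_allow_imm[i] || !fits_imm9(v))
            return true;
    return false;
}
```

If every use passes, the register allocator can simply skip the constant.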


    Well, and another weakness is with temporaries that exist as function arguments:
    If statically assigned, the "target variable directly to argument register" optimization can't be used (the value ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the compiler breaks...).

    Though, I guess possible could be that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid
    statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Brian's compiler finds the largest argument list and the largest return
    value list and merges them into a single area on the stack used only
    for passing arguments and results across the call interface. And the
    <static> SP points at this area.

    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...


    More orthogonality helps.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Apr 10 17:29:22 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

  • From MitchAlsup1@21:1/5 to BGB on Wed Apr 10 17:12:47 2024
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??

    And, if one wanted a 16-bit branch:
    MOV.W (PC, 4), R0 //load a 16-bit branch displacement
    BRA/F R0
    .L0:
    NOP // delay slot
    .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single inst-
    ruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........

    For small copies, can encode them inline, but past a certain size this becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high overhead
    for small to medium copies.


    So, there is a size range where doing it inline would be too bulky, but
    a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an instruction.

    Ended up doing these with "slides", which end up eating roughly several
    kB of code space, but was more compact than using larger inline copies.


    Say (IIRC):
    128 bytes or less: Inline Ld/St sequence
    129 bytes to 512B: Slide
    Over 512B: Call "memcpy()" or similar.

    Versus::
    1-infinity: use MM instruction.

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.
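    The dispatch just described (tail handled first, then entry into the slide at a 32-byte-multiple entry point) can be sketched in C. The slide itself is modeled here as an ordinary loop; on the real machine it is a computed branch into unrolled code, and the names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes the way the slide does: the sub-32-byte tail is copied
 * externally first, then we "enter the slide" at the entry point for
 * the rounded-down size, which copies 32-byte blocks from the high
 * address downward. */
void slide_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t body = n & ~(size_t)31;            /* multiple-of-32 part */
    size_t tail = n - body;

    memcpy(dst + body, src + body, tail);     /* tail handled externally */

    /* equivalent of branching into the slide at entry point 'body' */
    for (size_t off = body; off >= 32; off -= 32)
        memcpy(dst + off - 32, src + off - 32, 32);
}
```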

    Does this remain sequentially consistent ??

    Though, this is only used for fixed-size copies (or "memcpy()" when
    value is constant).


    Say:

    __memcpy64_512_ua:
    MOV.Q (R5, 480), R20
    MOV.Q (R5, 488), R21
    MOV.Q (R5, 496), R22
    MOV.Q (R5, 504), R23
    MOV.Q R20, (R4, 480)
    MOV.Q R21, (R4, 488)
    MOV.Q R22, (R4, 496)
    MOV.Q R23, (R4, 504)

    __memcpy64_480_ua:
    MOV.Q (R5, 448), R20
    MOV.Q (R5, 456), R21
    MOV.Q (R5, 464), R22
    MOV.Q (R5, 472), R23
    MOV.Q R20, (R4, 448)
    MOV.Q R21, (R4, 456)
    MOV.Q R22, (R4, 464)
    MOV.Q R23, (R4, 472)

    ....

    __memcpy64_32_ua:
    MOV.Q (R5), R20
    MOV.Q (R5, 8), R21
    MOV.Q (R5, 16), R22
    MOV.Q (R5, 24), R23
    MOV.Q R20, (R4)
    MOV.Q R21, (R4, 8)
    MOV.Q R22, (R4, 16)
    MOV.Q R23, (R4, 24)
    RTS

    Duff's device by any other name.

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 10 21:19:20 2024
    BGB-Alt wrote:

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).

    My suggestion is that:: "Now that you have screwed around for a while,
    Why not take that experience and do a new ISA without any of those
    mistakes in it" ??

    Did have a mess when I later extended the ISA to 32 GPRs, as (like with
    BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    But you did not fix them en masse, and you complain about them
    at least once a week. There comes a time when it takes less time
    and less courage to do that big switch and clean up all that mess.


    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the merit
    of being able to branch anywhere within a Doom or Quake sized binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Maybe consider now as the appropriate time to start.

    Granted, had I started with RISC-V instead of SuperH, it is probable
    BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than RV32I
    or similar. Turns out not really, as the penalty of the 16 bit ops was needing almost twice as many on average.

    My 66000 only requires 70% of the instruction count of RISC-V.
    Yours could too ................

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    inst-
    ruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    You would not have to spend hours a week defending the indefensible !!

    I had generally avoided anything that will have required microcode or
    shoving state-machines into the pipeline or similar.

    Things as simple as IDIV and FDIV require sequencers.
    But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!

    Things like Load/Store-Multiple or

    If you like polluted ICaches..............

    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copies.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an
    instruction.


    This is an area where "slides" work well, the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Consider that the predictor getting into the slide the first time
    always mispredicts !!

    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

    What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
    yet a HW sequencer only has to avoid asserting a single byte write enable
    once.

    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
    iteration or so to try to limit looping overhead.

    On low end machines, you want to operate at cache port width,
    On high end machines, you want to operate at cache line widths per port.
    This is essentially impossible using slides.....here, the same code is
    not optimal across a line of implementations.

    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.

    What do you do when the SATA drive wants to write a whole page ??

    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
    shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added a
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for each
    LZ decoder (as I see it, it is a similar issue to not wanting code to endlessly re-roll stuff for functions like memcpy or malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
        1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.

    By what, 37-gates ??

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    What if a SATA drive is reading while you are writing !!
    That is, DMA is no different than multi-threaded applications--except
    DMA cannot perform locks.

    Main wonk is that it does start copying from the high address first. Presumably interrupts or similar won't be messing with application memory
    mid-memcpy.

    The only things wanting high-low access patterns are dumping stuff to the stack. The fact you CAN get away with it most of the time is no excuse.

    The looping memcpy's generally work from low to high addresses though.

    As does all string processing.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Apr 10 23:30:02 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD instruction and 1 or 2 words in DCache, while consuming a GPR. So, overall, it takes
    fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have no
    direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Apr 11 14:13:24 2024
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the
    instruction. Easy to find, easy to access, no register pollution,
    no DCache pollution.

    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

  • From Terje Mathisen@21:1/5 to Scott Lurndal on Thu Apr 11 12:22:47 2024
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    Except it pretty rarely does so (increases icache pressure):

    mov temp_reg, offset const_table
    mov reg,qword ptr [temp_reg+const_offset]

    looks to me like at least 5 bytes for the first instruction and probably
    6 for the second, for a total of 11 (could be as low as 8 for a very
    small offset), all on top of the 8 bytes of dcache needed to hold the
    64-bit value loaded.

    In My 66000 this should be a single 32-bit instruction followed by the
    8-byte const, so 12 bytes total and no dcache interference.

    It is only when you do a lot of 64-bit data loads, all gathered in a
    single 256-byte buffer holding up to 32 such values, and you can afford
    to allocate a fixed register pointing to the middle of that range, that
    you actually gain some total space: Each load can now just do a

    mov reg,qword ptr [fixed_base_reg+byte_offset]

    which, due to the need for a 64-bit prefix, will probably need 4
    instruction bytes on top of the 8 bytes from dcache. At this point we
    are touching exactly the same number of bytes (12) as My 66000, but from
    two different caches, so much more likely to suffer dcache misses.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Thu Apr 11 14:30:27 2024
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.



    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The
    entire system
    sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in a
    SMP processor...

  • From MitchAlsup1@21:1/5 to BGB on Thu Apr 11 18:46:54 2024
    BGB wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

    MOV Imm16, Rn
    SHORI Imm16, Rn
    SHORI Imm16, Rn
    SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
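    The semantics of the sequence can be modeled directly, assuming the initial MOV zero-extends its Imm16 and SHORI does a shift-left-16-then-OR, per the description above:

```c
#include <stdint.h>

/* Each SHORI shifts the register left 16 and ORs in a new 16-bit chunk,
 * so MOV + 3x SHORI builds a full 64-bit constant, high chunk first. */
uint64_t mov_imm16(uint16_t imm)         { return imm; }
uint64_t shori(uint64_t r, uint16_t imm) { return (r << 16) | imm; }

uint64_t build64(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    uint64_t r = mov_imm16(a);   /* MOV   Imm16, Rn */
    r = shori(r, b);             /* SHORI Imm16, Rn */
    r = shori(r, c);             /* SHORI Imm16, Rn */
    r = shori(r, d);             /* SHORI Imm16, Rn */
    return r;
}
```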

    As compared to::

    CALK Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}

    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
    1-cycle, is preferable....

    A consuming instruction where you don't even use a register is better
    still !!

  • From MitchAlsup1@21:1/5 to BGB-Alt on Thu Apr 11 23:06:05 2024
    BGB-Alt wrote:

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:


    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

        CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding.

    While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
    just simplest. Then, once you start widening the microarchitecture, it
    is better to fetch wider than decode-issue so that you suffer least
    from boundary conditions. Once you start fetching wide OR have wide
    decode-issue, you have ALL the infrastructure to do variable length
    instructions. Thus, the complaint that VLE is hard has already been
    eradicated.

    Also, technically, it could be retrofitted onto RISC-V without any
    significant change, unlike some other options (as noted, I don't argue
    for adding Jumbo prefixes to RV, on the basis that there is no real
    viable way to add them to RV, *).

    The issue is that once you do VLE, RISC-V's ISA is no longer helping you
    get the job done, especially when you have to execute 40% more instructions.

    Sadly, the closest option to viable for RV would be to add the SHORI instruction and optionally pattern match it in the fetch/decode.

    Or, say:
    LUI Xn, Imm20
    ADD Xn, Xn, Imm12
    SHORI Xn, Imm16
    SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
    only if the Imm12 is positive), and 2x SHORI into a combined "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.
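    The fusion being proposed can be modeled as a toy pattern matcher over the decoded stream: LUI+ADD fuses into one constant-load macro-op, and SHORI+SHORI into one "Xn=(Xn&lt;&lt;32)|Imm32" macro-op, so the 4-instruction build issues in 2 cycles. This is an illustrative sketch, not any real decoder:

```c
#include <stddef.h>

/* Hypothetical decoded-instruction stream; only the fields the fuser
 * looks at are modeled. */
enum Op { LUI, ADDI, SHORI };

typedef struct { enum Op op; int rd; } Inst;

/* Count issue cycles after fusion, assuming each fused pair and each
 * lone instruction takes one cycle. */
int fused_cycles(const Inst *code, size_t n)
{
    int cycles = 0;
    for (size_t i = 0; i < n; ) {
        int pair = (i + 1 < n) && code[i].rd == code[i + 1].rd &&
                   ((code[i].op == LUI   && code[i + 1].op == ADDI) ||
                    (code[i].op == SHORI && code[i + 1].op == SHORI));
        i += pair ? 2 : 1;     /* fused pair consumes two instructions */
        cycles++;
    }
    return cycles;
}
```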

    Universal constants gets this down to 0 cycles......

    *: To add a jumbo prefix, one needs an encoding that:
    Uses up a really big chunk of encoding space;
    Is otherwise illegal and unused.
    RISC-V doesn't have anything here.

    Which is WHY you should not jump ship from SH to RV, but jump to an
    ISA without these problems.

    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal
    encoding space mostly because if I put instructions in there (with the existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16 ops
    or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Just another reason not to stay with what you have developed.

    In comparison, I reserve 6 major OpCodes so that a control transfer into
    data is highly likely to get Undefined OpCode exceptions rather than
    try to execute what is in that data. Then, as it is, I still have 21 slots
    in the major OpCode group free (27 if you count the permanently reserved).

    Much of this comes from side effects of Universal Constants.


    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
    1-cycle, is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far I only have 33-bit immediate values. Luckily, Imm33s seems
    to address around 99% of uses (for normal ALU ops and similar).

    What do you do when accessing data that the linker knows is more than 4GB
    away from IP ?? or known to be outside of 0-4GB ?? externs, GOT, PLT, ...

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By themselves though, the difference doesn't seem enough to justify the cost.

    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.

    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
    12 bytes (and allowing a 16-byte encoding would have too steep of a cost increase to be worthwhile).

    And yet I did.

    So, alas...

    Yes, alas..........

  • From MitchAlsup1@21:1/5 to BGB-Alt on Thu Apr 11 23:22:16 2024
    BGB-Alt wrote:

    On 4/11/2024 9:30 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:


    One thing that is still needed is a good, fast, and semi-accurate way to
    pull off the Z=1.0/Z' calculation, as needed for perspective-correct rasterization (affine requires subdivision, which adds cost to the
    front-end, and interpolating Z directly adds significant distortion for geometry near the near plane).

    I saw a 10-cycle latency 1-cycle throughput divider at Samsung::
    10 stages of 3-bit at a time SRT divider with some exponent stuff
    on the side. 1.0/z is a lot simpler than that (float only). A lot
    of these great big complicated calculations can be beaten into
    submission with a clever attack of brute force HW.....FMUL and FMAC
    being the most often cited cases.
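
    For a software analogue of the "semi-accurate 1/z" point: a bit-level
    initial guess refined by Newton-Raphson steps is one classic
    division-free attack. The magic constant and function name below are
    illustrative choices, not anything from the hardware being discussed:

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Division-free reciprocal sketch: a crude bit-trick initial guess
       (roughly 5% relative error), then two Newton-Raphson refinements
       y = y*(2 - z*y), each of which squares the relative error. */
    static float approx_recip(float z)
    {
        uint32_t bits;
        memcpy(&bits, &z, sizeof bits);    /* reinterpret float as bits */
        bits = 0x7EF311C3u - bits;         /* illustrative magic constant */
        float y;
        memcpy(&y, &bits, sizeof y);
        y = y * (2.0f - z * y);            /* Newton step 1 */
        y = y * (2.0f - z * y);            /* Newton step 2 */
        return y;
    }
    ```

    Each Newton step costs one FMUL plus one FMAC-shaped operation, which is
    why those two units get cited as the brute-force workhorses.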

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Apr 11 23:12:25 2024
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.

    Effectively, that is what HW does, even on the lower end machines,
    the AGEN unit of the Cache access pipeline is repeatedly cycled,
    and data is read and/or written. One can execute instructions not
    needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
    are in progress.

    Moving this sequencer farther out would still require it to consume
    all L1 BW in any event (snooping) for memory consistency reasons.
    {Note: cache accesses are performed line-wide not register width wide}


    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The
    entire system
    sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in a
    SMP processor...

    The entire chunk of data traverses the interconnect as a single
    transaction. All interested 3rd parties (not originator nor
    recipient) see either the memory state before the transfer or
    after the transfer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Fri Apr 12 02:19:04 2024
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.
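
    The reach limit can be sketched in C: LUI supplies the upper 20 bits of
    a 32-bit word and ADDI a sign-extended 12-bit low part, so the pair can
    only ever produce a sign-extended 32-bit value. The model function below
    is an illustration of RV64 semantics, not compiler code:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Model of RV64 "LUI rd,hi20 ; ADDI rd,rd,lo12": the result is a
       sign-extended 32-bit quantity, never an arbitrary 64-bit constant. */
    static int64_t lui_addi(int32_t hi20, int32_t lo12)
    {
        int64_t v = (int64_t)(int32_t)((uint32_t)hi20 << 12); /* LUI, sign-extends bit 31 */
        return v + lo12;                                      /* ADDI */
    }
    ```

    Reaching a genuine 64-bit constant this way takes additional shifts and
    adds (or a load), which is the contrast being drawn with a single
    instruction carrying the full constant.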

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.


    Maybe. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Apr 12 01:40:27 2024
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    std #4614300636657501161,[sp,88] // a[0]
    std #4645348406721991307,[sp,104] // a[2]
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7
    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11
    ; %bb.3:
    fmul r3,r1,r1
    fdiv r3,#1,r3
    mov r4,#0x3F90B4FB18B485C7 // p[5]
    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]
    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header: Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1
    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail
    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header: Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Fri Apr 12 13:40:01 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !! =20
    =20
    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends. =20
    =20
    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.
    =20

    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen LD-OP
    architecture that had SUBR instruction. May be, TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Fri Apr 12 18:08:33 2024
    On Fri, 12 Apr 2024 13:40:01 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !! =20
    =20
    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends. =20
    =20
    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.
    =20

    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen LD-OP
    architecture that had SUBR instruction. May be, TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.


    ARM LDADD is not a LD-OP instruction. It is RMW.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Apr 12 23:46:33 2024
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants, and particular ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar, are rare).


    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    ADD -128, SP
    std #4614300636657501161,[sp,88] // a[0]
    MOV 0x400949FB3ED443E9, R3
    MOV.Q R3, (SP, 88)
    std #4645348406721991307,[sp,104] // a[2]
    MOV 0x407797C38897528B, R3
    MOV.Q R3, (SP, 104)
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    .... pattern is obvious enough.
    Each constant needs 12 bytes, so 16 bytes/store.

    But 2 instructions instead of 1 and 16 bytes instead of 12.

    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    FABS R5, R6
    FLDH 0x3780, R3 //A
    FCMPGT R3, R6 //A
    BT .LBB141_6 //A

    Or (FP-IMM extension):

    FABS R5, R6
    FCMPGE 0x0DE, R6 //B (FP-IMM)
    BF .LBB141_6 //B

    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7

    FCMPGE 0x110, R6
    BF .LBB141_7

    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11

    MOV 0x403A8B020C49BA5E, R3
    FCMPGT R3, R6
    BT .LBB141_11

    Where, FP-IMM wont work with that value.

    Value came from source code.

    ; %bb.3:
    fmul r3,r1,r1
    FMUL R5, R5, R7
    fdiv r3,#1,r3
    Skip, operation gives identity?...

    It is a reciprocate R3 = #1.0/R3

    mov r4,#0x3F90B4FB18B485C7 // p[5]
    Similar.

    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]

    Turns into 4 constants, 7 FPU instructions (if no FMAC extension, 4 with FMAC). Though, at present, FMAC is slower than separate FMUL+FADD.

    So, between 8 and 11 instructions.

    Instead of 4.....

    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]

    These can map 1:1.

    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header:
    Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1

    Could be mapped to a scalar loop, pretty close to 1:1.

    I have 7 instructions in the loop, you would have 9.

    Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
    guessing 2 copies for a 4-element vector?...


    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail

    Same patterns as before.
    Would need ~ 10 ops.

    Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
    but this would only give "Binary32 truncated to 29 bits" precision for
    the immediate values.

    Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
    give ~ 53 bits of precision. But, if one needs full Binary64, this will
    not gain much in this case.


    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header:
    Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    Don't really have time at the moment to comment on the rest of this...


    In other news, found a bug in the function dependency-walking code.

    Fixing this bug got things a little closer to break-even with RV64G GCC
    output regarding ".text" size (though, was still not sufficient to
    entirely close the gap).


    This was mostly based on noting that the compiler output had included
    some things that were not reachable from within the program being
    compiled (namely, noticing that the Doom build had included a copy of
    the MS-CRAM video decoder and similar, which was not reachable from
    anywhere within Doom).

    Some more analysis may be needed.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Apr 13 03:17:43 2024
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants, and particular ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar, are rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Sun Apr 14 22:58:22 2024
    Stephen Fuld wrote:
    <snip>

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they
    are saving or restoring in the same data structure it uses for the
    registers (yes, it adds 32 bits to that structure – minimal cost).
    The same mechanism works for interrupts that take control away from a
    running process.

    I had missed this until now:: The stack remains 64-bit aligned at all
    times, so if you add 32-bits to the stack you actually add 64-bits to
    the stack.

    Given this, you can effectively use a 2-bit tag {integral, floating,
    pointing, describing}. The difference between pointing and describing
    is that pointing is C-like, while describing is dope-vector-like.
    {{Although others may find something else to put in the 4-th slot.}}
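
    The 2-bit-per-register scheme above packs tags for all 32 registers into
    one 64-bit word of the save area. A minimal sketch of that packing; the
    names and the particular encoding are illustrative, not a My 66000
    definition:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* The four tag kinds suggested in the thread; encoding is assumed. */
    enum tag { TAG_INTEGRAL = 0, TAG_FLOATING = 1,
               TAG_POINTING = 2, TAG_DESCRIBING = 3 };

    /* Set register reg's 2-bit tag field within the 64-bit save word. */
    static uint64_t set_tag(uint64_t tags, int reg, enum tag t)
    {
        tags &= ~(UINT64_C(3) << (2 * reg));      /* clear old field */
        return tags | ((uint64_t)t << (2 * reg)); /* install new tag */
    }

    /* Read register reg's 2-bit tag field back out. */
    static enum tag get_tag(uint64_t tags, int reg)
    {
        return (enum tag)((tags >> (2 * reg)) & 3);
    }
    ```

    ENTER/EXIT would then save and restore this one word alongside the
    registers, keeping the stack 64-bit aligned as noted.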

    Any comments are welcome.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Apr 14 23:25:52 2024
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
    {
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
    {c, sum[i]} = a[i] + b[i] + c;
    }
    return;
    }

    Assembly code::

    .global mpn_add_n
    mpn_add_n:
    MOV R5,#0 // c
    MOV R6,#0 // i

    VEC R7,{}
    LDD R8,[R2,Ri<<3]
    LDD R9,[R3,Ri<<3]
    CARRY R5,{{IO}}
    ADD R10,R8,R9
    STD R10,[R1,Ri<<3]
    LOOP LT,R6,#1,R4
    RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a
    performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does
    CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.
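
    The carry chain that CARRY/ADD capture in a single instruction pair can
    be written out in portable C. This is a reference sketch using the
    GCC/Clang builtin __builtin_add_overflow, not the libgmp kernel itself:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Reference version of the mpn_add_n loop above: each limb addition
       takes a carry in and produces a carry out. */
    static void mpn_add_n_ref(uint64_t *sum, const uint64_t *a,
                              const uint64_t *b, int n)
    {
        uint64_t c = 0;                               /* carry, as in the source */
        for (int i = 0; i < n; i++) {
            uint64_t t;
            uint64_t c1 = __builtin_add_overflow(a[i], b[i], &t);
            uint64_t c2 = __builtin_add_overflow(t, c, &sum[i]);
            c = c1 | c2;                              /* at most one can be set */
        }
    }
    ```

    The two builtin calls per limb show what CARRY {{IO}} folds away: the
    ISA threads the carry through a single ADD instead of materializing it
    in a general register.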

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Mon Apr 15 10:02:46 2024
    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t sum[], uint64_t a[], uint64_t b[], int n )
    {
        uint64_t c = 0;
        for( int i = 0; i < n; i++ )
        {
             {c, sum[i]} = a[i] + b[i] + c;
        }
        return;
    }

    Assembly code::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]
        LDD   R9,[R3,Ri<<3]
        CARRY R5,{{IO}}
        ADD   R10,R8,R9
        STD   R10,[R1,Ri<<3]
        LOOP  LT,R6,#1,R4
        RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.

    ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
    xor rax,rax ;; Clear carry
    next:
    mov rax,[rsi+rcx*8]
    adc rax,[rdx+rcx*8]
    mov [rdi+rcx*8],rax
    inc rcx
    jnz next

    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.
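    The RCX = -n idiom above can be expressed in portable C; this is a sketch
    (the comparison arithmetic stands in for the ADC carry flag), not actual
    library code:

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* The "RCX = -n" trick: point at the END of each array and index from
       -n up toward 0, so the loop branch is just increment-and-test, as in
       the INC RCX / JNZ pair above. */
    static uint64_t add_n_negidx(uint64_t *sum, const uint64_t *a,
                                 const uint64_t *b, ptrdiff_t n)
    {
        sum += n; a += n; b += n;              /* RDI/RSI/RDX -> array ends */
        uint64_t c = 0;
        for (ptrdiff_t i = -n; i != 0; i++) {  /* INC RCX / JNZ next */
            uint64_t t = a[i] + c;
            uint64_t c1 = t < c;               /* carry out of a[i] + c */
            sum[i] = t + b[i];
            c = c1 + (sum[i] < t);             /* always 0 or 1 */
        }
        return c;
    }
    ```

    The design point is that the induction variable doubles as the loop
    condition, freeing a register and a compare.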

    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle

    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle

    inc ecx
    jnz next ; Third cycle

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Terje Mathisen on Mon Apr 15 11:16:15 2024
    Terje Mathisen wrote:
    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t sum[], uint64_t a[], uint64_t b[], int n )
    {
        uint64_t c = 0;
        for( int i = 0; i < n; i++ )
        {
             {c, sum[i]} = a[i] + b[i] + c;
        }
        return;
    }

    Assembly code::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]
        LDD   R9,[R3,Ri<<3]
        CARRY R5,{{IO}}
        ADD   R10,R8,R9
        STD   R10,[R1,Ri<<3]
        LOOP  LT,R6,#1,R4
        RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a
    performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does
    CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.

      ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
      xor rax,rax ;; Clear carry
    next:
      mov rax,[rsi+rcx*8]
      adc rax,[rdx+rcx*8]
      mov [rdi+rcx*8],rax
      inc rcx
       jnz next

    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.

    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

     next:
      adc eax,ebx
      mov ebx,[edx+ecx*4]    ; First cycle

      mov [edi+ecx*4],eax
      mov eax,[esi+ecx*4]    ; Second cycle

      inc ecx
       jnz next        ; Third cycle


    In the same bad old days, the standard way to speed it up would have
    used unrolling, but until we got more registers, it would have stopped
    itself very quickly. With AVX2 we could use 4 64-bit slots in a 32-byte register, but then we would have needed to handle the carry propagation manually, and that would take longer than a series of ADC/ADX instructions.

    next4:
    mov eax,[esi]
    adc eax,[esi+edx]
    mov [esi+edi],eax
    mov eax,[esi+4]
    adc eax,[esi+edx+4]
    mov [esi+edi+4],eax
    mov eax,[esi+8]
    adc eax,[esi+edx+8]
    mov [esi+edi+8],eax
    mov eax,[esi+12]
    adc eax,[esi+edx+12]
    mov [esi+edi+12],eax
    lea esi,[esi+16]
    dec ecx
    jnz next4
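    To make the carry-propagation objection concrete, here is a scalar C
    sketch of what an AVX2 version would have to do: one lane-wise add (the
    part a single VPADDQ gives you for four lanes), followed by a serial
    ripple of the inter-lane carries, which is the part the vector unit does
    not help with. Names are illustrative.

    ```c
    #include <stdint.h>

    /* 4-lane add modelled SIMD-style: vector add first, then a serial
       carry ripple across lanes. Returns the carry out of lane 3. */
    static uint64_t simd_style_add4(uint64_t s[4],
                                    const uint64_t a[4], const uint64_t b[4])
    {
        uint64_t lane_c[4];
        for (int i = 0; i < 4; i++) {      /* the "vector" (lane-wise) add */
            s[i] = a[i] + b[i];
            lane_c[i] = s[i] < a[i];       /* per-lane carry out */
        }
        uint64_t c = 0;
        for (int i = 0; i < 4; i++) {      /* the serial ripple */
            s[i] += c;
            c = lane_c[i] | (s[i] < c);    /* cannot both be 1 */
        }
        return c;
    }
    ```

    The second loop is exactly the dependence chain ADC/ADX resolves in
    hardware, which is why the series of scalar ADCs wins.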

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Mon Apr 15 19:03:34 2024
    Michael S wrote:

    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUPIC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.


    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen LD-OP architecture that had SUBR instruction. May be, TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.

    That a SUBR instruction exists does not disavow my statement that
    the inbound memory reference was never in the Rs1 position.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Apr 15 20:55:53 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle

    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle

    inc ecx
    jnz next ; Third cycle

    Terje

    As opposed to::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
        LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // Add pair to add octal
        STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
        LOOP  LT,R6,#1,R4         // increment 2-to-8 times
        RET

    --------------------------------------------------------

        LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
        LDD   R9,[R3,Ri<<3]       // AGEN cycle 2, data cycle 4
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // cycle 4
        STD   R10,[R1,Ri<<3]      // AGEN cycle 3, write cycle 5
        LOOP  LT,R6,#1,R4         // cycle 3

    OR

        LDD       LDd
             LDD       LDd
                       ADD
                  ST        STd
                  LOOP
                       LDD       LDd
                            LDD       LDd
                                      ADD
                                 ST        STd
                                 LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !! without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Apr 16 08:44:26 2024
    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

      next:
       adc eax,ebx
       mov ebx,[edx+ecx*4]    ; First cycle

       mov [edi+ecx*4],eax
       mov eax,[esi+ecx*4]    ; Second cycle

       inc ecx
       jnz next        ; Third cycle

    Terje

    As opposed to::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
        LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // Add pair to add octal
        STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
        LOOP  LT,R6,#1,R4         // increment 2-to-8 times
        RET

    --------------------------------------------------------

        LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
        LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // cycle 4
        STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
        LOOP  LT,R6,#1,R4         // cycle 3

    OR

        LDD       LDd
             LDD       LDd
                       ADD
                  ST        STd
                  LOOP
                       LDD       LDd
                            LDD       LDd
                                      ADD
                                 ST        STd
                                 LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
    machine !!
    without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
    machine !!

    It all comes down to the carry propagation, right?

    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block
    to the next, right?

    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Apr 16 18:14:39 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

      next:
       adc eax,ebx
       mov ebx,[edx+ecx*4]    ; First cycle

       mov [edi+ecx*4],eax
       mov eax,[esi+ecx*4]    ; Second cycle

       inc ecx
       jnz next        ; Third cycle

    Terje

    As opposed to::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
        LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // Add pair to add octal
        STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
        LOOP  LT,R6,#1,R4         // increment 2-to-8 times
        RET

    --------------------------------------------------------

        LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
        LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // cycle 4
        STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
        LOOP  LT,R6,#1,R4         // cycle 3

    OR

        LDD       LDd
             LDD       LDd
                       ADD
                  ST        STd
                  LOOP
                       LDD       LDd
                            LDD       LDd
                                      ADD
                                 ST        STd
                                 LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
    machine !!
    without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
    machine !!

    It all comes down to the carry propagation, right?

    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block
    to the next, right?

    Most ST pipelines have an align stage to align the data to be stored to
    where it needs to go; one can extend the carry into this stage if needed,
    and capture both a+b and a+b+1, using the carry-in to select one or the other.
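    The trick described here is essentially carry-select addition. A C model
    might look like the following (hardware would do this with two adders and
    a mux; the function name is invented):

    ```c
    #include <stdint.h>

    /* Carry-select: compute both possible sums up front, then let the
       late-arriving carry-in pick one, as in the ST align stage above. */
    static uint64_t carry_select_add(uint64_t a, uint64_t b,
                                     uint64_t cin, uint64_t *cout)
    {
        uint64_t s0 = a + b;          /* result if carry-in = 0 */
        uint64_t s1 = a + b + 1;      /* result if carry-in = 1 */
        uint64_t c0 = s0 < a;         /* carry-out under each assumption */
        uint64_t c1 = s1 <= a;
        *cout = cin ? c1 : c0;
        return cin ? s1 : s0;         /* the carry-in only drives a select */
    }
    ```

    Because the adds never wait on the carry-in, the carry path shrinks to
    a mux delay, which is what makes pushing it into a later stage cheap.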

    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to EricP on Tue Apr 16 15:06:48 2024
    On 4/3/2024 11:44 AM, EricP wrote:
    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First,
    it has several features that are “friendly” to the idea.  Second, I
    know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISA's
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded
    value to double precision.

    big snip

    Currently the opcode data type can tell the uArch how to route
    the  operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Seems right.



    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.

    Could be. Again, IANAHG.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Tue Apr 16 15:08:58 2024
    On 4/3/2024 1:02 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions. Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    I think that is probably true for 32 bit instructions, but what about 16
    bit?

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Tue Apr 16 15:02:13 2024
    On 4/3/2024 10:24 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the disassembler).

    Good point. It probably isn't too bad for the arithmetic operations,
    etc, but once you extend it as I suggested in the last paragraph it gets
    ugly. :-(


    big snip

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use?

    In an OoO CPU, that's pretty heavy.

    OK, but in the vast majority of cases (i.e. unless there is something
    like a conditional branch that uses floating point or integer depending
    upon whether the branch is taken.) the flag bit that a register will
    have can be known well in advance. As I said, IANAHG, but that might
    make it easier.



    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT),

    But even here, you almost certainly know what the tag bit for any given register is long before you execute the EXIT instruction. And remember,
    on MY 66000 EXIT is performed lazily, so you have time and the mechanism
    is in place to wait if needed.


    so you
    probably can handle the tags in the front end (decoder and renamer).
    Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.

    Yes.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Apr 16 16:44:27 2024
    On 4/3/2024 2:30 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
                   66000
    Sorry. Typo.



    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    What do you do when you want a FP bit pattern interpreted as an integer,
    or vice versa.

    As I said below, if you need that, you can use an otherwise "useless"
    instruction, such as ORing a register with itself, to modify the tag bits.





    When executing arithmetic instructions, if the tag bits of both
    sources of an instruction are the same, do the appropriate operation
    (floating or integer), and set the tag bit of the result register
    appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

    Good point.




    3.    Always do the operation in floating point and convert the
    integer operand prior to the operation.  (Or, if you prefer, change
    floating point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is
    the best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense
    for FP operands, e.g. the transcendental instructions.  And there are
    some operations that probably only make sense for non-FP operands,
    e.g. POP, FF1, probably shifts.  Given the tag bit, these could share
    the same op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    Agreed.


    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if FP
    and Integers share a register file, the lack of a prototype is much less
    stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter
    and Exit instructions to save/restore the tag bits of the registers
    they are saving or restoring in the same data structure it uses for
    the registers (yes, it adds 32 bits to that structure – minimal
    cost).  The same mechanism works for interrupts that take control
    away from a running process.

    Yes, but we do just fine without the tag and without the stuff mentioned above. Neither ENTER nor EXIT care about the 64-bit pattern in the
    register.

    I think you need it for callee saved registers to ensure the tag is set
    correctly for the calling program upon return to it.



    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some
    other instructions to do this, without requiring another op-code.
    For example, Oring a register with itself could be used to set the
    tag bit and Oring a register with zero could clear it.  These should
    be pretty rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it?

    No.

               To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading
    their respective tag bits before knowing which FU to use?

    The problem is you have made decode dependent on dynamic pipeline
    information. I suggest you don't want to do that. Consider a change from
    int to FP instruction as a predicated instruction: the pipeline cannot
    DECODE the instruction at hand until the predicate resolves. Yech.

    Good point.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Apr 17 01:11:12 2024
    Stephen Fuld wrote:

    On 4/3/2024 11:44 AM, EricP wrote:


    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISA's
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded
    value to double precision.

    Insufficient verbal precision::

    My 66000 only cares about the size of a value being loaded from memory
    (or ST into memory).

    While (float) LDs load the 32-bit value from memory, they remain (float)
    while residing in the register; and the High Order 32-bits are ignored.
    The (float) register can be consumed by a (float) FP calculation and it
    remains (float) after processing.

    Small immediates, when consumed by FP instructions, are converted from
    integer to <sized> FP during DECODE. So::

    FADD R7,R7,#1

    adds 1.0D0 to the (double) value in R7 (and takes one 32-bit instruction), while:

    FADDs R7,R7,#1

    Adds 1.0E0 to the (float) value in R7.
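    A C model of that decode-time conversion (the helper name is invented for
    illustration): the integer immediate 1 becomes the IEEE-754 double 1.0,
    bit pattern 0x3FF0000000000000.

    ```c
    #include <stdint.h>
    #include <string.h>

    /* Model of the decoder turning a small integer immediate into an FP
       operand of the instruction's width, as for FADD R7,R7,#1. */
    static uint64_t imm_as_double_bits(int64_t imm)
    {
        double d = (double)imm;          /* int -> FP at DECODE time */
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);  /* view the IEEE-754 encoding */
        return bits;
    }
    ```

    The point is that no FP register or extra instruction is consumed: the
    constant is manufactured in the decode stage.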

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to tkoenig@netcologne.de on Wed Apr 17 15:06:05 2024
    On Mon, 8 Apr 2024 17:25:38 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?

    Well, floating-point and integer instructions of one size each can be arbitrarily mixed. And when different sizes need to mix, going to
    36-bit instructions is low overhead.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Wed Apr 17 15:07:18 2024
    On Mon, 8 Apr 2024 19:56:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256-bits, and looks at the first 4-bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    Yes, I don't disagree. I'm just pointing out that it's possible to
    make the mini tags idea work that way, since it lets you easily turn
    mini tags off when you need to.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)