• Concertina II Progress

    From Quadibloc@21:1/5 to All on Wed Nov 8 21:33:59 2023
    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.

    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.

    So what I had done was, after squeezing as much as I could into a basic instruction format, I provided for switching into alternate instruction
    formats which made different compromises by using the block headers.

    This has now been dropped. Since I managed to get the normal (unaligned) memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises
    in the basic instruction set, it wasn't needed to have multiple instruction formats.

    I had to change the instructions longer than 32 bits to get them in the
    basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Thu Nov 9 00:43:27 2023
    On 11/8/2023 3:33 PM, Quadibloc wrote:
    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.


    Ironically, I am getting slightly better reach on average with (scaled)
    9-bit (and 10) bit displacements than RISC-V gets with 12 bits...

    Say:
    DWORD:
    12s, Unscaled: +/- 2K
    9u, 4B Scale : + 2K
    10s, 4B Scale: +/- 2K (XG2)
    QWORD:
    12s, Unscaled: +/- 2K
    9u, 8B Scale : + 4K
    10s, 8B Scale: +/- 4K (XG2)
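    As a minimal sketch of that reach arithmetic in C (function names and the
    choice of QWORD scaling are illustrative only, not actual decoder logic):

    #include <stdint.h>

    /* RISC-V-style: 12-bit signed, unscaled displacement. */
    static uint64_t ea_disp12s(uint64_t base, int32_t disp12s) {
        return base + (int64_t)disp12s;          /* reach: -2048 .. +2047 bytes */
    }

    /* Scaled style: 9-bit unsigned displacement, scaled by the 8-byte element. */
    static uint64_t ea_disp9u_qword(uint64_t base, uint32_t disp9u) {
        return base + ((uint64_t)disp9u << 3);   /* reach: 0 .. 511*8 = 4088 bytes */
    }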

    It was a pretty tight call between 10s and 10u, but 10s won out by a
    slight margin mostly because the majority of structs and stack-frames
    tend to be smaller than 4K (but, does create an incentive to use larger
    storage formats for on-stack storage).

    Though, for integer immediate instructions, RISC-V would have a slight advantage. Where, say, roughly 9% of 3R integer immediate values miss
    with the existing Imm9u/Imm9n scheme; but the sliver of "Misses with 9
    bits, but would hit with 12 bits", is relatively small (most of the
    "miss" cases are much larger constants).

    However, a fair chunk of these "miss" cases, could be handled with a bit-set/bit-clear instruction, say:
    y=x|0x02000000;
    z=x&0xFDFFFFFF;
    Turning into, say:
    BIS R4, 25, R6
    BIC R4, 25, R7
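    A small sketch of the kind of check a compiler could use to spot these
    cases (hypothetical helper, not taken from any particular compiler):

    #include <stdint.h>

    /* Returns the bit index if exactly one bit of m is set, else -1. */
    static int single_bit_index(uint32_t m) {
        if (m == 0 || (m & (m - 1)) != 0)
            return -1;
        int i = 0;
        while (!(m & 1)) { m >>= 1; i++; }
        return i;
    }

    /* x | 0x02000000: single_bit_index(0x02000000)  == 25 -> BIS Rs, 25, Rd
       x & 0xFDFFFFFF: single_bit_index(~0xFDFFFFFF) == 25 -> BIC Rs, 25, Rd */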

    Unclear if this case is quite common enough to justify adding these instructions though (granted, a case could be made for them).


    However, a few cases do typically need larger displacements:
    PC relative, such as branches.
    GBR relative, namely constant loads.


    For PC relative, 20-bits is "mostly enough", but one program has hit the
    20-bit limit (+/- 1MB). Recently, via a tweak, in current forms of the
    ISA, the effective branch-displacement limit (for a 32-bit instruction
    form) has been increased to 23 bit (+/- 8MB).
    Baseline+XGPR: Unconditional BRA and BSR only.
    Conditional branches still limited to 20 bits.
    XG2: Also includes conditional branches.

    In these cases, it was mostly because the bits that were being used to
    extend the GPRs to 6 bits were N/A for their original purpose with
    branch-ops, and these could be repurposed for the displacement. The main other alternatives would have been 22 bits + an alternate link register, or a
    3-bit LR field; however, the cost of supporting this would have been
    higher than that of reassigning them simply towards making the
    displacement bigger.

    Potentially a similar role could have been served by a conjoined "MOV
    LR, R1 | BSR Disp" instruction (and/or allowing "MOV LR, R1" in Lane 2
    as a special case for this, even if it would not otherwise be allowed
    within the ISA rules). Though, would defeat the point if this encoding
    foils the branch predictor.



    Recently, had ended up adding some Disp11s Compare-with-Zero branches,
    mostly as these branches turn out to be useful (in the face of 2-cycle
    CMPxx), and 8 bits "wasn't quite enough". Say, Disp11s can cover a much
    bigger if/else block or loop body (+/- 2K) than Disp8s (+/- 256B).


    For GBR Relative:
    The default 9-bit displacement was Byte scaled (for "reasons");
    But, a 512B range isn't terribly useful;
    Later forms ended up with Disp10u Scaled:
    This gives 4K or 8K of range (in Baseline)
    This increases to 8K and 16K in XG2.


    If the compiler sorts primitive global variables by descending-usage
    (and emits the top N specially, at the start of ".data"), then the
    Scaled GBR cases can access a majority of the global variables (around
    75-80% with a scaled 10-bit displacement).
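    As a rough sketch of that heuristic (the type and field names here are made
    up, and alignment details are glossed over):

    #include <stdlib.h>

    typedef struct { const char *name; int size; int use_count; int gbr_offset; } global_t;

    static int by_usage_desc(const void *a, const void *b) {
        return ((const global_t *)b)->use_count - ((const global_t *)a)->use_count;
    }

    /* Pack the most-used primitive globals into the window at the start of
       ".data" that the scaled GBR-relative forms can reach; the rest fall
       back to longer sequences. */
    static void assign_gbr_window(global_t *g, int n, int window_bytes) {
        qsort(g, n, sizeof *g, by_usage_desc);
        int offset = 0;
        for (int i = 0; i < n; i++) {
            if (offset + g[i].size <= window_bytes) {
                g[i].gbr_offset = offset;   /* reachable via short GBR-relative form */
                offset += g[i].size;
            } else {
                g[i].gbr_offset = -1;       /* needs Jumbo / 2-op / 3-op sequence */
            }
        }
    }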

    Effectively, the remaining 20-25% or so need to be handled as one of:
    Jumbo Disp33s (if Jumbo prefixes are available, most profiles);
    2-op Disp25s (no jumbo, '.data'+'.bss' less than 16MB).
    3-op Disp33s (else).


    Though, as with the stack frames, these instructions do create an
    incentive to effectively promote any small global variables to a larger
    storage type (such as 'char' or 'short' to 'int'); just with implicit
    sign (or zero) extensions to preserve the expected behavior of the
    smaller type (though, strictly speaking, only zero-extensions would be
    required by the C standard, given signed overflow is technically UB; but
    there would be something "deeply wrong" with a 'char' variable being
    able to hold, say, -4495213, or similar).

    Though, does mean for normal variables, "just use int or similar" is
    typically faster (say, because there are dedicated 32-bit sign and zero extending forms of some of the common ALU ops, but not for 8 or 16 bit
    cases).


    A Disp16u case could maybe reach 256K or 512K, which could cover much of
    a combined data+bss section. While in theory this could be better, to
    make effective use of this would require effectively folding much of
    ".bss" into ".data", which is not such a good thing for the program
    loader (as opposed to merely folding the top N most-used variables into ".data").

    Then again, uninitialized global arrays could probably still be left in
    ".bss", which tend to be the main "bulking factor" for this section (as
    opposed to normal variables).




    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.


    Yeah.

    If you want a Load/Store to have two 5 bit registers and a 16-bit
    displacement, only 6 bits are left in a 32-bit instruction word. This
    is, not a whole lot...

    For a full set of Load/Store ops, this is 4 bits;
    For a set of basic ALU ops, this is another 3 bits.

    So, just for Load/Store and basic ALU ops, half the encoding space is
    gone...

    Would it be worth it?...



    So what I had done was, after squeezing as much as I could into a basic instruction format, I provided for switching into alternate instruction formats which made different compromises by using the block headers.

    This has now been dropped. Since I managed to get the normal (unaligned) memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises in the basic instruction set, it wasn't needed to have multiple instruction formats.

    I had to change the instructions longer than 32 bits to get them in the
    basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.


    Such is a long standing issue...


    I am also annoyed sometimes at how complicated my design has gotten.
    Still, it is within reason, and not too far outside the scope of many
    existing RISC's.

    But, as noted, the reason XG2 exists as-is was sort of a compromise:
    I couldn't come up with any encoding which could actually give
    everything I wanted, and the "most practical" option was effectively to
    dust off an idea I had originally rejected:
    Having an alternate encoding which dropped 16-bit ops in favor of
    reusing these bits for more GPRs.


    At first glance, RISC-V seems cleaner and simpler, but this falls on its
    face once one goes outside the scope of RV64IM or similar.

    And, it isn't tempting when, at least from my POV, RV64 seems "less
    good" than what I have already (others may disagree; but at least to me,
    some parts of RISC-V's design seem to me like kind of a trash fire).

    The main tempting thing the RV64 has is that, maybe, if one goes and
    implements RV64GC and clones a bunch of SiFive's hardware interfaces,
    then potentially one can run a mainline Linux on it.

    There have apparently been some people that have gotten NOMMU Linux
    working on RV32IM targets, which is possible (and, ironically, seemingly
    basing these on the SuperH branch in the Linux kernel from what I had
    seen...).


    Seemingly, AMD/Xilinx is jumping over from MicroBlaze to an RV32
    variant. But, granted, RV32 isn't too far from what MicroBlaze is
    typically used for, so not really a huge stretch.

    I sometimes wonder if maybe I would be better off jumping to RV, but
    then I end up seeing examples where cores running at somewhat higher
    clock speeds still manage to deliver relatively poor framerates in Doom.


    Like, as-is, my MIPS scores are kinda weak, but I am still getting
    around 30 fps in Doom at around 20-24 MIPS.

    RV64IM seemingly needs significantly higher MIPS to get similar
    framerates in Doom.

    Say, for Doom:
    BJX2 needs ~ 800k instructions / frame;
    RV64IM seemingly needs nearly 2 million instructions / frame.

    Not entirely sure what all is going on, but I have my suspicions.

    Though, it does seem to be the inverse situation with Dhrystone.

    Say:
    BJX2: around 1.3 DMIPS per BJX2 instruction;
    RV64: around 3.8 DMIPS per RV64 instruction.

    Though, I can note that there seems to be "something weird" with
    Dhrystone and GCC (in multiple scenarios, GCC gives Dhrystone scores
    that are significantly above what could be "reasonably expected", or
    which agree with the scores given by other compilers, seemingly as-if it
    is optimizing away a big chunk of the benchmark...).

    But, these results don't typically extend to other programs (where
    scores are typically much closer together).


    Actually, I have noted that if comparing BGBCC with MSVC and BJX2 with
    my Ryzen, performance relations seem to scale pretty close to linearly relative to clock-speed, albeit with some outliers.

    There are cases where deviation has been noted:
    Speed differences for TKRA-GL's software rasterizer backend are smaller
    than the difference in clock-speed (74x clock-speed delta; 20x fill-rate delta);
    And cases where it is bigger: The performance delta for things like LZ4 decompression or some of my image codecs is somewhat larger than the clock-speed delta (say: 74x clock-speed delta, 115x performance delta, *1).


    *1: Though, LZ4 still operates near memcpy() speed in both cases; issue
    is mostly that, relative to MHz, my BJX2 core has comparably slower
    memory access.

    Albeit somehow, this trend reverses for my early 2000s laptop, which has
    slower RAM access. However, the SO-DIMM is 4x the width (64b vs 16b),
    and 133MHz vs 50MHz; and this leads to a theoretical 10.64x ratio, which
    isn't too far off from the observed memcpy() performance of the laptop.

    So, laptop has 10.64x faster RAM, relative to 28x more MHz.


    Whereas, say, my Ryzen has 2.64x more MHz (3.7 vs 1.4), but around 40x
    more memory bandwidth (12.7x for single-thread memcpy).



    Well, and if I did jump over to RV64, it would render much of what I
    am doing entirely moot.

    I *could* do a dedicated RV64 core, but would be unlikely to make it "notable"
    enough to be worthwhile.

    So, it seems like my options are either:
    Continue on doing stuff mostly as is;
    Drop it and probably go off to doing something else entirely.

    ...




    But, don't have much else better to be doing, considering the typically
    "meh" response to most of my 3D engine attempts. And my general
    lackluster skills towards most types of "creative" endeavors (I suspect "affective alexithymia" probably doesn't help too much for artistic expression).

    Well, and I have also recently noted other oddities, for example:
    It seems I may have "reverse slope hearing loss", and my hearing is
    seemingly notably poor for sounds much lower than about 1.5 or 2kHz (lower-frequency sine waves are nearly inaudible, but I can still hear square/triangle/sawtooth waves well; most of what I perceive as
    low-frequency sounds seemingly being based on higher-frequency harmonics
    of those sounds).

    So, say:
    2kHz..4kHz, loud, heard easily;
    4kHz..8kHz, also heard readily;
    8..15kHz, fades away and disappears.
    But, OTOH, for sine waves:
    1kHz: much quieter than 2kHz
    500Hz: fairly mild at full volume
    250Hz: relatively quiet
    125Hz: barely audible.


    But, for sounds much under around 200Hz, I can feel the vibrations, and
    can associate these with sound (but, this effect is not localized to
    ears, also works with hands and similar; this effect seems strongest at
    around 50-100 Hz, but has a lower range of around 6-8Hz, below this
    point, feeling becomes less sensitive to it, but visual perception can
    take over at this point).


    I can take audio and apply a fairly aggressive 2kHz high-pass filter
    (say, -48 dB per octave, applied several times), and for the most part it doesn't sound that much different, though does sound a little more
    tinny. This "tinny" effect is reduced with a 1kHz high-pass filter.

    Most of what I had perceived as low-frequency sounds are still present
    even after the filtering (despite being entirely absent in a spectrum plot). Zooming in generally shows patterns of higher-frequency vibrations
    following similar patterns to the low-frequency vibrations, which
    seemingly I perceive "as" the low-frequency vibration.


    And, in all this, I hadn't noticed that anything was amiss until looking
    into it for other reasons.



    I am left to wonder if some of this could be related to my preference
    for the sound of ADPCM compression over that of MP3 at lower quality
    levels (low bitrate MP3 sounds particularly awful, whereas ADPCM tends
    to fare better; but seemingly other people disagree).


    Does possibly explain some other past difficulties:
    I can make a noise and hear the walls within a room;
    But, trying to hit a metal tank to determine how much sand was in the
    tank by hearing, was quite a bit more difficult (best I could do was hit
    the tank, and then try to hear what parts of the tank had reduced echo;
    but results were pretty mixed as the sand level did not significantly
    change the echoes).

    Apparently, it turns out, people were listening for "thud" vs "not
    thud", but like, I couldn't really hear this part, and wasn't even
    really aware there should be a "thud" (or even really what a "thud"
    sounds like apart from the effects of, say, something hitting a chunk of
    wood; hitting a sand-filled steel tank with a rubber mallet was nearly
    silent, but, knuckles or tapping it with a screwdriver was easier to
    hear, ...).


    Well, also can't really understand what anyone is saying over the phone
    (as the phone reduces everything to difficult to understand muffled noises).

    Or, like the sound-effects in Wolfenstein 3D, which are theoretically voice
    clips saying stuff, but come across more as things like "aaaa uunn" or "aaaauuuu"
    or "uu aa uu" or similar, owing to the poor audio quality.

    Well, and my past failures to achieve any kind of intelligibility in
    past experiments messing with formant synthesis.

    And some experiments with vocoder like designs, noting that I could
    seemingly discard pretty much everything much below 500Hz or 1kHz
    without much ill effect; but theoretically there is "relevant stuff" in
    these frequency ranges. Didn't really think of much at the time (it
    seemed like all of this was a "bass frequency" range where the combined
    amplitude of everything could be averaged together and treated like a
    single channel).

    Had noted that, one thing that did sort of work, was, say:
    Split the audio into 32 frequency bands;
    Pick the top 2 or 3 bands, ignoring low-frequency or adjacent bands;
    Say, anything below 1kHz is ignored.
    Record the band number and relative volume.

    Then, regenerate waveforms at each of these bands with the measured
    volume (along with alternate versions spread across different octaves;
    it worked better if higher power-of-2 frequencies were also synthesized,
    albeit at lower intensities). Get back "mostly intelligible" speech.

    IIRC, had mostly used 32 bands spread across 2 octaves (say, 1-2 kHz and 2-4kHz, or 2-4 kHz and 4-8 kHz).
    Can also mix in sounds from the same relative position in other octaves.

    Seemed to have best results with mostly evenly-spread frequency bands.
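    A rough sketch of just the band-picking step, under assumptions (32
    precomputed band magnitudes, a cutoff band for the ~1 kHz floor, and
    adjacent-band suppression); the resynthesis/octave-spreading part is not
    shown:

    #define NBANDS 32

    typedef struct { int band; float level; } band_pick_t;

    /* Pick up to 3 of the loudest bands at or above min_band, skipping bands
       adjacent to ones already picked; level is relative to the loudest pick. */
    static int pick_top_bands(const float mag[NBANDS], int min_band, band_pick_t out[3])
    {
        int used[NBANDS] = {0};
        int npicked = 0;
        for (int k = 0; k < 3; k++) {
            int best = -1;
            for (int i = min_band; i < NBANDS; i++) {
                if (used[i]) continue;
                if (i > 0 && used[i - 1]) continue;
                if (i + 1 < NBANDS && used[i + 1]) continue;
                if (best < 0 || mag[i] > mag[best]) best = i;
            }
            if (best < 0 || mag[best] <= 0.0f) break;
            used[best] = 1;
            out[npicked].band = best;
            out[npicked].level = mag[best] / mag[out[0].band];
            npicked++;
        }
        return npicked;
    }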


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Thu Nov 9 18:50:37 2023
    Quadibloc <quadibloc@servername.invalid> schrieb:

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you
    31 bits. You're left with one bit of opcode, one for load and
    one for store.

    The /360 had 12 bits for three registers plus 12 bits of offset, so
    24 bits left eight bits for the opcode (the RX format).

    So, if you want to do this kind of thing, why not go for a full 32-bit
    offset in a second 32-bit word?

    [...]

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.

    Have you ever written an assembler for your ISA?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Thomas Koenig on Thu Nov 9 21:38:31 2023
    On Thu, 09 Nov 2023 18:50:37 +0000, Thomas Koenig wrote:

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you 31
    bits. You're left with one bit of opcode, one for load and one for
    store.

    Yes, and obviously that isn't enough. So I do have to make some
    compromises.

    The offset is 16 bits, because the 68000 (and the 8086, and others) had
    16-bit offsets!

    But the base and index registers are each specified by only 3 bits - only
    the destination register gets a 5-bit field.

    I need 5 bits for the opcode. That lets me have load and store for four floating-point types, load, store, unsigned load, and insert for four
    integer types (the largest one only uses load and store).

    So it is doable! 5 plus 5 plus 3 plus 3 equals 16, so I have 16 bits left
    for the offset.
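    Laid out as a field diagram (one possible ordering of the fields, purely
    for illustration):

    | opcode: 5 | dest: 5 | index: 3 | base: 3 | displacement: 16 |  = 32 bits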

    But that leaves only 1/4 of the opcode space. Which would be fine for a conventional RISC design, as that's plenty for the operate instructions.
    But I needed to reserve _half_ the opcode space, because I needed another
    1/4 of the opcode space for putting two 16-bit instructions in a 32-bit
    word for more compact code.

    That led me to look for compromises... and I found some that would not
    overly impair the effectiveness of the memory reference instructions,
    which I discussed previously. I ended up using _both_ of two alternatives
    each of which alone would have given me the needed savings in opcode
    space... that way, the compromised memory-reference instructions could be accompanied by another complete set of memory-reference instructions with
    _no_ compromise... except for only being able to specify aligned operands.

    The /360 had 12 bits for three registers plus 12 bits of offset, so 24
    bits left eight bits for the opcode (the RX format).

    Oh, yes, I remember it well.

    So, if you want to do this kind of thing, why not go for a full 32-bit
    offset in a second 32-bit word?

    Because the 360 only took 32 bits for a memory-reference instruction,
    using 32 bits for one is sinfully wasteful!

    I want to "have my cake and eat it too" - to have a computer that's just
    as good as a Power PC or a 68000 or a System/360, even though they have different, incompatible, strengths that conflict with a computer being
    able to be good at what each of them is good at simultaneously.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Thu Nov 9 21:42:43 2023
    On Thu, 09 Nov 2023 21:38:31 +0000, Quadibloc wrote:

    I want to "have my cake and eat it too" - to have a computer that's just
    as good as a Power PC or a 68000 or a System/360, even though they have different, incompatible, strengths that conflict with a computer being
    able to be good at what each of them is good at simultaneously.

    Actually, it's worse than that, since I also want the virtues of processors like the TMS320C2000 or the Itanium.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Thu Nov 9 21:51:31 2023
    On Thu, 09 Nov 2023 15:36:12 -0600, BGB-Alt wrote:
    On 11/9/2023 12:50 PM, Thomas Koenig wrote:

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you 31
    bits. You're left with one bit of opcode, one for load and one for
    store.


    Oh, that is even worse than I understood it as, namely:
    LDx Rd, (Rs, Disp16)
    ...

    But, yeah, 1 bit of opcode clearly wouldn't work...

    And indeed, he is correct, that is what I'm trying to do.

    But I easily solve _most_ of the problem.

    I just use 3 bits for the index register and the base register.

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    16-bit register-to-register instructions use eight bits to specify their
    source and destination registers, so both registers must be from the same
    group of eight registers.

    This lends itself to writing code where four distinct threads are
    interleaved, helping pipelining in implementations too cheap to have out-of-order execution.

    The index register can be one of registers 1 to 7 (0 means no indexing).

    The base register can be one of registers 25 to 31. (24, or a 0 in the three-bit base register field, indicates a special addressing mode.)

    This sort of is reminiscent of System/360 coding conventions.

    The special addressing modes do stuff like using registers 17 to 23 as
    base registers with a 12 bit displacement, so that additional short
    segments can be accessed.

    As I noted, shaving off two bits each from two fields gives me four more
    bits, and five bits is exactly what I need for the opcode field.

    Unfortunately, I needed one more bit, because I also wanted 16-bit instructions, and they take up too much space. That led me... to some interesting gyrations, but I finally found a compromise that was
    acceptable to me for saving those bits, so acceptable that I could drop
    the option of using the block header to switch to using "full" instructions instead. Finally!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Thu Nov 9 22:11:41 2023
    On Thu, 09 Nov 2023 21:42:43 +0000, Quadibloc wrote:

    On Thu, 09 Nov 2023 21:38:31 +0000, Quadibloc wrote:

    I want to "have my cake and eat it too" - to have a computer that's
    just as good as a Power PC or a 68000 or a System/360, even though they
    have different, incompatible, strengths that conflict with a computer
    being able to be good at what each of them is good at simultaneously.

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    And don't forget the Cray-I.

    So the idea is to have *one* ISA that will serve for...

    embedded microcontrollers,
    data-base servers,
    desktop workstations, and
    HPC supercomputers.

    Of course, these different tasks will require different implementations,
    which focus on doing parts of the ISA well.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB-Alt@21:1/5 to Thomas Koenig on Thu Nov 9 15:36:12 2023
    On 11/9/2023 12:50 PM, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you
    31 bits. You're left with one bit of opcode, one for load and
    one for store.


    Oh, that is even worse than I understood it as, namely:
    LDx Rd, (Rs, Disp16)
    ...

    But, yeah, 1 bit of opcode clearly wouldn't work...


    The /360 had 12 bits for three registers plus 12 bits of offset, so
    24 bits left eight bits for the opcode (the RX format).

    So, if you want to do this kind of thing, why not go for a full 32-bit
    offset in a second 32-bit word?


    Originally, I had turned any displacements that didn't fit into 9 bits
    into a 2-op sequence:
    MOV Imm25s, R0
    MOV.x (Rb, R0), Rn

    Actually, worse yet, the first form of BJX2 only had 5-bit Load/Store displacements, but it didn't take long to realize that 5 bits wasn't
    really enough (say, when roughly 2/3 of the load and store operations
    can't fit in the displacement).


    But, now, there are Jumbo-encodings, which can encode a full 33-bit displacement in a 64-bit encoding. Not everything is perfect though,
    mostly because these encodings are bigger and can't be used in a bundle.

    But, still "less bad" in this sense than my original 48-bit encodings,
    where "for reasons", these couldn't co-exist with bundles in the same
    code block.

    Despite the loss of 48-bit ops though:
    The jumbo encodings give larger displacements (33s vs 24u or 17s);
    They reuse the existing 32-bit decoders, rather than needing a dedicated
    48-bit decoder.


    But, yeah, "use another instruction word" if one needs a larger
    displacement, is mostly the option that I would probably recommend.


    At first, the 5-bit encodings went away, but later came back as a zombie
    of sorts (cases emerged where their existence was still valuable).

    But, then it later came down to a tradeoff (with the design of XG2):
    Do I expand the Disp9u to Disp10u, and then keep with the XGPR encoding
    of using the Disp5u encodings to encode a Disp6s case (for a small range
    of negative displacements), or expand Disp9u to Disp10s?...

    In this case, Disp10s won out by a small margin, as I needed non-trivial negative displacements at least slightly more often than I needed 8K for structs and stack frames and similar.


    But, for most things, a 16-bit displacement would be a waste...
    If I were going to go the route of using a signed 12-bit displacement
    (like RISC-V), would probably still keep it scaled though, as 8K/16K is
    still more useful than 2K.


    Branch displacements are typically still hard-wired to a 2-byte scale though, partly
    as the ISA started out with 16-bit ops, and switching XG2 over to 4-byte
    scale would have broken its symmetry with the Baseline ISA.


    Though, could pull a cheap trick and repurpose the LSB of branch ops in
    XG2, given as-is, it is effectively "Must Be Zero" (all instructions
    have a 32-bit alignment in this mode, and branches to an odd address are
    not allowed).

    So, the idea of a BSR that uses R1 as an alternate Link-Register is
    still not (entirely) dead (while at the same time allowing for the
    '.text' section to be expanded to 8MB).


    There are 64-bit Disp33s and Abs48 branch encodings, but, yeah, they
    have costs:
    They are 64-bit vs 32-bit, thus, bigger;
    Are ignored by the branch predictor, thus, slower;
    The Abs48 case is not PC relative
    Using it within a program requires a base reloc;
    Is generally useful for DLL imports and special cases though (*1).

    *1: Its existence is mostly as an alternative in these cases to a more expensive option:
    MOV Addr64, R1
    JMP R1
    Which needs 128-bits, and is also ignored by the branch predictor.


    [...]

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.

    Have you ever written an assembler for your ISA?

    Yeah, whether someone can write an assembler, or disassembler/emulator,
    and not drive themselves insane in the attempt, is possibly a test of
    "sanity".

    Granted, still not foolproof, as it isn't that bad to write an assembler/disassembler for x86 either, but trying to decode it in
    hardware would be nightmarish.

    Best guess I can have would be a "preclassify" stage:
    If this is an opcode byte, how long will it be, and will a Mod/RM
    follow, ...?
    If this is a Mod/RM byte, how many bytes will this add.

    Then in theory, one can figure instruction length like:
    Fetch OpLen for IP;
    Fetch Mod/RM len for IP+OpLen if Mod/RM flag is set;
    Add OpLen+ModRmLen.
    Add an extra 2/4 bytes if an Immed is present for this opcode.
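    A sketch of that length walk in C, with hypothetical per-byte tables (the
    table names and field widths are made up for illustration):

    #include <stdint.h>

    typedef struct {
        uint8_t op_len;      /* bytes of prefixes+opcode starting at this byte */
        uint8_t has_modrm;   /* 1 if a Mod/RM byte follows the opcode */
        uint8_t imm_len;     /* 0, 2, or 4 immediate bytes for this opcode */
    } opclass_t;

    /* modrm_len[b] = bytes added by the Mod/RM byte at b (itself + SIB + disp). */
    static uint32_t insn_length(const opclass_t *tag, const uint8_t *modrm_len, uint32_t ip)
    {
        uint32_t len = tag[ip].op_len;
        if (tag[ip].has_modrm)
            len += modrm_len[ip + tag[ip].op_len];
        len += tag[ip].imm_len;
        return len;
    }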

    Nicer to not bother.


    For my 75 MHz experiment, did end up adding a similar sort of
    "preclassify" logic to deal with instruction-lengths though, at the cost
    that now L1 I$ cache-lines are specific to the operating mode in which
    they were fetched (which now needs to be checked along with the address
    and similar).

    Mostly all this is a case of "looking up 4 bits of tag metadata" being
    less latency than "feed 9 bits of instruction bits through some LUTs"
    (or 12 bits if RISC-V decoding is enabled). There is still some latency
    due to MUX'ing and similar, but this part is unavoidable.

    So, former case:
    8 bits: Classify BJX2 instruction length;
    1 bit: Specify Baseline or XG2.
    Latter case:
    8 bits: Classify BJX2 instruction length;
    2 bits: Classify RISC-V instruction length (16/32)
    2 bits: Specify Baseline, XG2, RISC-V, or XG2RV.

    Which map to 4 bits (IIRC):
    (0): 16-bit
    (1): (WEX && WxE) || Jumbo
    (2): WEX
    (3): Jumbo


    As-is, after MUX'ing, this can effectively turn op-len determination
    into a 4 or 6 bit lookup, say (tag bits 1:0 for two adjacent 32-bit words):
    00zz: 32-bit
    01zz: 16-bit
    1000: 64-bit
    1001: 48-bit (unused)
    1010: 96-bit (*)
    1011: Invalid
    11zz: Invalid

    *: Here, we just assume that the 3rd instruction word's tag is 00.
    Would actually need to check this if either 4-wide bundles or 80-bit
    encodings were "actually a thing".
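    The same table, written out as a lookup in C (the tag packing and return
    convention here are assumptions; the 'z' bits are treated as don't-care):

    /* tag4 bits 3:2 = first 32-bit word's tag, bits 1:0 = the next word's tag. */
    static int op_len_bits(unsigned tag4)
    {
        switch (tag4 >> 2) {
        case 0: return 32;               /* 00zz */
        case 1: return 16;               /* 01zz */
        case 2:
            switch (tag4 & 3) {
            case 0:  return 64;          /* 1000 */
            case 1:  return 48;          /* 1001, unused */
            case 2:  return 96;          /* 1010, assumes 3rd word's tag is 00 */
            default: return -1;          /* 1011, invalid */
            }
        default: return -1;              /* 11zz, invalid */
        }
    }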

    Where, handling both XG2 and WXE (WEX Enable) in the preclassify step
    greatly simplifies the logic during instruction fetch.

    This could, in principle, be reduced further in an "XG2 only" core, or to
    a lesser extent by eliminating the original XGPR scheme. These are not currently planned though (say, the first-stage lookup width could be
    reduced from 8 to 5 or 7 bits).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB-Alt@21:1/5 to Quadibloc on Thu Nov 9 17:49:03 2023
    On 11/9/2023 3:51 PM, Quadibloc wrote:
    On Thu, 09 Nov 2023 15:36:12 -0600, BGB-Alt wrote:
    On 11/9/2023 12:50 PM, Thomas Koenig wrote:

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you 31
    bits. You're left with one bit of opcode, one for load and one for
    store.


    Oh, that is even worse than I understood it as, namely:
    LDx Rd, (Rs, Disp16)
    ...

    But, yeah, 1 bit of opcode clearly wouldn't work...

    And indeed, he is correct, that is what I'm trying to do.

    But I easily solve _most_ of the problem.

    I just use 3 bits for the index register and the base register.

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.


    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    Unless, maybe, registers were being treated like a stack, but even then,
    this is still gonna suck.

    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.


    Theoretically, 32 registers should be "pretty good", but I ended up with
    64 partly due to arguable weakness in my compiler's register allocation.

    Say, 64 makes it possible to statically assign most of the variables in most
    of the functions, which avoids the need for spill and fill (at least
    with a register allocator that isn't smart enough to locally assign
    registers across basic-block boundaries).

    I am not sure if a more clever compiler (such as GCC) could also find
    ways to make effective use of 64 GPRs.


    I guess, IA-64 did have 128 registers in banks of 32. Not sure how well
    this worked.


    16-bit register-to-register instructions use eight bits to specify their source and destination registers, so both registers must be from the same group of eight registers.


    When I added R32..R63, I ended up not bothering adding any way to access
    them from 16-bit ops.

    So:
    R0..R15: Generally accessible for all of 16-bit land;
    R16..R31: Accessible from a limited subset of 16-bit operations.
    R32..R63: Inaccessible from 16-bit land.
    Only accessible for an ISA subset for 32-bit ops in XGPR.

    Things are more orthogonal in XG2:
    No 16-bit ops;
    All of the 32-bit ops can access R0..R63 in the same way.


    This lends itself to writing code where four distinct threads are interleaved, helping pipelining in implementations too cheap to have out-of-order execution.


    Considered variations on this in my case as well, just with static
    control flow.

    However, BGBCC is nowhere near clever enough to pull this off...

    Best that can be managed is doing this sort of thing manually (this is
    sort of how "functions with 100+ local variables" are born).

    In theory, a compiler could infer when blocks of code or functions are
    not sequentially dependent and inline everything and schedule it in
    parallel, but alas, this sort of thing requires a bit of cleverness that
    is hard to pull off.


    The index register can be one of registers 1 to 7 (0 means no indexing).

    The base register can be one of registers 25 to 31. (24, or a 0 in the three-bit base register field, indicates a special addressing mode.)

    This sort of is reminiscent of System/360 coding conventions.


    OK.


    The special addressing modes do stuff like using registers 17 to 23 as
    base registers with a 12 bit displacement, so that additional short
    segments can be accessed.

    As I noted, shaving off two bits each from two fields gives me four more bits, and five bits is exactly what I need for the opcode field.

    Unfortunately, I needed one more bit, because I also wanted 16-bit instructions, and they take up too much space. That led me... to some interesting gyrations, but I finally found a compromise that was
    acceptable to me for saving those bits, so acceptable that I could drop
    the option of using the block header to switch to using "full" instructions instead. Finally!


    A more straightforward encoding would make things, more straightforward...


    Main debates I think are, say:
    Whether to start with the MSB of each word (what I had often done);
    Or, start from the LSB (like RISC-V);
    Whether 5 or 6 bit register fields;
    How many bits for immediate and opcode fields;
    ...

    Bundling and predication may eat a few bits, say:
    00: Scalar
    01: Bundle
    10/11: If-True / If-False

    In my case, this did leave an ugly hack case to support conditional ops
    in bundles. Namely, the instruction to "Load 24 bits into R0" has
    different interpretations in each case (Scalar: Load 24 bits into R0;
    Bundle: Jumbo Prefix; If-True/If-False, repeat a different instruction
    block, but understood as both conditional and bundled).

    This could be fully orthogonal with 3 bits, but it seems, this is a big ask:
    000, Unconditional, Scalar
    001, Unconditional, Bundle
    010, Special, Scalar (Eg: Large constant load or Branch)
    011, Special, Bundle (Eg: Jumbo Prefix)
    100, If-True, Scalar
    101, If-True, Bundle
    110, If-False, Scalar
    111, If-False, Bundle


    This leads to a lopsided encoding though, and it seems like things only
    really fit together nicely with a limited combination of sizes.

    Say, for an immediate field:
    24+ 9 => 33s
    24+24+16 => 64
    This is almost magic...

    Though:
    26+ 7 => 33s
    26+26+12 => 64
    Could also work.
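    As a sketch of how those widths compose (only the widths are taken from the
    above; the field placement is a guess, not the actual encoding):

    #include <stdint.h>

    /* 24-bit prefix + 9-bit base immediate -> 33-bit signed value. */
    static int64_t imm33(uint32_t j24, uint32_t imm9) {
        uint64_t v = ((uint64_t)(j24 & 0xFFFFFF) << 9) | (imm9 & 0x1FF);
        if (v & (1ULL << 32))
            v |= ~((1ULL << 33) - 1);    /* sign-extend from bit 32 */
        return (int64_t)v;
    }

    /* Two 24-bit prefixes + 16-bit base immediate -> full 64-bit value. */
    static uint64_t imm64(uint32_t j24_hi, uint32_t j24_lo, uint32_t imm16) {
        return ((uint64_t)(j24_hi & 0xFFFFFF) << 40) |
               ((uint64_t)(j24_lo & 0xFFFFFF) << 16) |
               (imm16 & 0xFFFF);
    }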


    But, does end up with an ISA layout where immediate values are mostly 7u
    or 7n, which is not nearly as attractive as 9u and 9n.

    Say, for Load/Store displacement hit (rough approximations, from memory):
    5u: 35%
    7u: 65%
    9u: 90%
    ...


    All turns into a bit of an annoying numbers game sometimes...


    But, this ended up as part of why I ended up with XG2, which didn't give
    me everything I wanted, and the encodings of some things do have more
    "dog chew" than I would like (I would have preferred if everything were
    nice contiguous fields, rather than the bits for each register field
    being scattered across the instruction word).

    But, the numbers added up in a way that worked better than most of the alternatives I could come up with (and happened to also be the "least
    effort" implementation path).


    Granted, I still keep half expecting people to be like "Dude, just jump
    onto the RISC-V wagon...".

    Or, failing this, at least implement enough of RISC-V to be able to run
    Linux on it (but, this would require significant architectural changes;
    being able to run a "stock" RV64GC Linux build would effectively require partially cloning a bunch of SiFive's architectural choices or similar;
    which is not something I would be happy with).

    But, otherwise, pretty much any other option in this area would still
    mean a porting effort...


    Well, and the on/off consideration of trying to port a BSD variant, as
    BSD seemed like potentially less effort (there is far less implicit
    assumptions of GNU related stuff being used).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 10 01:11:13 2023
    Quadibloc wrote:

    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <
    So what I had done was, after squeezing as much as I could into a basic instruction format, I provided for switching into alternate instruction formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <
    This has now been dropped. Since I managed to get the normal (unaligned) memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises in the basic instruction set, it wasn't needed to have multiple instruction formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.
    <
    I had to change the instructions longer than 32 bits to get them in the
    basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <
    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Quadibloc on Fri Nov 10 00:29:00 2023
    In article <uijjoj$2dc2i$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    What do you consider the virtues of Itanium to be?

    No company ever seems to have taken it up on technical grounds, only as a result of Intel and HP persuading commercial managers that it would
    become widely used owing to their market power.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to John Dallman on Fri Nov 10 04:31:45 2023
    On Fri, 10 Nov 2023 00:29:00 +0000, John Dallman wrote:

    In article <uijjoj$2dc2i$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    What do you consider the virtues of Itanium to be?

    Well, I think that superscalar operation of microprocessors is a good
    thing. Explicitly indicating which instructions may execute in parallel
    is one way to facilitate that. Even if the Itanium was an unsuccessful implementation of that principle.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Fri Nov 10 04:37:16 2023
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    It's only in the 16-bit operate instructions that this splitting of
    registers is actively present as a constraint. It is needed to make
    16-bit operate instructions possible.

    So the cure is that if a compiler finds this too much trouble, it
    doesn't have to use the 16-bit instructions.

    Of course, if compilers can't use them, that raises the question of
    whether 16-bit instructions are worth having. Without them, the
    complications that I needed to be happy about my memory-reference
    instructions could have been entirely avoided.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Thu Nov 9 22:19:48 2023
    On 11/9/2023 7:11 PM, MitchAlsup wrote:
    Quadibloc wrote:


    Good to see you are back on here...


    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <

    Yes.

    As noted a few times, as I see it, 9..12 bits is sufficient.
    Much less than 9 is "not enough", much more than 12 is wasting entropy,
    at least for 32-bit encodings.


    12u-scaled would be "pretty good", say, being able to handle 32K for
    QWORD ops.


    So what I had done was, after squeezing as much as I could into a basic
    instruction format, I provided for switching into alternate instruction
    formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <

    Also yes.


    This has now been dropped. Since I managed to get the normal (unaligned)
    memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises
    in the basic instruction set, it wasn't needed to have multiple instruction
    formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the problem is so small as to be compelling.
    <

    In my case, it is only for 128-bit load/store operations, which require
    64-bit alignment.

    Well, and an esoteric edge case:
    if((PC&0xE)==0xE)
    You can't use a 96-bit encoding, and will need to insert a NOP if one
    needs to do so.


    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
    Fast memcpy;
    LZ decompression;
    Huffman;
    ...


    I had to change the instructions longer than 32 bits to get them in
    the basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the
    pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <

    Mostly similar.
    Though, I guess some people could debate this in my case.


    Granted, I specify the entire ISA in a single location, rather than
    spreading it across a bunch of different documents (as was the case with RISC-V).

    Well, and where there is a lot that is left up to the specific hardware implementations in terms of stuff that one would need to "actually have
    an OS run on it", ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Nov 10 04:43:14 2023
    On Fri, 10 Nov 2023 01:11:13 +0000, MitchAlsup wrote:

    I never had any aligned memory references. The HW overhead to "fix" the problem is so small as to be compelling.

    Since I have a complete set of memory-reference instructions for which unaligned accesses are supported, the problem isn't
    that I think unaligned fetches and stores take too many gates.

    Rather, being able to only specify aligned accesses saves *opcode space*,
    which lets me fit in one complete set of memory-reference instructions that
    can use all the base registers, all the index registers, and always use all
    the registers as destination registers.

    While the unaligned-capable instructions, which also offer important
    additional addressing modes, had to have certain restrictions to fit in.

    So they use six out of the seven index registers, they can use only half
    the registers as destination registers on indexed accesses, and they use
    four out of the seven base registers.

    Having 16-bit instructions for the possibility of more compact code meant
    that I had to have at least one of the two restrictions noted above -
    having both restrictions meant that I could offer the alternative of aligned-only instructions with neither restriction, which may be far less painful for some.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Fri Nov 10 00:46:43 2023
    On 11/9/2023 10:37 PM, Quadibloc wrote:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    It's only in the 16-bit operate instructions that this splitting of
    registers is actively present as a constraint. It is needed to make
    16-bit operate instructions possible.


    FWIW: I went with 16-bit ops with 4-bit register fields (with a small
    subset with 5-bit register fields).

    Granted, layout was different than SH:
    zzzz-nnnn-mmmm-zzzz //typical SH layout
    zzzz-zzzz-nnnn-mmmm //typical BJX2 layout

    Where, as noted, typical 32-bit layout in my case is:
    111p-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ
    And, in XG2:
    NMOP-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ




    I guess, a "minor" reorganization might yield, say:
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R)
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R)
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI, Imm10)
    PwZZ-ZZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI, Imm10)
    PwZZ-ZZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)
    PwZZ-ZZZZ-iiii-iiii-iiii-iiii-iiii-iiii (Imm24)

    Which seems like actually a relatively nice layout thus far...


    Possibly, going further:
    Pw00-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R Space)
    Pw00-1111-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R Space)

    Pw01-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (Ld/St Disp10)

    Pw10-0ZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI Imm10, ALU Block)
    Pw10-1ZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI Imm10)

    Pw11-0ZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)

    Pw11-1110-iiii-iiii-iiii-iiii-iiii-iiii BRA Disp24s (+/- 32MB)
    Pw11-1111-iiii-iiii-iiii-iiii-iiii-iiii BSR Disp24s (+/- 32MB)

    1111-111Z-iiii-iiii-iiii-iiii-iiii-iiii Jumbo


    Though, might almost make sense for PrWEX to be N/E, as the PrWEX blocks
    seem to be infrequently used in BJX2 (basically, for predicated
    instructions that exist as part of an instruction bundle).

    Say:
    Scalar: 77.3%
    WEX : 8.9%
    Pred : 13.5%
    PrWEX : 0.3%


    So the cure is that if a compiler finds this too much trouble, it
    doesn't have to use the 16-bit instructions.

    Of course, if compilers can't use them, that raises the question of
    whether 16-bit instructions are worth having. Without them, the
    complications that I needed to be happy about my memory-reference instructions could have been entirely avoided.


    For performance optimized cases, I am starting to suspect 16-bit ops are
    not worth it.

    For size optimization, they make sense; but size optimization also means
    mostly confining register allocation to R0..R15 in my case, with
    heuristics for when to enable additional registers, where enabling the
    higher registers effectively hinders the use of 16-bit instructions.


    The other option I have found is that, rather than optimizing for
    smaller instructions (as in an ISA with 16 bit instructions), one can
    instead optimize for doing stuff in as few instructions as it is
    reasonable to do so, which in turn further goes against the use of
    16-bit instructions.


    And, thus far, I am ending up building a lot of my programs in XG2 mode
    despite the slightly worse code density (leaving the main "hold outs"
    for the Baseline encoding mostly being the kernel and Boot ROM).

    The kernel could go over to XG2 without too much issue, mostly leaving
    the Boot ROM. Switching over the ROM would require some functional
    tweaks (coming out of reset in a different mode), as well as probably
    either increasing the size of the ROM or removing some stuff (building
    the Boot ROM as-is in XG2 mode would exceed the current 32K limit).


    Granted, the main things the ROM contains are a bunch of boot-time sanity
    check stuff, a RAM counter, FAT32 driver, and stuff to init the graphics
    module (such as a Boot-time ASCII font, *).

    *: Though, this font saves some space by only encoding the ASCII-range characters, and packing the character glyphs into 5*6 pixels (allowing
    32-bits, rather than the 64-bits needed for an 8x8 glyph). This won out aesthetically over using a 7-segment or 14-segment font (as well as it
    taking more complex logic to unpack 7 or 14 segment into an 8x8
    character cell).
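    A sketch of what unpacking such a glyph might look like (the bit order and
    centering are assumptions; only "5x6 pixels in 32 bits" is from the above):

    #include <stdint.h>

    /* Expand a 5x6 glyph packed into the low 30 bits of 'packed' into an
       8x8 cell (one byte per row, one bit per pixel). */
    static void unpack_glyph_5x6(uint32_t packed, uint8_t cell[8]) {
        for (int row = 0; row < 8; row++)
            cell[row] = 0;                                /* blank padding rows */
        for (int row = 0; row < 6; row++) {
            uint32_t bits = (packed >> (row * 5)) & 0x1F; /* 5 pixels per row */
            cell[row + 1] = (uint8_t)(bits << 2);         /* pad out to 8 columns */
        }
    }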

    Where, say, unlike a CGA or VGA, the initial font is not held in a
    hardware ROM. There was originally, but it was cheaper to manage the
    font in software, effectively using the VRAM as a plain color-cell
    display in text mode.

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Quadibloc on Fri Nov 10 14:51:44 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    As soon as you make 'general purpose registers' not 'general'
    you've significantly complicated register allocation in compilers
    and likely caused additional memory accesses due to the need to
    spill registers unnecessarily.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 10 18:26:20 2023
    Quadibloc wrote:

    On Fri, 10 Nov 2023 00:29:00 +0000, John Dallman wrote:

    In article <uijjoj$2dc2i$1@dont-email.me>, quadibloc@servername.invalid
    (Quadibloc) wrote:

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    What do you consider the virtues of Itanium to be?

    Itanic's main virtue was to consume several Intel design teams, over 20
    years, preventing Intel from taking over the entire µprocessor market.

    I, personally, don't believe in exposing the scalarity to the compiler,
    nor the rotating register file to do what renaming does naturally,
    nor the lack of proper FP instructions (FDIV, SQRT), ...

    Academic quality at industrial prices.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 10 18:29:56 2023
    Quadibloc wrote:

    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    It's only in the 16-bit operate instructions that this splitting of
    registers is actively present as a constraint. It is needed to make
    16-bit operate instructions possible.

    So the cure is that if a compiler finds this too much trouble, it
    doesn't have to use the 16-bit instructions.
    <
    Then why are they there ??
    <
    I think you will find (like RISC-V is) that having and not mandating use
    means you get a bit under ½ of what you think you are getting.
    <
    Of course, if compilers can't use them, that raises the question of
    whether 16-bit instructions are worth having. Without them, the
    complications that I needed to be happy about my memory-reference instructions could have been entirely avoided.
    <
    There is a subset of RISC-V designers who want to discard the 16-bit
    subset in order to solve the problems of the 32-bit set.
    <
    I might note: given the space of the compressed ISA in RISC-V, I could
    install the entire My 66000 ISA and then not need any of the RISC-V
    ISA.....
    <
    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Fri Nov 10 12:24:08 2023
    On 11/10/2023 8:51 AM, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    As soon as you make 'general purpose registers' not 'general'
    you've significantly complicated register allocation in compilers
    and likely caused additional memory accesses due to the need to
    spill registers unnecessarily.

    Yeah.

    Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather
    suck".

    Or, even smaller cases, like, "most instructions can use all the
    registers, but these ops only work on a subset" is kind of an annoyance
    (this is a big part of why I bothered with the whole XG2 thing).


    Much better to have a big flat register space.


    Though, within reason.
    Say:
    * 8: Pain, can barely hold anything in registers.
    ** One barely has enough for working values for expressions, etc.
    * 16: Not quite enough, still lots of spill/fill.
    * 32: Can work well, with a good register allocator;
    * 64: Can largely eliminate spill/fill, but a little much.
    * 128: Too many.
    * 256: Absurd.

    So, say, 32 and 64 seem to be the "good" area, where with 32, a majority
    of the functions can sit comfortably with most or all of their variables
    held in registers. But, for functions with a large number of variables
    (say, 100 or more), spill/fill becomes an issue (*).

    Having 64 allows a majority of functions to use a "static assign
    everything" strategy, where spill/fill can be eliminated entirely (apart
    from the prolog/epilog sequences), and otherwise seems to deal better
    with functions with large numbers of variables.


    *: And is more of a pain with a register allocator design which can't
    keep any non-static-assigned values in registers across basic-block
    boundaries. This issue is, ironically, less obvious with 16 registers
    (since spill/fill runs rampant anyways). But having nearly every basic
    block start with a blob of stack loads, and end with a blob of stores,
    only to reload them all again on the other side of a label, is fairly
    obvious.

    Having 64 registers does at least mostly address this issue...


    Meanwhile, for 128, there aren't really enough variables and temporaries
    in most functions to make effective use of them. Also, 7-bit register
    fields won't fit easily into a 32-bit instruction word.


    As for register arguments:
    * Probably 8 or 16.
    ** 8 makes the most sense with 32 GPRs.
    *** 16 is asking too much.
    *** 8 deals with around 98% of functions.
    ** 16 makes sense with 64 GPRs.
    *** Nearly all functions can use exclusively register arguments.
    *** Gain is small though, if it only benefits 2% of functions.
    *** It is almost a "shoo-in", except for the cost of fixed spill space
    *** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
    *** Though, an ABI could decide to not have a spill space in this way.

    Though, admittedly, for a lot of my programs I had still ended up going
    with 8 register arguments with 64 GPRs, mostly as the gain from 16
    arguments is small relative to the cost of spending an additional 64
    bytes in nearly every stack frame (and also there are still some
    unresolved bugs when using 16-argument mode).

    ...



    Current leaning is also that:
    32-bit primary instruction size;
    32/64/96 bit for variable-length instructions;
    Is "pretty good".

    In performance-oriented use cases, 16-bit encodings "aren't really worth
    it".
    In cases where you need a 32 or 64 bit value, being able to encode them
    or load them quickly into a register is ideal. Spending multiple
    instructions to glue a value together isn't ideal, nor is needing to
    load it from memory (this particularly sucks from the compiler POV).


    As for addressing modes:
    (Rb, Disp) : ~ 66-75%
    (Rb, Ri) : ~ 25-33%
    Can address the vast majority of cases.

    Displacements are most effective when scaled by the size of the element
    type, as unaligned displacements are exceedingly rare. The vast majority
    of displacements are also positive.
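
    (A minimal sketch of what the two modes above compute, assuming the
    scaled-displacement convention described earlier, i.e. the displacement
    counts elements rather than bytes; whether the index register is also
    scaled is ISA-specific and is assumed here purely for illustration.)

    #include <stdint.h>

    uint64_t ea_disp(uint64_t rb, uint32_t disp, unsigned elem_size)
    {   return rb + (uint64_t)disp*elem_size;   }   /* (Rb, Disp) */

    uint64_t ea_index(uint64_t rb, uint64_t ri, unsigned elem_size)
    {   return rb + ri*elem_size;   }               /* (Rb, Ri)   */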

    Not having a register-indexed mode is shooting oneself in the foot, as
    these are "not exactly rare".

    Most other possible addressing modes can be mostly ignored.
    Auto-increment becomes moot if one has superscalar or VLIW;
    (Rb, Ri, Disp) is only really applicable in niche cases
    Eg, array inside struct, etc.
    ...



    RISC-V did sort of shoot itself in the foot in several of these areas,
    albeit with some workarounds in "Bitmanip":
    SHnADD, can mimic a LEA, allowing array access in fewer ops.
    PACK, allows an inline 64-bit constant load in 5 instructions...
    LUI+ADD+LUI+ADD+PACK
    ...

    Still not ideal...
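
    (To make that 5-instruction sequence concrete: a rough C model of what it
    computes, with each 32-bit half built by LUI plus an add, modeled here as
    a sign-extended 32-bit value, and assuming Zbkb PACK semantics on RV64,
    i.e. rd = { rs2[31:0], rs1[31:0] }. Function names are mine.)

    #include <stdint.h>

    static uint64_t lui_add(uint32_t half)   /* models LUI+ADD for one half */
    {   return (uint64_t)(int64_t)(int32_t)half;   }

    static uint64_t pack_rv64(uint64_t rs1, uint64_t rs2)   /* models PACK */
    {   return (rs1 & 0xFFFFFFFFull) | (rs2 << 32);   }

    static uint64_t load_const64(uint64_t k)
    {
        uint64_t lo = lui_add((uint32_t)k);           /* instructions 1-2 */
        uint64_t hi = lui_add((uint32_t)(k >> 32));   /* instructions 3-4 */
        return pack_rv64(lo, hi);                     /* instruction 5    */
    }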

    An extra cycle for memory access is not ideal for a close second-place
    addressing mode; nor are 64-bit constants rare enough that one
    necessarily wants to spend 5 or so clock cycles on them.

    But, still better than the situation where one does not have these instructions.

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Nov 10 18:22:43 2023
    BGB wrote:

    On 11/9/2023 7:11 PM, MitchAlsup wrote:
    Quadibloc wrote:


    Good to see you are back on here...


    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <

    Yes.

    As noted a few times, as I see it, 9 .. 12 is sufficient.
    Much less than 9 is "not enough", much more than 12 is wasting entropy,
    at least for 32-bit encodings.
    <
    Can you suggest something I could have done by sacrificing 16-bits
    down to 12-bits that would have improved "something" in my ISA ??
    {{You see I did not have any trouble in having all 16-bits for MEM references--just like having 16-bits for integer, logical, and branch offsets.}}
    <
    12u-scaled would be "pretty good", say, being able to handle 32K for
    QWORD ops.
    <
    IBM 360 found so, EMBench is replete with stack sizes and struct sizes
    where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit... Exactly the difference between 12-bits and 14-bits....

    So what I had done was, after squeezing as much as I could into a basic
    instruction format, I provided for switching into alternate instruction
    formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <

    Also yes.


    This has now been dropped. Since I managed to get the normal (unaligned)
    memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises
    in the basic instruction set, it wasn't needed to have multiple instruction
    formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.
    <

    In my case, it is only for 128-bit load/store operations, which require 64-bit alignment.
    <
    VVM does all the wide stuff without necessitating the wide stuff in
    registers or instructions.
    <
    Well, and an esoteric edge case:
    if((PC&0xE)==0xE)
    You can't use a 96-bit encoding, and will need to insert a NOP if one
    needs to do so.
    <
    Ehhhhh...
    <
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
    Fast memcpy;
    LZ decompression;
    Huffman;
    ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <

    I had to change the instructions longer than 32 bits to get them in
    the basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the
    pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <

    Mostly similar.
    Though, I guess some people could debate this in my case.


    Granted, I specify the entire ISA in a single location, rather than
    spreading it across a bunch of different documents (as was the case with RISC-V).

    Well, and where there is a lot that is left up to the specific hardware implementations in terms of stuff that one would need to "actually have
    an OS run on it", ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 10 12:48:10 2023
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/9/2023 7:11 PM, MitchAlsup wrote:
    Quadibloc wrote:


    Good to see you are back on here...


    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <

    Yes.

    As noted a few times, as I see it, 9 .. 12 is sufficient.
    Much less than 9 is "not enough", much more than 12 is wasting
    entropy, at least for 32-bit encodings.
    <
    Can you suggest something I could have done by sacrificing 16-bits
    down to 12-bits that would have improved "something" in my ISA ??
    {{You see I did not have any trouble in having all 16-bits for MEM references--just like having 16-bits for integer, logical, and branch offsets.}}
    <
    12u-scaled would be "pretty good", say, being able to handle 32K for
    QWORD ops.
    <
    IBM 360 found so, EMBench is replete with stack sizes and struct sizes
    where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit... Exactly the difference between 12-bits and 14-bits....


    RISC-V is 12-bit signed unscaled (which can only do +/- 2K).

    On average, 12-bit signed unscaled is actually worse than 9-bit unsigned
    scaled (4K range, for QWORD).

    So, ironically, despite BJX2 having smaller displacements than RISC-V,
    it actually deals better with the larger stack frames.


    But, if one could address 32K, this should cover the vast majority of
    structs and stack-frames.


    A 16-bit unsigned scaled displacement would cover 512K for QWORD ops,
    which could be nice, but likely unnecessary.


    So what I had done was, after squeezing as much as I could into a basic >>>> instruction format, I provided for switching into alternate instruction >>>> formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <

    Also yes.


    This has now been dropped. Since I managed to get the normal
    (unaligned)
    memory-reference instruction squeezed into so much less opcode space
    that
    I also had room for the aligned memory-reference format without
    compromises
    in the basic instruction set, it wasn't needed to have multiple
    instruction
    formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.
    <

    In my case, it is only for 128-bit load/store operations, which
    require 64-bit alignment.
    <
    VVM does all the wide stuff without necessitating the wide stuff in
    registers or instructions.
    <
    Well, and an esoteric edge case:
       if((PC&0xE)==0xE)
    You can't use a 96-bit encoding, and will need to insert a NOP if one
    needs to do so.
    <
    Ehhhhh...
    <

    This is mostly due to a quirk in the L1 I$ design, where "fixing" it
    costs more than just being like, "yeah, this case isn't allowed" (and
    having the compiler emit a NOP in the rare edge cases it is encountered).


    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...

    But, yeah, for me, a major selling points for unaligned access is mostly
    that I can copy blocks of memory around like:
    v0=((uint64_t *)cs)[0];
    v1=((uint64_t *)cs)[1];
    v2=((uint64_t *)cs)[2];
    v3=((uint64_t *)cs)[3];
    ((uint64_t *)ct)[0]=v0;
    ((uint64_t *)ct)[1]=v1;
    ((uint64_t *)ct)[2]=v2;
    ((uint64_t *)ct)[3]=v3;
    cs+=32; ct+=32;

    For Huffman, some of the fastest strategies to implement the bitstream reading/writing, tend to be to casually make use of unaligned access
    (shifting in and loading bytes is slower in comparison).

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).



    I had to change the instructions longer than 32 bits to get them in
    the basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the
    pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it
    for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <

    Mostly similar.
    Though, I guess some people could debate this in my case.


    Granted, I specify the entire ISA in a single location, rather than
    spreading it across a bunch of different documents (as was the case
    with RISC-V).

    Well, and where there is a lot that is left up to the specific
    hardware implementations in terms of stuff that one would need to
    "actually have an OS run on it", ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to BGB on Fri Nov 10 11:17:37 2023
    On 11/10/2023 10:24 AM, BGB wrote:
    On 11/10/2023 8:51 AM, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    As soon as you make 'general purpose registers' not 'general'
    you've significantly complicated register allocation in compilers
    and likely caused additional memory accesses due to the need to
    spill registers unnecessarily.

    Yeah.

    Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather suck".

    Or, even smaller cases, like, "most instructions can use all the
    registers, but these ops only work on a subset" is kind of an annoyance
    (this is a big part of why I bothered with the whole XG2 thing).


    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions. So
    an alternative is to break the requirement that all register specifier
    fields in the instruction be the same length. So, for example, allow
    access to all registers from one source operand position, but say only
    half from the other source operand position. So, for a system with 32 registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
    as with commutative operations like adds, this doesn't hurt at all.

    Yes, this makes register allocation in the compiler harder. And
    occasionally you might need an extra instruction to copy a value to the
    half size field, but on high end systems, this can be done in the rename
    stage without taking an execution slot.

    A more extreme alternative is to only allow the destination field to
    also be one bit smaller. Of course, this makes things even harder for
    the compiler, and probably requires extra "copy" instructions more
    frequently, but sometimes you just gotta do what you gotta do. :-(

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Fri Nov 10 22:03:23 2023
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.

    This would make your ISA very un-S/360-like.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Fri Nov 10 23:25:41 2023
    Thomas Koenig wrote:

    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.
    <
    But follows S.E.L 32/{...} series and several other minicomputers with
    isolated base registers. In the 32/{..} series, there were 2 LDs and 2 STs:
    1 LD was byte (signed) with 19-bit displacement
    2 LD was size (signed) with the lower bits of displacement specifying size.
    3 ST was byte <ibid>
    4 ST was size <ibid>
    <
    only registers 1-7 could be used as base register.
    <
    I saw several others using similar tricks but can't remember.....
    <
    This would make your ISA very un-S/360-like.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Nov 10 23:21:08 2023
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <

    No, I am arguing that all memory references are inherently unaligned, but where
    aligned references never suffer a stall penalty; and the compiler does not
    need to understand whether the reference is aligned or unaligned.
    <
    But, yeah, for me, a major selling points for unaligned access is mostly
    that I can copy blocks of memory around like:
    v0=((uint64_t *)cs)[0];
    v1=((uint64_t *)cs)[1];
    v2=((uint64_t *)cs)[2];
    v3=((uint64_t *)cs)[3];
    ((uint64_t *)ct)[0]=v0;
    ((uint64_t *)ct)[1]=v1;
    ((uint64_t *)ct)[2]=v2;
    ((uint64_t *)ct)[3]=v3;
    cs+=32; ct+=32;
    <
    MM Rcs,Rct,#length // without the for loop
    <
    For Huffman, some of the fastest strategies to implement the bitstream reading/writing, tend to be to casually make use of unaligned access (shifting in and loading bytes is slower in comparison).

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).
    <
    Traps to perform unaligned are so 1985......either don't allow them at all (SIGSEGV) or treat them as first class citizens. The former fails in the market.
    <


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 10 20:37:38 2023
    On 11/10/2023 5:21 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <

    No, I am arguing that all memory references are inherently unaligned,
    but where aligned references never suffer a stall penalty; and the
    compiler does not need to understand whether the reference is aligned
    or unaligned.
    <

    OK, fair enough.

    I don't have separate aligned/unaligned ops for anything QWORD or
    smaller, as all these cases are implicitly unaligned.

    Though, aligned is sometimes a little faster, due to playing better with
    the L1 cache; but, using misaligned memory access is generally faster
    than any of the traditional workarounds (the difference being mostly a
    slight increase in the probability of triggering an L1 cache miss).


    The main exception is MOV.X requiring 64-bit alignment (for a 128-bit
    memory access), but the unaligned fallback here is to use a pair of
    MOV.Q instructions instead.

    But, this was in part because of how the L1 caches were implemented, and supporting fully unaligned 128-bit access would have been more expensive
    (and the relative gain is smaller).

    This does mean alternate logic for aligned vs unaligned "memcpy()", with
    the unaligned case being a little slower as a result of needing to use
    MOV.Q ops.


    It is possible a case could be made for allowing fully unaligned MOV.X
    as well.

    Would mostly involve reworking how MOV.X is implemented relative to the
    extract/insert logic (likely internally working with 192 bits rather
    than 128; as-is, MOV.X is implemented by bypassing the main
    extract/insert logic).


    But, yeah, for me, a major selling points for unaligned access is
    mostly that I can copy blocks of memory around like:
       v0=((uint64_t *)cs)[0];
       v1=((uint64_t *)cs)[1];
       v2=((uint64_t *)cs)[2];
       v3=((uint64_t *)cs)[3];
       ((uint64_t *)ct)[0]=v0;
       ((uint64_t *)ct)[1]=v1;
       ((uint64_t *)ct)[2]=v2;
       ((uint64_t *)ct)[3]=v3;
       cs+=32; ct+=32;
    <
        MM   Rcs,Rct,#length            // without the for loop <

    I typically use a "while()" loop or similar, but yeah...

    At present, the fastest loop strategy is generally:
    while(n--)
    {
    ...
    }
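
    (Putting the two together, a hedged sketch of the copy idiom being
    described: 32 bytes per iteration with the count handled via while(n--).
    Names follow the earlier snippet, n counts 32-byte blocks, and the target
    is assumed to tolerate unaligned 64-bit accesses, as discussed above.)

    #include <stdint.h>
    #include <stddef.h>

    void copy32n(unsigned char *ct, const unsigned char *cs, size_t n)
    {
        uint64_t v0, v1, v2, v3;
        while(n--)
        {
            v0=((const uint64_t *)cs)[0];  v1=((const uint64_t *)cs)[1];
            v2=((const uint64_t *)cs)[2];  v3=((const uint64_t *)cs)[3];
            ((uint64_t *)ct)[0]=v0;  ((uint64_t *)ct)[1]=v1;
            ((uint64_t *)ct)[2]=v2;  ((uint64_t *)ct)[3]=v3;
            cs+=32; ct+=32;
        }
    }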




    For Huffman, some of the fastest strategies to implement the bitstream
    reading/writing, tend to be to casually make use of unaligned access
    (shifting in and loading bytes is slower in comparison).

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive
    cores and similar).
    <
    Traps to perform unaligned are so 1985......either don't allow them at all (SIGSEGV) or treat them as first class citizens. The former fails in the market.
    <


    Apparently SiFive went this way, for some reason...

    Like, RISC-V requires unaligned access to work, but doesn't specify how,
    and apparently they considered trapping to be an acceptable option, but trapping sucks for performance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Sat Nov 11 05:39:59 2023
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:

    Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.

    No doubt you're right.

    As that means my 16-bit instructions, with the registers split into four
    parts, are useless to compilers, now I have to go around in circles again.
    I thought I had finally achieved a single instruction format that satisfied
    my ambitions - and now I find it is fatally flawed.

    One possibility is to go back to the full format for 32-bit memory
    reference instructions. That will still leave me enough opcode space that a four-bit prefix could precede three 20-bit short instructions. To avoid creating a variable-length instruction set, which complicates decoding,
    I would require such blocks to be aligned on 64-bit boundaries.

    So now there's a nested block structure, of 64-bit blocks inside 256-bit blocks!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Quadibloc on Sat Nov 11 07:07:00 2023
    In article <uikbng$2lh5f$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    Well, I think that superscalar operation of microprocessors is a
    good thing.

    Indeed.

    Explicitly indicating which instructions may execute in parallel
    is one way to facilitate that. Even if the Itanium was an
    unsuccessful implementation of that principle.

    Intel tried that with the Pentium, with its two pipelines and run-time automatic instruction scheduling, to moderate success. They tried it with
    the i860, with compiler scheduling and a comprehensive lack of success.
    The Itanium tried the i860 method, much harder and was still unsuccessful.


    In engineering, the gap between "Doing this would be good" and "Here it
    is working" generally involves having a good idea about /how/ to do it.

    Finding an example where explicit but non-automatic parallelism worked
    for general-purpose code and figuring out how that was done should be
    easier than inventing a method. In the absence of that, we have some
    evidence that just hoping the software people will solve this problem for
    you doesn't work.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Quadibloc on Sat Nov 11 06:50:00 2023
    In article <uijk93$2dc2i$2@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    This lends itself to writing code where four distinct threads are interleaved, helping pipelining in implementations too cheap to have out-of-order execution.

    This is not the conventional way of implementing threads, and seems to
    have some drawbacks:

    One of the uses of threads is to scale to the hardware resources
    available. With this approach, the number of threads is baked in at
    compile time.

    Debugging such interleaved threads is likely to be even more confusing
    than debugging multiple threads usually is.

    Pipeline stalls affect every thread, rather than just the thread that
    triggers them.

    The common threading APIs also lack a way to set such threads to work,
    but that's a far more soluble problem.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Nov 11 07:22:21 2023
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).

    Let's see what this SiFive U74 does:

    [fedora-starfive:~/nfstmp/gforth-riscv:98397] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye ':

    469832112 instructions:u # 0.79 insn per cycle
    591015904 cycles:u

    0.609751748 seconds time elapsed

    0.533195000 seconds user
    0.061522000 seconds sys


    [fedora-starfive:~/nfstmp/gforth-riscv:98398] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye ':

    53533370273 instructions:u # 0.77 insn per cycle
    69304924487 cycles:u

    69.368484169 seconds time elapsed

    69.256290000 seconds user
    0.049997000 seconds sys

    So when we do aligned accesses (first command), the code performs 4.7 instructions and 5.9 cycles per load, while for unaligned accesses
    (second command) the same code performs 535.3 instructions and 693.0
    cycles per load. So apparently an unaligned load triggers >500
    additional instructions, confirming your claim. Interestingly, all
    that is attributed to user time; maybe the fixup is performed by a
    user-level trap or microcode.

    Still, the approach of having separate instructions for aligned and
    unaligned accesses (typically with several instructions for the
    unaligned case) has been tried and discarded. Software just does not
    declare that some access will be unaligned.

    A particularly strong evidence for this is that gas generated
    non-working code for ustq (unaligned store quadword) on Alpha for
    several years, and apparently nobody noticed until I gave an exercise
    to my students where they should use ustq (so no production use,
    either).

    So, every general-purpose architecture, including RISC-V, the
    spiritual descendant of MIPS and Alpha (which had the division),
    settled on having memory access instructions that perform both aligned
    and unaligned accesses (with performance advantages for aligned
    accesses).

    If RISC-V implementations want to perform well for code that uses
    unaligned accesses for memory copying, compression/decompression, or
    hashing, they will eventually have to implement unaligned accesses
    more efficiently, but at least the code works, and aligned accesses
    are fast.

    Why would you not go the same way? It would also save on instruction
    encoding space.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Sat Nov 11 03:03:18 2023
    On 11/11/2023 1:22 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing


    Possibly true.


    Some of my data hash/checksum functions were along the lines of:
    uint32_t *cs, *cse;
    uint64_t v0, v1, v;

    cs=buf; cse=buf+((sz+3)>>2);
    v0=1; v1=1;
    while(cs<cse)
    {
        v=*cs++;
        v0+=v;
        v1+=v0;
    }
    v0=((uint32_t)v0)+(v0>>32); //*
    v1=((uint32_t)v1)+(v1>>32);
    v0=((uint32_t)v0)+(v0>>32);
    v1=((uint32_t)v1)+(v1>>32);
    v=(uint32_t)(v0^v1);

    *: This step may seem frivolous, but seems to increase the strength of
    the checksum.

    There are faster variants, but this one gives the general idea.
    Not aware of anyone else doing it this way, but it is faster than either Adler32 or CRC32, while giving some similar properties (the second sum detecting various issues which would be missed with a single sum).

    A faster variant of this is to run multiple sets of sums in parallel
    and then combine the values at the end.
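
    (A hedged sketch of that parallel variant, my own paraphrase rather than
    the actual code; note it does not produce the same value as the serial
    version above, since the words are interleaved into two sum pairs before
    being folded together.)

    #include <stdint.h>
    #include <stddef.h>

    uint32_t checksum2(const uint32_t *buf, size_t sz)
    {
        const uint32_t *cs=buf, *cse=buf+((sz+3)>>2);
        uint64_t a0=1, a1=1, b0=1, b1=1, v0, v1, v;
        while((cse-cs)>=2)
        {
            v=cs[0]; a0+=v; a1+=a0;     /* first sum pair  */
            v=cs[1]; b0+=v; b1+=b0;     /* second sum pair */
            cs+=2;
        }
        if(cs<cse)
        {   v=*cs; a0+=v; a1+=a0;   }
        v0=a0+b0; v1=a1+b1;             /* combine the pairs, then fold 64->32 */
        v0=((uint32_t)v0)+(v0>>32);
        v1=((uint32_t)v1)+(v1>>32);
        v0=((uint32_t)v0)+(v0>>32);
        v1=((uint32_t)v1)+(v1>>32);
        return (uint32_t)(v0^v1);
    }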


    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).

    Let's see what this SiFive U74 does:

    [fedora-starfive:~/nfstmp/gforth-riscv:98397] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye ':

    469832112 instructions:u # 0.79 insn per cycle
    591015904 cycles:u

    0.609751748 seconds time elapsed

    0.533195000 seconds user
    0.061522000 seconds sys


    [fedora-starfive:~/nfstmp/gforth-riscv:98398] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye ':

    53533370273 instructions:u # 0.77 insn per cycle
    69304924487 cycles:u

    69.368484169 seconds time elapsed

    69.256290000 seconds user
    0.049997000 seconds sys

    So when we do aligned accesses (first command), the code performs 4.7 instructions and 5.9 cycles per load, while for unaligned accesses
    (second command) the same code performs 535.3 instructions and 693.0
    cycles per load. So apparently an unaligned load triggers >500
    additional instructions, confirming your claim. Interestingly, all
    that is attributed to user time; maybe the fixup is performed by a
    user-level trap or microcode.


    I wasn't that sure how it was implemented, but it is "kinda weak" in any
    case.

    On the BJX2 core, the performance impact of using misaligned load and
    store is approximately 3% in my tests, I suspect mostly due to a
    slightly higher incidence of L1 cache misses.


    Still, the approach of having separate instructions for aligned and
    unaligned accesses (typically with several instructionf for the
    unaligned case) has been tried and discarded. Software just does not
    declare that some access will be unaligned.

    A particularly strong evidence for this is that gas generated
    non-working code for ustq (unaligned store quadword) on Alpha for
    several years, and apparently nobody noticed until I gave an exercise
    to my students where they should use ustq (so no production use,
    either).

    So, every general-purpose architecture, including RISC-V, the
    spiritual descendent of MIPS and Alpha (which had the division),
    settled on having memory access instructions that perform both aligned
    and unaligned accesses (with performance advantages for aligned
    accesses).

    If RISC-V implementations want to perform well for code that uses
    unaligned accesses for memory copying, compression/decompression, or
    hashing, they will eventually have to implement unaligned accesses
    more efficiently, but at least the code works, and aligned accesses
    are fast.

    Why would you not go the same way? It would also save on instruction encoding space.


    I was never claiming that one should have separate instructions (since,
    if the L1 cache supports unaligned access, what is the point of having
    aligned only variants of the instructions?...).


    Rather, that it might make sense to do an aligned-only core, and then
    trap on misaligned (possibly allowing the access to be emulated, as with
    the SiFive cores); mostly in the name of making the L1 cache cheaper.


    A few of my small core experiments had used aligned-only L1 caches, but
    I mostly went with a natively unaligned designs for my bigger ISA
    designs, mostly as I tend to make frequent use of unaligned memory
    access as a "performance trick".



    However, BJX2 has a natively unaligned L1 cache (well, apart from MOV.X).

    Have gone and added the logic to allow MOV.X to be unaligned as well,
    which mostly has the effect of a minor increase in LUT cost and similar
    (mostly as the internal extract/insert logic needed to be widened from
    128 to 192 bits to deal with this; with MOV.X now being handled in a
    similar way to MOV.Q when this feature is enabled).


    Though, one thing is whether to "formally fix" the Op96 at
    ((PC&0xE)==0xE) issue. Ironically, in this case, the "fix" is already
    present in the Verilog code; the restriction just exists more as a
    "break glass to save some LUTs" option.


    Well, along with some other wonk, like leaving it as undefined what
    happens if the instruction stream is allowed to cross a 4GB boundary,
    ... Branching is fine, just the PC increment logic can save some latency
    by not bothering with the high 16 bits.

    I guess, in an ideal world, there wouldn't be a lot of this wonk, but
    needing to often battle with timing constraints and similar does create incentive for corner cutting in various areas.


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sat Nov 11 08:37:00 2023
    In article <2023Nov11.082221@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Let's see what this SiFive U74 does:
    ...

    So apparently an unaligned load triggers >500 additional instructions, confirming your claim.

    Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
    least obvious. Slowdowns like this will be a major drag on performance,
    simply because finding them all is tricky.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Sat Nov 11 10:22:54 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.082221@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    So apparently an unaligned load triggers >500 additional instructions,
    confirming your claim.

    Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
    least obvious.

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned accesses,
    and then compiled by package maintainers (who often are not that
    familiar with the software) on a lot of platforms, the end result was
    that the kernel by default performed a fixup (and put a message in the
    dmesg buffer) instead of delivering a SIGBUS.

    There was a system call for switching to the SIGBUS behaviour. On
    Tru64 OSF/1 (or whatever it is called this week), the default
    behaviour was to SIGBUS, but it had the same system call, and a
    shell-level tool "uac" to change the behaviour to fix it up. I
    implemented a tool "uace" for Linux that can be used for running a
    process with the SIGBUS behaviour that you desire: <https://www.complang.tuwien.ac.at/anton/uace.c>. Maybe something
    similar is possible on the U74.

    Anyway, it seems that the problems was not a big one on Linux-Alpha
    (messages about unaligned accesses were not that frequent).
    Apparently the large majority of code performs aligned accesses. It's
    just that there are a few unaligned ones.

    I would not worry about cores like the U74 (and I have a program that
    uses unaligned accesses for hashing); that's just a stepping stone for
    getting more capable RISC-V cores, and at some point (before RISC-V
    becomes mainstream) the trapping will be replaced with something more efficient.

    We have seen the same development on AMD64. The Penryn
    (second-generation Core 2) takes 159 cycles for an unaligned load that
    crosses a page boundary, the Sandy Bridge takes 28 <http://al.howardknight.net/?ID=143135464800>. The Sandy Bridge and
    Ivy Bridge take 200 cycles for an unaligned page-crossing store,
    Haswell and Skylake take 25 and 24.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Nov 11 11:11:46 2023
    BGB <cr88192@gmail.com> writes:
    On 11/11/2023 1:22 AM, Anton Ertl wrote:
    Hashing


    Possibly true.

    Definitely true: The data you want to hash may be aligned to byte
    boundaries (e.g., strings), but a fast hash function loads it at the
    largest granularity possible and also processes the loaded values at
    the largest granularity possible.

    And in contrast to block copying, where you can do some prelude, then
    perform aligned accesses, and then a postlude (at least on one side of
    the copying), for this kind of hashing you want to have, in the first
    step, the first n bytes in a register, because the first byte
    influences the hash function result differently than the second byte.

    What you could do is load aligned into a shift buffer (in a register),
    and then use something like AMD64's shld to get the data in the needed
    form. Same for the second side of block copying. But is this faster
    on modern CPUs?
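
    (For reference, a minimal C sketch of the aligned-load-plus-shift approach
    being described, assuming a little-endian target; note it may read up to 7
    bytes past the requested range, so the caller must ensure the containing
    aligned words are readable.)

    #include <stdint.h>

    static uint64_t load64_via_aligned(const unsigned char *p)
    {
        uintptr_t a = (uintptr_t)p, off = a & 7;
        const uint64_t *q = (const uint64_t *)(a - off);
        uint64_t lo = q[0];
        if(off == 0)
            return lo;
        uint64_t hi = q[1];                     /* second aligned word */
        return (lo >> (8*off)) | (hi << (64 - 8*off));
    }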

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sat Nov 11 16:53:00 2023
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug. However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment enforcement turned on, and it's a huge job.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Sat Nov 11 18:11:04 2023
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions. So
    an alternative is to break the requirement that all register specifier
    fields in the instruction be the same length. So, for example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.
    <
    access to all registers from one source operand position, but say only
    half from the other source operand position. So, for a system with 32 registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
    as with commutative operations like adds, this doesn't hurt at all.

    Yes, this makes register allocation in the compiler harder. And
    occasionally you might need an extra instruction to copy a value to the
    half size field, but on high end systems, this can be done in the rename stage without taking an execution slot.

    A more extreme alternative is to only allow the destination field to
    also be one bit smaller. Of course, this makes things even harder for
    the compiler, and probably requires extra "copy" instructions more frequently, but sometimes you just gotta do what you gotta do. :-(

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Anton Ertl on Sat Nov 11 11:30:19 2023
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important for a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of an L2
    cache line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.
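
    (A minimal C11 sketch of those two steps, assuming a 64-byte line size:
    each element is padded out to a full line and the array is line-aligned,
    so no two elements ever share a cache line.)

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumption; query the target if it matters */

    typedef struct {
        alignas(CACHE_LINE) uint64_t counter;       /* the actual payload     */
        char pad[CACHE_LINE - sizeof(uint64_t)];    /* pad element to one line */
    } padded_counter;

    static padded_counter per_thread[16];   /* one line per slot, no false sharing */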

    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB-Alt@21:1/5 to MitchAlsup on Sat Nov 11 14:33:20 2023
    On 11/11/2023 12:11 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.
    <

    Or, a similar role is served by my Jumbo-Op64 prefix.

    So, there are two different Jumbo prefixes:
    Jumbo-Imm, which mostly just makes the immed/disp field bigger;
    Jumbo-Op64, which mostly extends the opcode and other things;
    May extend immediate, but less so, and that is not its main purpose.

    Op64 also does, optionally:
    Being the original mechanism to address R32..R63, before XGPR and XG2
    encodings were added, and needed (in Baseline) for the parts of the ISA
    not covered by the XGPR encodings;
    Adds a potential 4th register, extra displacement (or smaller Immed
    extension), or rounding-mode / opcode bits (depending on the base
    instruction).

    As-is, 8 bits in the Op64 prefix are Must Be Zero; they are
    designated specifically towards expanding the opcode space (with the 00
    case designated as mapping to the same instruction as in the basic
    32-bit encoding).


    access to all registers from one source operand position, but say only
    half from the other source operand position.  So, for a system with 32
    registers, you would need 5 plus 5 plus 4 bits.  Much of the time,
    such as with commutative operations like adds, this doesn't hurt at all.

    Yes, this makes register allocation in the compiler harder.  And
    occasionally you might need an extra instruction to copy a value to
    the half size field, but on high end systems, this can be done in the
    rename stage without taking an execution slot.

    A more extreme alternative is to only allow the destination field to
    also be one bit smaller.  Of course, this makes things even harder for
    the compiler, and probably requires extra "copy" instructions more
    frequently, but sometimes you just gotta do what you gotta do. :-(

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Sat Nov 11 21:28:05 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug.
    However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
    it only affects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Chris M. Thomasson on Sat Nov 11 21:22:00 2023
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For elements smaller than a cache line, that makes little
    sense, as written. I think there is an unwritten assumption
    "for elements larger than a cache line" there, or we would all
    be using 64-byte bools.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Nov 11 22:53:09 2023
    Chris M. Thomasson wrote:


    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.
    <
    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned container. Only aligned containers possess ATOMIC-smelling properties.
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Thomas Koenig on Sat Nov 11 14:23:51 2023
    On 11/11/2023 1:22 PM, Thomas Koenig wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache
    line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For smaller elements smaller than a cache line, that makes little
    sense. as written. I think there is an unwritten assumption
    "for elements larger than cache line" there, or we would all
    be using 64-byte bools.

    :^). Basically, I am thinking along the lines of cache line allocators
    that return properly aligned and padded l2 lines. Aligning and padding
    on l2 lines helps get rid of any nasty false sharing. Remember those
    damn hyperthreaded intel processors that had 128 byte l2 lines, but
    could falsely share the low 64 bytes with the high 64 bytes? Iirc, Intel
    had a workaround that involved offsetting a thread's stack using alloca.
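
    A toy sketch of such an allocator (a 64-byte line size and C++17's
    std::aligned_alloc are assumed here; a real allocator would of course
    batch these rather than call malloc-style routines per line):

        #include <cstddef>
        #include <cstdlib>
        #include <cstring>

        constexpr std::size_t kLine = 64;   // assumed L2 line size

        // Hand out one whole, zeroed, line-aligned cache line per request, so
        // the caller's object can never falsely share a line with a neighbor.
        void *alloc_line()
        {
            void *p = std::aligned_alloc(kLine, kLine);
            if (p) std::memset(p, 0, kLine);
            return p;
        }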

    Also, see what happens if you straddle an l2 cache line and use it for a
    LOCK'ed atomic RMW on Intel. It just might assert a bus lock.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Thomas Koenig on Sat Nov 11 14:28:48 2023
    On 11/11/2023 1:22 PM, Thomas Koenig wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache
    line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For smaller elements smaller than a cache line, that makes little
    sense. as written. I think there is an unwritten assumption
    "for elements larger than cache line" there, or we would all
    be using 64-byte bools.

    Also, think about the atomic state for a mutex. Say:

    <pseudo-code>

    struct mutex_atomic_state
    {
    std::atomic<word> m_state;
    };

    Well, you want this state to be aligned on a cache line boundary and
    padded up to the size of a cache line. You want to avoid false sharing
    between this state and any user state used in the locked region.
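
    A hedged C++ rendering of that idea (64-byte line size assumed, struct
    name taken from the pseudo-code above):

        #include <atomic>
        #include <cstdint>

        struct alignas(64) mutex_atomic_state
        {
            std::atomic<std::uint32_t> m_state{0};
            // alignas(64) pads sizeof up to a full line, so nothing touched
            // inside the locked region can share this line with m_state.
        };
        static_assert(sizeof(mutex_atomic_state) == 64,
                      "state owns its own cache line");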

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sat Nov 11 22:58:32 2023
    Thomas Koenig wrote:

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:


    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache
    line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For smaller elements smaller than a cache line, that makes little
    sense. as written. I think there is an unwritten assumption
    "for elements larger than cache line" there, or we would all
    be using 64-byte bools.
    <
    Then consider a 4-way banked cache (¼ cache line per bank) and an access
    that straddles a ¼ line boundary and multiple AGEN units. So one AGEN
    unit creates the access to the container which straddles the boundary
    while another creates an access into the second part of the spanning
    access.
    <
    Then consider that "program order" information is not instantaneously available, and the bank selector picks the second access. Now, that
    spanning access is no longer ATOMIC, and might even see a Snoop between
    its first access and its spanning access...............
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lurndal on Sat Nov 11 22:47:00 2023
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:

    I dug into what it would take to have x86-64 Linux work with
    alignment enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run with alignment trapping. Should it be EL0 for usermode?

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Nov 12 10:34:04 2023
    According to Quadibloc <quadibloc@servername.invalid>:
    What do you consider the virtues of Itanium to be?

    Well, I think that superscalar operation of microprocessors is a good
    thing. Explicitly indicating which instructions may execute in parallel
    is one way to facilitate that. Even if the Itanium was an unsuccessful
    implementation of that principle.

    I knew the people at Yale who invented trace scheduling and started Multiflow.

    It was and is a very clever technique for the kind of computers we could build
    in the 1980s. It works really well for programs with regular memory access patterns, not so well for programs without. Once we could build enough transistors to do dynamic memory and instruction scheduling, why try to
    do it at compile time?

    I gather it is still useful for embedded or realtime applications which
    are fairly regular and for cost or power reasons you want to minimize
    the number of transistors.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Sun Nov 12 13:59:06 2023
    John Levine <johnl@taugh.com> writes:
    I gather it is still useful for embedded or realtime applications which
    are fairly regular and for cost or power reasons you want to minimize
    the number of transistors.

    Even there, VLIW-inspired CPUs like the Philips Trimedia have been
    terminated, and I have not heard much about TI's C6000 lately. Both NXP
    (spun off from Philips) and TI seem to bet heavily on ARM.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Sun Nov 12 14:08:11 2023
    jgd@cix.co.uk (John Dallman) writes:
    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    I did a first attempt in the IA-32 days, and there I found that the
    alignment requirements of the hardware were incompatible with the ABI
    (which required 4-byte alignment for 8-byte FP numbers).

    My second attempt was with AMD64, and there I found that gcc produced misaligned 16-bit memory accesses for stuff like strcpy(buf, "a"). I
    did not try to disable this with a flag at the time, but maybe -fno-tree-vectorize would help. But even if I use that for my code, I
    would also have to recompile all the libraries with that flag.

    Another problem (on both platforms) were memcpy, memmove, etc., but I
    expected that one could link with alignment-clean versions. But I
    don't know how many functions are affected.

    I would be surprised if ARM A64 did not have the same problems (except
    the idiotic incompatibility between Intel ABI and Intel hardware).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Nov 12 14:54:56 2023
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    jgd@cix.co.uk (John Dallman) writes:
    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    I did a first attempt in the IA-32 days, and there I found that the
    alignment requirements of the hardware were incompatible with the ABI
    (which required 4-byte alignment for 8-byte FP numbers).

    This is a very old problem. S/360 was the first byte addressed machine
    and required aligned operands. They immediately realized that Fortran
    programs that used COMMON or EQUIVALENCE often forced 8-byte FP onto
    4-byte boundaries. The Fortran library had a hack that caught the
    alignment fault and fixed it up very slowly. But they quickly dealt
    with it in hardware. The 360/85, which brought us caches, also had "byte-
    oriented operands", i.e. misaligned, and it was carried into all
    subsequent 370 and later machines.

    It makes some sense that they did so since caches greatly decrease the
    cost of misaligned operands.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sun Nov 12 16:24:00 2023
    In article <2023Nov12.150811@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    jgd@cix.co.uk (John Dallman) writes:
    I dug into what it would take to have x86-64 Linux work with
    alignment enforcement turned on, and it's a huge job.

    I did a first attempt in the IA-32 days, and there I found that the
    alignment requirements of the hardware were incompatible with the
    ABI (which required 4-byte alignment for 8-byte FP numbers).

    By the time I was running short of alignment-sensitive platforms, x86-64
    was well established, and 64-bit is preferable for this kind of
    bug-hunting since accidental correct alignment is rarer.

    My second attempt was with AMD64, and there I found that gcc
    produced misaligned 16-bit memory accesses for stuff like
    strcpy(buf, "a"). I did not try to disable this with a flag
    at the time, but maybe -fno-tree-vectorize would help. But
    even if I use that for my code, I would also have to recompile
    all the libraries with that flag.

    I reached similar conclusions, reckoning that I'd need to rebuild the
    Linux userland for the job, at minimum. An alternative is to wrap all
    calls to system libraries and turn alignment traps off and on there,
    which would be easier, given I have a well-defined set of software to
    test.

    I would be surprised if ARM A64 did not have the same problems
    (except the idiotic incompatibility between Intel ABI and Intel
    hardware).

    Yup. I have a lot more x86-64 hardware available, so it would be the
    choice, if I didn't have so many more urgent projects to do.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Sun Nov 12 17:21:51 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:

    I dug into what it would take to have x86-64 Linux work with
    alignment enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run with
    alignment trapping. Should it be EL0 for usermode?

    The EL1 in the register name describes the minimum exception level
    allowed to access the register. SCTLR_EL1 includes control bits
    for both EL1 and EL0.
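
    For what it's worth, a hedged sketch of what flipping that bit looks like
    from EL1 (kernel) code; plain EL0 application code cannot do this, which
    is exactly the difficulty discussed elsewhere in the thread:

        #include <cstdint>

        // EL1 only: enable alignment fault checking for the EL1&0 regime by
        // setting SCTLR_EL1.A (bit 1).
        static inline void enable_alignment_checks()
        {
            std::uint64_t sctlr;
            asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
            sctlr |= std::uint64_t{1} << 1;                  // the A bit
            asm volatile("msr sctlr_el1, %0" : : "r"(sctlr));
            asm volatile("isb");                             // synchronize
        }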

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lurndal on Sun Nov 12 17:40:00 2023
    In article <PQ74N.100$ayBd.39@fx07.iad>, scott@slp53.sl.home (Scott
    Lurndal) wrote:

    jgd@cix.co.uk (John Dallman) writes:
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:
    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run
    with alignment trapping. Should it be EL0 for usermode?

    The EL1 in the register name describes the minimum exception level
    allowed to access the register. SCTLR_EL1 includes control bits
    for both EL1 and EL0.

    Aha. It's harder for ARM64: I'd have to be in supervisor mode to set that
    bit, and the stuff I work on is strictly application code.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 12 17:27:36 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:


    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.
    <
    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
    container. Only aligned containers possess ATOMIC-smelling properties.
    <

    That is indeed the case. Consider the effect of a page fault when
    an unaligned access crosses a page boundary, for example; leaving
    aside, of course, all the difficulties inherent in dealing with
    atomicity when the access spans two cache lines.

    ARM implementations of LL/SC (Load Exclusive/Store Exclusive) can
    have an arbitrarily sized reservation granule (ARM's Cortex-M7,
    for example, has a single reservation granule the size of the
    full address space). Any store between the loadex and storex
    instructions is allowed by the architecture (V7 and V8) to cause
    the storex to fail.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Sun Nov 12 20:55:27 2023
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:

    Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.

    Unless, maybe, registers were being treated like a stack, but even then,
    this is still gonna suck.

    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.

    This led me to seriously reconsider the path down which I was
    heading.

    I had tried, with all sorts of ingenious compromises of register spaces and
    the like, to fit all the capabilities I wanted into the opcode space of a single version of the instruction set, eliminating the need for blocks
    which contained instructions belonging to alternate versions of the
    instruction set.

    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    At first, when I mulled over this, I came up with multiple ideas to address
    it, each one crazier than the last.

    Seeing, therefore, that this was a difficult nut to crack, and not wanting
    to go down in another wrong direction... instead, I found a way to go that seemed to me to be reasonably sensible.

    Go back to uncompromised 32-bit instructions, even though that means there
    are no 16-bit instructions.

    Then, bring back short instructions - effectively 17 bits long - so as to
    have room for full register specifications. This means an alternative block format where 16, 32, 48, 64... bit instructions are all possible.

    *But* because of the room 17-bit short instructions take up in the header,
    the 32-bit instructions are the same regular format as in the other case.
    Not some kind of 33-bit or 35-bit instruction with a new set of instruction formats.

    So, even though there are now two formats for code instead of one, one is merely the 32-bit subset of the other, so that although I have taken a step back in order to take steps forward, it still isn't too far back.

    I'm _trying_ to keep a lid on the extravagances in Concertina II, even if
    using the word "sanity" in the same breath with it may be considered inappropriate...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sun Nov 12 21:28:11 2023
    BGB wrote:

    On 11/9/2023 10:37 PM, Quadibloc wrote:


    For performance optimized cases, I am starting to suspect 16-bit ops are
    not worth it.
    <
    BINGO:: another near convert.......
    <
    For size optimization, they make sense; but size optimization also means mostly confining register allocation to R0..R15 in my case, with
    heuristics for when to enable additional registers, where enabling the
    higher registers effectively hinders the use of 16-bit instructions.


    The other option I have found is that, rather than optimizing for
    smaller instructions (as in an ISA with 16 bit instructions), one can
    instead optimize for doing stuff in as few instructions as it is
    reasonable to do so, which in turn further goes against the use of
    16-bit instructions.
    <
    This is the My 66000 path: execute fewer instructions even if they take
    the same number of bytes in .text.
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Nov 12 21:25:20 2023
    Quadibloc wrote:

    On Fri, 10 Nov 2023 01:11:13 +0000, MitchAlsup wrote:

    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.

    Since I have a complete set of memory-reference instructions for which unaligned memory-reference instructions are supported, the problem isn't
    that I think unaligned fetches and stores take too many gates.

    Rather, being able to only specify aligned accesses saves *opcode space*,
    <
    I am not buying this. Which takes more opcode space::
    a) an ISA with unaligned only LDs and STs (11)
    or
    b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs (another 11)
    <
    It is a simple entropy (allocated counting) problem
    <
    which lets me fit in one complete set of memory-reference instructions that can use all the base registers, all the index registers, and always use all the registers as destination registers.

    While the unaligned-capable instructions, that offer also important additional addressing modes, had to have certain restrictions to fit in.

    So they use six out of the seven index registers, they can use only half
    the registers as destination registers on indexed accesses, and they use
    four out of the seven base registers.

    Having 16-bit instructions for the possibility of more compact code meant that I had to have at least one of the two restrictions noted above -
    having both restrictions meant that I could offer the alternative of aligned-only instructions with neither restriction, which may be far less painful for some.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sun Nov 12 21:35:11 2023
    BGB wrote:

    On 11/10/2023 8:51 AM, Scott Lurndal wrote:


    As for register arguments:
    * Probably 8 or 16.
    ** 8 makes the most sense with 32 GPRs.
    *** 16 is asking too much.
    *** 8 deals with around 98% of functions.
    ** 16 makes sense with 64 GPRs.
    *** Nearly all functions can use exclusively register arguments.
    *** Gain is small though, if it only benefits 2% of functions.
    *** It is almost a "shoe in", except for cost of fixed spill space
    *** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
    *** Though, an ABI could decide to not have a spill space in this way.
    <
    For the reasons stated above (some clipped) I agree with this whole block of statements.
    <
    Since My 66000 has 32 registers, I went with up to 8 arguments in registers,
    up to 8 results in registers, with the 9th of either on-the-stack in such a
    way that if the callee is vararg the argument registers can be pushed on the
    stack to form a memory-resident vector of arguments {{just perfect for printf().}}
    <
    With 8 registers covering the 98%-ile of calls, there is too little left
    to gain by making this boundary 12-16, both of which ARE still possible.
    <
    Though, admittedly, for a lot of my programs I had still ended up going
    with 8 register arguments with 64 GPRs, mostly as the gains of 16
    arguments is small, relative of the cost of spending an additional 64
    bytes in nearly every stack frame (and also there are still some
    unresolved bugs when using 16 argument mode).
    <
    It is a delicate balance and it is easy to make the code look better
    while actually running slower.
    <
    ....



    Current leaning is also that:
    32-bit primary instruction size;
    32/64/96 bit for variable-length instructions;
    Is "pretty good".

    In performance-oriented use cases, 16-bit encodings "aren't really worth
    it".
    In cases where you need a 32 or 64 bit value, being able to encode them
    or load them quickly into a register is ideal. Spending multiple
    instructions to glue a value together isn't ideal, nor is needing to
    load it from memory (this particularly sucks from the compiler POV).


    As for addressing modes:
    (Rb, Disp) : ~ 66-75%
    (Rb, Ri) : ~ 25-33%
    Can address the vast majority of cases.

    Displacements are most effective when scaled by the size of the element
    type, as unaligned displacements are exceedingly rare. The vast majority
    of displacements are also positive.

    Not having a register-indexed mode is shooting oneself in the foot, as
    these are "not exactly rare".

    Most other possible addressing modes can be mostly ignored.
    Auto-increment becomes moot if one has superscalar or VLIW;
    (Rb, Ri, Disp) is only really applicable in niche cases
    Eg, array inside struct, etc.
    ...



    RISC-V did sort of shoot itself in the foot in several of these areas,
    albeit with some workarounds in "Bitmanip":
    SHnADD, can mimic a LEA, allowing array access in fewer ops.
    PACK, allows an inline 64-bit constant load in 5 instructions...
    LUI+ADD+LUI+ADD+PACK
    ...

    Still not ideal...

    An extra cycle for memory access is not ideal for a close second place addressing mode; nor are 64-bit constants rare enough that one
    necessarily wants to spend 5 or so clock cycles on them.

    But, still better than the situation where one does not have these instructions.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sun Nov 12 21:37:39 2023
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <
    I have not argued for aligned memory references since about 2000 (maybe as early as 1991).
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to Scott Lurndal on Sun Nov 12 22:18:31 2023
    In article <FlS3N.25739$_Oab.3565@fx15.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug.
    However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
    it only affects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    On Aarch64, with GCC at least, you also need to specify "-mstrict-align"
    when compiling all source code, to prevent the compiler from assuming it
    can access structure fields in an unaligned way, even if all of your
    code accesses are fully aligned. GCC can mess around behind your back, changing ptr->array32[1] = 0 and ptr->array32[2] = 0 into a single
    64-bit write of ptr->array32[1] = 0, among other things. If the offset
    of array32[1] wasn't 64-bit aligned, it's an alignment trap if
    SCTLR_EL1.A=1.
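
    A small illustration of the kind of code being described (the struct
    layout here is hypothetical, chosen so that array32[1] is 4-byte aligned
    but not necessarily 8-byte aligned):

        #include <cstdint>

        struct Frame {
            std::uint32_t array32[4];   // array32[1] is at offset 4
        };

        void clear_pair(Frame *ptr)
        {
            // Without -mstrict-align, GCC may fuse these into one 64-bit
            // store at &ptr->array32[1], which is not guaranteed to be
            // 8-byte aligned; with SCTLR_EL1.A=1 that merged store faults.
            ptr->array32[1] = 0;
            ptr->array32[2] = 0;
        }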

    On all Arm systems, Device memory accesses must always be aligned. User code
    in general does not get access to Device memory, so this does not affect regular users.

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Sun Nov 12 22:09:24 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.
    ...
    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.
    ...
    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    It works for the RISC-V C (compressed) extension. Some of these
    compressed instructions use registers 8-15 (others use all 32
    registers, but have other restrictions). But it works fine exactly
    because, if your register usage does not fit the limitations of the
    16-bit encoding, you just use the 32-bit version of the instruction.
    It seems that they designed the ABI such that registers 8-15 occur
    often in the code. Maybe the gcc maintainer also put some work into
    preferring these registers.

    OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
    instruction sets with their A32/T32 instruction set(s), designed their
    A64 instruction set to strictly use 32-bit instructions.

    So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
    32-bit instructions, why make your task harder by also implementing
    short instructions? Of course, if that is your goal or you have fun
    with this, why not? But if you want to make progress, it seems to be
    something that can be skipped.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Kent Dickey on Mon Nov 13 00:09:00 2023
    Kent Dickey wrote:

    In article <FlS3N.25739$_Oab.3565@fx15.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug.
    However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
    it only affects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    On Aarch64, with GCC at least, you also need to specify "-mstrict-align"
    when compiling all source code, to prevent the compiler from assuming it
    can access structure fields in an unaligned way, even if all of your
    code accesses are fully aligned. GCC can mess around behind your back, changing ptr->array32[1] = 0 and ptr->array32[2] = 0 into a single
    64-bit write of ptr->array32[1] = 0, among other things. If the offset
    of array32[1] wasn't 64-bit aligned, it's an alignment trap if
    SCTLR_EL1.A=1.

    On all Arm system, Device memory accesses must always be aligned. User code in general does not get access to Device memory, so this does not affect regular users.
    <
    For all the same reasons one does not do misaligned accesses to ATOMIC
    memory locations, one does not do misaligned accesses to device control registers.
    <
    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Nov 12 23:15:43 2023
    On Sun, 12 Nov 2023 21:25:20 +0000, MitchAlsup wrote:

    I am not buying this. Which takes more opcode space::
    a) an ISA with unaligned only LDs and STs (11)
    or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
    (another 11)

    That is true, *other things being equal*.

    However, what I had was:

    An ISA with unaligned loads and stores, that could use all 32 destination registers, and all 8 index and base registers. (Call this A)

    That took up too much opcode space to allow 16-bit instructions.

    So I made various compromises to shave one bit off the loads and stores,
    and then I could have 16 bit instructions. (Call this B)

    But I didn't like the compromises.

    So I made _more_ compromises, to shave _another_ bit off the loads and
    stores. This way, I had enough opcode space to add aligned-only loads
    and stores... that could use all 32 destination registers, and all 8
    index and base registers. (Call this C)

    Since other things _were not equal_, it was perfectly possible for C
    to use less opcode space than A, and about the same amount of opcode
    space as B. So I got to use 16-bit instructions AND have a set of loads
    and stores that used all 32 destination registers, and all 8 index and
    base registers.

    The compromises on the _unaligned_ loads and stores were painful, but
    they were chosen so that code using them wouldn't have to be
    significantly less efficient than code with the set of loads and stores
    in A.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Mon Nov 13 00:10:44 2023
    Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.
    ....
    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.
    ....
    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    It works for the RISC-V C (compressed) extension. Some of these
    compressed instrutions use registers 8-15 (others use all 32
    registers, but have other restrictions). But it works fine exactly
    because, if your register usage does not fit the limitations of the
    16-bit encoding, you just use the 32-bit version of the instruction.
    It seems that they designed the ABI such that registers 8-15 occur
    often in the code. Maybe the gcc maintainer also put some work into preferring these registers.

    OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
    instruction sets with their A32/T32 instruction set(s), designed their
    A64 instruction set to strictly use 32-bit instructions.

    So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
    32-bit instructions, why make your task harder by also implementing
    short instructions? Of course, if that is your goal or you have fun
    with this, why not? But if you want to make progress, it seems to be something that can be skipped.
    <
    Sound
    <
    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Nov 13 00:16:24 2023
    Quadibloc wrote:

    On Sun, 12 Nov 2023 21:25:20 +0000, MitchAlsup wrote:

    I am not buying this. Which takes more opcode space::
    a) an ISA with unaligned only LDs and STs (11)
    or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
    (another 11)

    That is true, *other things being equal*.

    However, what I had was:
    <
    A poorly chosen starting point (dark alley)
    <
    An ISA with unaligned loads and stores, that could use all 32 destination registers, and all 8 index and base registers. (Call this A)

    That took up too much opcode space to allow 16-bit instructions.

    So I made various compromises to shave one bit off the loads and stores,
    and then I could have 16 bit instructions. (Call this B)

    But I didn't like the compromises.
    <
    Captain Obvious to the rescue::
    <
    So I made _more_ compromises, to shave _another_ bit off the loads and stores. This way, I had enough opcode space to add aligned-only loads
    and stores... that could use all 32 destination registers, and all 8
    index and base registers. (Call this C)
    <
    Back out of the dark alley, and start from first principles again.
    <
    Since other things _were not equal_, it was perfectly possible for C
    to use less opcode space than A, and about the same amount of opcode
    space as B. So I got to use 16-bit instructions AND have a set of loads
    and stores that used all 32 destnation registers, and all 8 index and
    base registers.
    <
    Maybe "less opcode space" if you count bits, but it is "more opcode space" if/when you enumerate all the opcodes within the space.
    <
    The compromises on the _unaligned_ loads and stores were painful, but
    they were chosen so that code using them wouldn't have to be be
    significantly less efficient than code with the set of loads and stores
    in A.
    <
    Does your compiler agree with this assertion ??
    <
    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Nov 13 00:54:49 2023
    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    Does you compiler agree with this assertion ??

    As I'm still only in the early stages of roughing out
    the bare outlines of an ISA, I have not yet built such
    advanced diagnostic tools, I must admit.

    However, my original compromise had been to reduce
    the number of index registers used with memory-reference
    instructions to 3 from 7.

    The two improved compromises I used in this later effort
    were:

    Compromise 1:

    Reduce the number of base registers used with memory-reference
    instructions (when using a 16-bit displacement) to 3 from 7.

    I figured that _this_ was far less likely to reduce efficiency,
    since normally not that many base registers were used in any
    case.

    Compromise 2:

    When an instruction is not indexed, reduce the size of the index
    register field to two bits, both containing 0.

    When an instruction is indexed, reduce the size of the destination
    register field to 4 bits from 5, thus allowing only 16 of the 32
    registers to be used with indexed memory accesses.

    This one is more painful, but it had historical precedent. One
    consequence is that the number of index registers is reduced, to
    six from 7, because now index register 4 "looks like zero".

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Sun Nov 12 19:28:51 2023
    On 11/12/2023 3:37 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <
    I have not argued for aligned memory references since about 2000 (maybe as early as 1991).
    <

    Makes sense, but I was confused as to what was being argued here...


    I prefer unaligned memory access, since it allows a lot of nifty stuff
    to be done.

    But, I can note that the main drawback it has is in terms of requiring a
    more expensive L1 cache.

    Aligned-only cache only needs:
    A single row of cache-lines
    To check a single address for hit/miss;
    Can use a simpler set of MUX'es for extract/insert.

    Vs, say:
    Two rows of cache lines (say, even and odd);
    Needs to check two addresses;
    More complicated extract/insert logic.


    But, say, if one needs to operate within the limits of an aligned-only
    cache, then even something like an LZ4 decompressor is painfully slow,
    as it has to basically do damn near everything 1 byte at a time (or, at
    least, more so than it does already).
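
    For a rough idea of what aligned-only access forces on something like an
    LZ decompressor, here is an unaligned 32-bit load pieced together from
    two aligned word loads (little-endian assumed; note it can read up to 3
    bytes past the end of the source, which real code has to account for):

        #include <cstdint>

        static std::uint32_t load_u32_unaligned(const std::uint8_t *p)
        {
            std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
            const std::uint32_t *w =
                reinterpret_cast<const std::uint32_t *>(a & ~std::uintptr_t{3});
            unsigned shift = static_cast<unsigned>(a & 3) * 8;
            if (shift == 0)
                return w[0];                                  // already aligned
            return (w[0] >> shift) | (w[1] << (32 - shift));  // splice words
        }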


    I once did have a compressor (FeLZ32) more designed for the constraints
    of the SuperH ISA (and aligned-only memory access), but its main
    "feature" was that pretty much everything was defined in terms of 32-bit
    words (it was not copying bytes, rather, 32 bit words, and the encoded
    stream was itself an array of 32-bit words).

    It also managed to beat out LZ4's performance by a fair margin on the Piledriver I was using at the time.

    But, this performance advantage effectively evaporated on my Ryzen
    (where LZ4 speed increased significantly), and was also mostly N/A on
    BJX2. In this case, the byte-oriented formats were more preferable as
    they got better compression.

    Like, a lot of the performance tricks I had developed on the Piledriver
    were effectively rendered moot.

    Though, some amount of the tricks were mostly workarounds for "things
    that were slow", which the newer CPU had made effectively unnecessary or counterproductive.


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Nov 13 02:44:57 2023
    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    A poorly chosen starting point (dark alley)

    Back out of the dark alley, and start from first principles again.

    By the way, I think you mean a _blind_ alley.

    A dark alley is just a dangerous place, since robbers can attack you
    there without being seen.

    A _blind_ alley is one that had no exit, one that is a dead end. That
    seems to better fit the context of your remarks.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Nov 13 03:06:03 2023
    Quadibloc wrote:

    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    A poorly chosen starting point (dark alley)

    Back out of the dark alley, and start from first principles again.

    By the way, I think you mean a _blind_ alley.

    A dark alley is just a dangerous place, since robbers can attack you
    there without being seen.

    A _blind_ alley is one that has no exit, one that is a dead end. That
    seems to better fit the context of your remarks.
    <
    based on our definitions I definitively meant dark as in dangerous as
    opposed to no way out except backwards.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Sun Nov 12 20:21:13 2023
    On 11/12/2023 3:35 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/10/2023 8:51 AM, Scott Lurndal wrote:


    As for register arguments:
    * Probably 8 or 16.
    ** 8 makes the most sense with 32 GPRs.
    *** 16 is asking too much.
    *** 8 deals with around 98% of functions.
    ** 16 makes sense with 64 GPRs.
    *** Nearly all functions can use exclusively register arguments.
    *** Gain is small though, if it only benefits 2% of functions.
    *** It is almost a "shoe in", except for cost of fixed spill space
    *** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
    *** Though, an ABI could decide to not have a spill space in this way.
    <
    For the reasons stated above (some clipped) I agree with this whole
    block of statements.
    <
    Since My 66000 has 32 registers, I went with upto 8 arguments in registers, upto 8 results in registers, with the 9th of either on-the-stack in such a way that if the callee is vararg the argument registers can be pushed on
    the
    stack to form a memory resident vector of arguments {{just perfect for printf().}}
    <
    With 8 registers covering 98%-ile of calls, there is too little left by making this boundary 12-16 both of which ARE still possible.
    <

    Yeah.

    Short of things like using 128-bit pointers, or lots of 128-bit
    arguments (with an ABI that expresses these in pairs), the 8 argument
    ABI seems to be slightly ahead here (even with 64 registers).


    Mostly, because 2% of functions needing to use memory arguments seems to
    cost less than the indirect cost of every other non-leaf function
    needing to reserve an extra 64 bytes in the stack frame.

    Had considered a possible ABI tweak where functions that only call other
    functions with fewer than 8 register arguments (likely excluding vararg)
    would only need to reserve space for the first 8 arguments.

    But, the gains are likely to be rather small compared to the added
    debugging effort.


    Though, admittedly, for a lot of my programs I had still ended up
    going with 8 register arguments with 64 GPRs, mostly as the gains of
    16 arguments is small, relative of the cost of spending an additional
    64 bytes in nearly every stack frame (and also there are still some
    unresolved bugs when using 16 argument mode).
    <
    It is a delicate balance and it is easy to make the code look better
    while actually running slower.
    <

    Yeah.

    I suspect it is likely due mostly to something like L1 cache misses or
    similar (bigger stack frame, more area for the L1 cache to miss).


    OTOH: Had recently added the logic to shuffle prolog register-stores in
    an attempt to reduce WAW stalls. Turned out, fully aligning stuff would
    be a much bigger pain than initially hoped (the curse of multiple cases
    of duplicated logic that needs to operate in lockstep).


    Did come up with an intermediate option (a rough sketch follows below):
      Generate a temporary array of which registers are saved at which offsets;
      Generate a permutation array for which order to store these registers;
      Initial permutation uses simple XOR shuffling;
      Have a function to model the WAW cost of each permutation;
      Shuffle the permutations with a PRNG (up to N times);
      Pick the permutation with the smallest WAW cost.

    Mostly works OK, but granted, nearly any ordering is better at this
    metric than saving them in a linear order.

    Though, doesn't really gain much if the forwarding option is enabled.
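
    A very loose sketch of the shuffle-and-score idea listed above (the cost
    model, the XOR seed, and the names are stand-ins rather than BGBCC's
    actual logic; n is assumed even and at most 64):

        #include <cstdlib>

        // Score an ordering: penalize back-to-back stores to adjacent save
        // slots, as a crude stand-in for the real WAW cost model.
        static int waw_cost(const int *order, int n)
        {
            int cost = 0;
            for (int i = 1; i < n; i++) {
                int d = order[i] - order[i - 1];
                if (d == 1 || d == -1)
                    cost++;
            }
            return cost;
        }

        // Keep the best ordering found after 'tries' random pair swaps.
        static void pick_store_order(int *best, int n, int tries)
        {
            int cur[64];
            for (int i = 0; i < n; i++)
                best[i] = cur[i] = i ^ 1;   // simple XOR-style initial shuffle
            int best_cost = waw_cost(best, n);

            for (int t = 0; t < tries; t++) {
                int a = std::rand() % n, b = std::rand() % n;
                int tmp = cur[a]; cur[a] = cur[b]; cur[b] = tmp;
                int c = waw_cost(cur, n);
                if (c < best_cost) {
                    best_cost = c;
                    for (int i = 0; i < n; i++) best[i] = cur[i];
                }
            }
        }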



    Relatedly, was also able to make Doom a little faster with another trick: Instead of drawing into an off-screen buffer, and then copying this to
    the screen in the form of a DIB Bitmap object...

    There can be functions to request and release framebuffers for a given Drawing-Context (with a supplied BITMAPINFOHEADER; this request failing
    and returning NULL if the BITMAPINFOHEADER doesn't match the format used
    by the HDC or similar; forcing fallback to the older method).

    Similarly, there is a "SwapBuffers" style call, with these buffers
    effectively operating in a double-buffering style.

    In effect, it is an interface slightly more like what SDL uses.


    Was kind of a hassle to modify Doom to play well with double buffering
    though; initially it was a strobe-filled / flickering mess, with the
    status bar effectively having a seizure. Does still have the annoyance
    that when one noclips through a wall, then whatever garbage is left over
    is now prone to a strobe effect.

    However, using shared buffers and then having Doom draw into them, does
    reduce the amount of framebuffer copying needed for each screen update.


    As-is, will currently only work, though, in 320x200 hi-color mode (where
    biHeight==-200; a negative height indicates an origin in the
    top-left corner).


    However, the DIB drawing method does allow more flexibility here (the
    internal bitmap can be in a wider range of formats, and will be
    converted as needed).

    Granted, one can note that things like pixel format conversion and
    similar aren't free.



    Also recently encountered a video online where someone was running Doom
    on a 386, and, the framerates *sucked*... ( Like, mostly single-digit territory, and with somewhat longer load-times as well. )

    Can at least probably say, with reasonable confidence, that my BJX2 core
    is faster than a 386...

    Some other information implies that the speeds I am seeing are more
    on-par with a high-end 486 or maybe a low-end Pentium.

    ( Nevermind that Quake performance is still crap in my case... )

    ( Somehow, it seems like old computers were generally worse and less
    capable than my childhood self remembered. )




    Formats supported in DIB form at present:
    RGB555, RGB24, RGBA32, Indexed 1/2/4/8-bit, UTX2.

    Formats used by the display hardware:
    Color-Cell 8x8 as 4x 4x4x2bpp (2 endpoints per 4x4 cell);
    Color-Cell 8x8x1 (2 color endpoints).
    Also used for text-mode display.
    4x4x16bit RGB555
    4x4x8bit Indexed
    (New/Experimental) Linear RGB555 and Indexed 8-bit
    Framebuffer pixels now in a conventional linear raster ordering.
    Also, the framebuffer is now movable, allowing double-buffering.
    Framebuffer will require a 32 byte alignment though.
    And needs to be in a physically-mapped address range.


    Still don't have any "good" 256 color palettes:
    6*6*6 and 6*7*6 (216 and 252 color)
    Good for bright cartoony graphics, poor for much else.
    Generally loses any detail in things like shading.
    6*7*6 can't do grays effectively, only purple and green tints.
    16 shades of 16 colors
    Better "in general", obvious color distortion for cartoon images
    13 shades of 19 colors (*1)
    Slightly better than the previous
    Mostly cutting off "near black" for additional colors.
    Say: adding an Orange, Olive-Green, and Sky-Blue gradient.
    Don't need 48 colors of "almost black"...

    I don't know of any palette optimization algorithms that are fast enough
    to run in real-time on the BJX2 core (I suspect "in the old days",
    palette optimization was likely offline only).

    Granted, other palettes are possible, mostly just the difficulty of
    finding an organization that "looks good in the general case".

    *1:
    0z: Gray
    1z: Blue (High Sat)
    2z: Green (High Sat)
    3z: Cyan (High Sat)
    4z: Red (High Sat)
    5z: Magenta (High Sat)
    6z: Yellow (High Sat)
    7z: Pink (Off-White)
    8z: Beige (Off-White)
    9z: Blue (Low Sat)
    Az: Green (Low Sat)
    Bz: Cyan (Low Sat)
    Cz: Red (Low Sat)
    Dz: Magenta (Low Sat)
    Ez: Yellow (Low Sat)
    Fz: Sky Blue (Off-White)

    z0: Orange (Mid Sat)
    z1: Olive (Mid Sat)
    z2: Sky Blue (Mid Sat)

    00: Black
    01, 02: Very dark gray.
    10/11/12/20/21/22: Various other "nearly black" colors.
    Technically, the bottoms of the orange/olive/sky bars;
    But, these can effectively "merge" the other colors.

    In my fiddling, this was generally the "best performing" palette layout
    I could seem to find thus far.


    ....



    Current leaning is also that:
       32-bit primary instruction size;
       32/64/96 bit for variable-length instructions;
       Is "pretty good".

    In performance-oriented use cases, 16-bit encodings "aren't really
    worth it".
    In cases where you need a 32 or 64 bit value, being able to encode
    them or load them quickly into a register is ideal. Spending multiple
    instructions to glue a value together isn't ideal, nor is needing to
    load it from memory (this particularly sucks from the compiler POV).


    As for addressing modes:
       (Rb, Disp) : ~ 66-75%
       (Rb, Ri)   : ~ 25-33%
    Can address the vast majority of cases.

    Displacements are most effective when scaled by the size of the
    element type, as unaligned displacements are exceedingly rare. The
    vast majority of displacements are also positive.

    Not having a register-indexed mode is shooting oneself in the foot, as
    these are "not exactly rare".

    Most other possible addressing modes can be mostly ignored.
       Auto-increment becomes moot if one has superscalar or VLIW;
       (Rb, Ri, Disp) is only really applicable in niche cases
         Eg, array inside struct, etc.
       ...



    RISC-V did sort of shoot itself in the foot in several of these areas,
    albeit with some workarounds in "Bitmanip":
       SHnADD, can mimic a LEA, allowing array access in fewer ops.
       PACK, allows an inline 64-bit constant load in 5 instructions...
         LUI+ADD+LUI+ADD+PACK
       ...

    Still not ideal...

    An extra cycle for memory access is not ideal for a close second place
    addressing mode; nor are 64-bit constants rare enough that one
    necessarily wants to spend 5 or so clock cycles on them.

    But, still better than the situation where one does not have these
    instructions.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Mon Nov 13 16:10:20 2023
    MitchAlsup wrote:
    Chris M. Thomasson wrote:


    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.
    <
    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned container. Only aligned containers possess ATOMIC-smelling properties.

    This is so obviously correct that you should not have needed to mention
    it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
    updates is something that should only ever be done for testing purposes.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Mon Nov 13 14:44:15 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <PQ74N.100$ayBd.39@fx07.iad>, scott@slp53.sl.home (Scott
    Lurndal) wrote:

    jgd@cix.co.uk (John Dallman) writes:
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:
    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run
    with alignment trapping. Should it be EL0 for usermode?

    The EL1 in the register name describes the minimum exception level
    allowed to access the register. SCTLR_EL1 includes control bits
    for both EL1 and EL0.

    Aha. It's harder for ARM64: I'd have to be in supervisor mode to set that
    bit, and the stuff I work on is strictly application code.

    Unless the ELF flag trick is implemented. I haven't looked at the kernel
    with respect to that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Nov 13 11:46:47 2023
    That took up too much opcode space to allow 16-bit instructions.

    You might want to try and get fancy in your short instructions by
    "randomizing" the subset of registers they can access.

    E.g. allow both your short LD and ST instruction access 16 registers
    but not exactly the same 16.
    Or allow your arithmetic instructions to access only 8 registers for their input and output args but not exactly the same 8 for the two inputs
    and/or for the output.

    I suspect that if done well, it could give benefits similar to the skewed-associative caches. The other upside is that it makes register allocation *really* interesting, thus opening up opportunities to
    spend a few more years working on that subproblem :-)

    To up the ante, you could make the set of registers reachable from each instruction depend not just on the opcode but also on the instruction's address, so you can sometimes avoid a spill by swapping two
    instructions. This would allow the register allocation to interact in
    even more interesting ways with instruction scheduling.
    There could be a few more PhDs worth of research there.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Mon Nov 13 14:12:16 2023
    On 11/12/2023 4:09 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.
    ...
    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.
    ...
    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    It works for the RISC-V C (compressed) extension. Some of these
    compressed instructions use registers 8-15 (others use all 32
    registers, but have other restrictions). But it works fine exactly
    because, if your register usage does not fit the limitations of the
    16-bit encoding, you just use the 32-bit version of the instruction.
    It seems that they designed the ABI such that registers 8-15 occur
    often in the code. Maybe the gcc maintainer also put some work into preferring these registers.


    Yeah. They can be used by a compiler, and can make a difference for code-density.

    Just, it is more a case of, if one has a tradeoff of:
    Fewer instructions but more bytes;
    More instructions but fewer bytes.
    Then the former is better for performance.

    Things like reusing registers more aggressively and using a smaller
    subset of the registers, are good for making 16-bit instructions usable,
    but are less good for performance.

    ...



    Though, granted, one doesn't want to try to reserve too many registers
    (on an ISA with plenty of registers), as one may find that
    saving/restoring them costs more than is gained by having them
    available for use.

    Though, the partial workaround for this (in my case) was dividing the
    registers up into sub-groups, and using heuristics to enable these
    groups based on an estimate of the register pressure.

    Say:
    R8 ..R14: Always available, prioritized for size optimization ("/Os");
    R24..R31: Enabled as needed for "/Os", always enabled for perf opt.
    R40..R47: Enabled with high register pressure.
    R56..R63: Enabled with very high register pressure.

    Note:
    BGBCC's command-line accepts both "/Os" and "-Os" style arguments.
    "/Os": Size optimize
    "/O1": Moderate speed (try to balance speed and size)
    "/O2": Prioritize speed.
    "/Z*": Mostly debug related options (like "-g" in GCC)
    "/f*": Optional feature flags.
    "/m*": Selects target arch/profile.
    "/Fe*": Specify output binary (like "-o" in GCC)
    Else, it will try to guess an output file name.
    Eg: "foo.c" -> "foo.exe"
    ...

    It does try to guess whether the '/' is part of an option or the start
    of a filename. If it sees more than one '/', or sees a '.' or similar,
    without encountering an '=', assume it is a filename.
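
    (A small C sketch of that guess, as described; the function name and exact
    rules are invented and may differ from what BGBCC actually does:)

        #include <stdio.h>

        /* Heuristic: an argument starting with '/' is a filename if it contains
           another '/' or a '.' before any '=' is seen; otherwise it is an option. */
        static int arg_is_filename(const char *s)
        {
            int slashes = 0;
            for (const char *p = s; *p; p++)
            {
                if (*p == '=')   return 0;   /* '=' seen first: treat as an option     */
                if (*p == '.')   return 1;   /* '.' with no '=' yet: looks like a file */
                if (*p == '/' && ++slashes > 1)
                    return 1;                /* more than one '/': looks like a path   */
            }
            return 0;
        }

        int main(void)
        {
            printf("%d %d %d\n",             /* expect: 0 1 0 */
                arg_is_filename("/Os"),
                arg_is_filename("/usr/src/foo.c"),
                arg_is_filename("/Fe=foo.exe"));
            return 0;
        }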


    It is almost, but not quite, based on a count of the in-use variables.

    It helps to also apply a scale factor for each variable based on how
    deeply nested in a loop it is (so that if one has a lot of variables in
    use inside a deeply nested loop, the register pressure estimate will be
    higher than if most are used outside of a loop).

    Though, this scale-factor is nowhere near as severe as with the register allocation priority (where the nesting level was effectively raised to
    an exponent). For pressure estimates, one can use a gentler scale, more
    like, say: "scale=sqrt(deepest_nest_level+1.0);".


    For dynamically allocated variables in leaf blocks (basic block does not contain a function call), it may make sense to allocate them in scratch registers.

    Scratch registers are similar:
    R0..R1: Not used as GPRs by compiler;
    R2..R3: Designated scratch, not used for reg alloc.
    R4..R7: Always available;
    R16..R17: Designated scratch, not used for reg alloc.
    R18..R23: Available when R24..R31 are enabled (always for perf opt);
    R32..R39, R48..R55: Available under high register pressure.
    Always available, if these registers exist, when optimizing for performance.


    In performance optimized code, in my case, the spread of the registers
    is generally too dispersed to really make any sort of small sub-setting particularly effective.


    OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
    instruction sets with their A32/T32 instruction set(s), designed their
    A64 instruction set to strictly use 32-bit instructions.


    I guess it can also be noted, that 64-bit ARM went all-in with a lot of
    the sorts of features that RISC-V avoided. For example, it still has
    some more complex addressing modes, etc.

    I guess also they approached constants a little differently:
    You can load a 16-bit value into 1 of 4 positions within a register,
    with one of: zero fill, one fill, or keeping the prior contents.

    This allows loading an arbitrary constant in between 1 and 4 instructions.
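
    (A C model of that scheme; this is the MOVZ/MOVK style of constant build,
    and the instruction count below is a simplification that ignores the
    one-fill, MOVN-style path:)

        #include <stdint.h>
        #include <stdio.h>

        static uint64_t movz(unsigned imm16, int pos)                /* 16-bit chunk, zero fill */
        { return (uint64_t)imm16 << (16 * pos); }

        static uint64_t movk(uint64_t old, unsigned imm16, int pos)  /* 16-bit chunk, keep rest */
        { return (old & ~((uint64_t)0xFFFF << (16 * pos))) | ((uint64_t)imm16 << (16 * pos)); }

        static int count_insns(uint64_t v)   /* nonzero 16-bit chunks, minimum of 1 */
        {
            int n = 0;
            for (int pos = 0; pos < 4; pos++)
                if ((v >> (16 * pos)) & 0xFFFF)
                    n++;
            return n ? n : 1;
        }

        int main(void)
        {
            uint64_t x = movz(0xDEF0, 0);          /* between 1 and 4 instructions total */
            x = movk(x, 0x9ABC, 1);
            x = movk(x, 0x5678, 2);
            x = movk(x, 0x1234, 3);
            printf("%016llx takes %d insns\n", (unsigned long long)x, count_insns(x));
            return 0;
        }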



    Though, I did realize that with RISC-V's Bitmanip extensions, it is
    possible to get a 64-bit constant load down to 5 instructions, which is
    better than RV64I needing 6 (and in both cases, needing 2 registers).


    In BJX2, with Jumbo, it is 3 instruction words and 1 clock cycle.
    Without Jumbo, it is 4 instructions (albeit less flexible than the
    mechanism in ARM).


    So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
    32-bit instructions, why make your task harder by also implementing
    short instructions? Of course, if that is your goal or you have fun
    with this, why not? But if you want to make progress, it seems to be something that can be skipped.


    In my case, I am left with an awkward split in my ISA:
    Baseline Mode, which has both 16 and 32-bit instructions (and bigger);
    XG2, which is 32-bit (and bigger).


    Some of my newer design variants had leaned towards 32-bit instructions
    and 64 registers, mostly because the higher register count does help
    performance (at least, performance per clock; not so sure it helps with
    LUTs or timing constraints though, *).

    *: Mostly because the 5-bit LUTRAMs work with 3 bits of data, but the
    6-bit LUTRAMs only have 2 bits of data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Quadibloc on Mon Nov 13 13:58:06 2023
    On 11/12/2023 6:44 PM, Quadibloc wrote:
    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    A poorly chosen starting point (dark alley)

    Back out of the dark alley, and start from first principles again.

    By the way, I think you mean a _blind_ alley.

    A dark alley is just a dangerous place, since robbers can attack you
    there without being seen.

    Expose the darkness to the light, before any adventures...? ;^)


    A _blind_ alley is one that had no exit, one that is a dead end. That
    seems to better fit the context of your remarks.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Stefan Monnier on Tue Nov 14 14:54:32 2023
    On Mon, 13 Nov 2023 11:46:47 -0500, Stefan Monnier wrote:

    You might want to try and get fancy in your short instructions by "randomizing" the subset of registers they can access.

    E.g. allow both your short LD and ST instruction access 16 registers but
    not exactly the same 16.
    Or allow your arithmetic instructions to access only 8 registers for
    their input and output args but not exactly the same 8 for the two
    inputs and/or for the output.

    I suspect that if done well, it could give benefits similar to the skewed-associative caches. The other upside is that it makes register allocation *really* interesting, thus opening up opportunities to spend
    a few more years working on that subproblem :-)

    I would like to be able to say that this idea was too bizarre even for
    me.

    However, one of the ideas I toyed with before settling on my current
    iteration of Concertina II was to

    - drop the aligned memory-reference instructions
    - somehow squeeze the 32-bit operate instructions into the space left
    over by the byte instructions in the family
    - thereby doubling the space available for 16-bit instructions.

    The instruction slots of the form 0-0- would be as before: two instructions where both source and destination are in the same group of eight registers.

    The instruction slots of the form 0-1- would contain two 16-bit
    instructions where the source and destination registers are each
    four bits long, allowing (as in the indexed memory-reference
    instructions) the use of the first four registers in each of the four
    groups of eight registers.

    Thus, one instruction type uses all the registers, and the other
    allows transfers between the banks of eight registers.

    So, sadly, I actually *did* contemplate going there. Fortunately, I
    thought better of it.

    To up the ante, you could make the set of registers reachable from each instruction depend not just on the opcode but also on the instruction's address, so you can sometimes avoid a spill by swapping two
    instructions. This would allow the register allocation to interact in
    even more interesting ways with instruction scheduling.
    There could be a few more PhDs worth of research there.

    That would definitely be one trick to allow access to more registers than
    the number of opcode bits allows.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Wed Nov 15 10:38:56 2023
    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix. Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    (prefix versus register to register move) for one where you need to use
    it, but this idea might, by allowing the prefix to specify multiple instructions, save more than one extra "instruction". The only downside
    is it requires an additional op code.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Wed Nov 15 19:02:00 2023
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.
    <
    < Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    <
    Which is why I use the term instruction-modifier.
    <
    (prefix versus register to register move) for one where you need to use
    it, but this idea might, by allowing the prefix to specify multiple instructions, save more than one extra "instruction". The only downside
    is it requires an additional op code.
    <
    But by having an instruction-modifier that can add bits to several
    succeeding instructions, you can avoid cluttering up ISA with things
    like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode
    enumeration space not consume it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Wed Nov 15 11:58:25 2023
    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix
    instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all shortened register specifiers.

    I am not sure what you are proposing here. Can you show an example?




    < <                                         Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    <
    Which is why I use the term instruction-modifier.

    Agreed.


    <
    (prefix versus register to register move) for one where you need to
    use it, but this idea might, by allowing the prefix to specify
    multiple instructions, save more than one extra "instruction".  The
    only downside is it requires an additional op code.
    <
    But by having an instruction-modifier that can add bits to several
    succeeding instructions, you can avoid cluttering up ISA with things
    like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode enumeration space not consume it.

    In the general case, I certainly agree. But here you need a different
    op-code than CARRY, as this has different semantics, and I think the new instruction modifier has no other use, hence it is an additional op code
    versus the original proposal of using essentially a register copy
    instruction, which already exists (i.e. a load with a zero displacement
    and the source register as the address modifier).




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Wed Nov 15 21:10:52 2023
    Stephen Fuld wrote:

    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like >>>> CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix
    instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.

    I am not sure what you are proposing here. Can you show an example?

    Let us postulate a MoreBits instruction-modifier with a 16-bit immediate field. Now each 16-bit instruction, which has access to only 8 registers,
    strips off 2-bits/specifier, so now all its register specifiers are 5-bits.
    The immediate supplies the bits and as bits are stripped off the Decoder
    shifts the field down by the consumed bits. When the last bit has been
    stripped off you would need another MB immediate to supply those bits. Since
    only 16-bit instructions are "limited" one MB should last about a basic
    block or extended basic block.

    Note I don't care how the bits are apportioned, formatted, consumed, ...
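
    (A small C sketch of that decode, with an invented apportionment: each 3-bit
    specifier takes its 2 missing high bits from the low end of the pool, and the
    pool shifts down as bits are consumed:)

        #include <stdio.h>

        struct morebits { unsigned pool; int bits_left; };  /* 16-bit immediate acting as a bit pool */

        static int widen_specifier(struct morebits *mb, unsigned spec3)
        {
            unsigned hi2 = mb->pool & 3;          /* take 2 bits from the pool...           */
            mb->pool >>= 2;                       /* ...and shift the remainder down        */
            mb->bits_left -= 2;                   /* at 0, another MoreBits would be needed */
            return (int)((hi2 << 3) | spec3);     /* full 5-bit register number             */
        }

        int main(void)
        {
            struct morebits mb = { 0xB1E6, 16 };  /* pool contents are arbitrary here          */
            int rd = widen_specifier(&mb, 5);     /* e.g. a 16-bit op with specifiers 5 and 2  */
            int rs = widen_specifier(&mb, 2);
            printf("rd=r%d rs=r%d, %d pool bits left\n", rd, rs, mb.bits_left);
            return 0;
        }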

    <
    <                                         Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    <
    Which is why I use the term instruction-modifier.

    Agreed.


    <
    (prefix versus register to register move) for one where you need to
    use it, but this idea might, by allowing the prefix to specify
    multiple instructions, save more than one extra "instruction".  The
    only downside is it requires an additional op code.
    <
    But by having an instruction-modifier that can add bits to several
    succeeding instructions, you can avoid cluttering up ISA with things
    like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode
    enumeration space not consume it.

    In the general case, I certainly agree. But here you need a different op-code than CARRY, as this has different semantics, and I think the new instruction modifier has no other use, hence it is an additional op code versus the original proposal of using essentially a register copy instruction, which already exists (i.e. a load with a zero displacement
    and the source register as the address modifier).

    CARRY is your access to ALL extended precision calculations (saving 20+
    OpCodes when you consider a robust commercial ISA rather than an Academic
    ISA.) Carry accesses integer arithmetic, shifts, extracts, inserts, and
    exact floating point calculations larger than 64-bits including Kahan-Babuška summation. {{Not bad for 1 OpCode !!}}

    Similarly:: VEC-LOOP provides access to 1,000+ SIMD instructions and 400+
    Vector instructions at the cost of 2 units in the OpCode Space !! It also allows a future implementation to execute wider (or narrower) than SIMD
    with no change in the instruction sequence.

    MoreBits is effectively just like REX except it can span instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Mon Nov 20 09:31:11 2023
    On 11/15/2023 1:10 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the
    instructions. So an alternative is to break the requirement that
    all register specifier fields in the instruction be the same
    length.  So, for example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like >>>>> CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the
    prefix instruction specify which register to use instead of the one
    specified in the reduced register specifier for whichever
    instructions in its shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.

    I am not sure what you are proposing here.  Can you show an example?

    Let us postulate an MoreBits instruction-modifier with a 16-bit immediate field. Now each 16-bit instruction, that has access to only 8 registers, strips off 2-bits/specifier, so now all its register specifiers are 5-bits. The immediate supplies the bits and as bits are stripped off the Decoder shifts the field down by the consumed bits. When the last bit has been stripped off you would need another MB im to supply those bits. Since
    only 16-bit instructions are "limited" one MB should last about a basic
    block or extended basic block.

    Note I don't care how the bits are apportioned, formatted, consumed, ...

    Oh, so you have changed the meaning of the "immediate bit map" from
    specifying which of the following instructions it applies to (e.g.
    CARRY) to the actual data. I like it!

    If using 16 bit instructions, and if you only have one small register
    field per instruction, I think it is better to make "MoreBits" a 16 bit instruction modifier itself, with say a five bit op code and an eleven
    bit immediate, which supplies the extra bit for the next 11
    instructions. More compact than a 32 bit instruction, and almost as
    "far reaching". If you need more than 11 bits, even if you add a second
    MB instruction modifier 11 instructions later, you are still no worse
    off than an instruction modifier plus a 16 bit immediate.

    Of course, if you need more than one extra bit per instruction, then
    more "drastic" measures, such as your proposal, are needed.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Stephen Fuld on Mon Nov 20 17:51:46 2023
    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the
    instructions. So an alternative is to break the requirement that >>>>>>> all register specifier fields in the instruction be the same
    length.  So, for example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction
    like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the
    prefix instruction specify which register to use instead of the one
    specified in the reduced register specifier for whichever
    instructions in its shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.

    I am not sure what you are proposing here.  Can you show an example?

    Let us postulate an MoreBits instruction-modifier with a 16-bit immediate
    field. Now each 16-bit instruction, that has access to only 8 registers,
    strips off 2-bits/specifier, so now all its register specifiers are
    5-bits.
    The immediate supplies the bits and as bits are stripped off the Decoder
    shifts the field down by the consumed bits. When the last bit has been
    stripped off you would need another MB im to supply those bits. Since
    only 16-bit instructions are "limited" one MB should last about a
    basic block or extended basic block.

    Note I don't care how the bits are apportioned, formatted, consumed, ...

    Oh, so you have changed the meaning of the "immediate bit map" from specifying which of the following instructions it applies to (e.g.
    CARRY) to the actual data.  I like it!

    If using 16 bit instructions, and if you only have one small register
    field per instruction, I think it is better to make "MoreBits" a 16 bit instruction modifier itself, with say a five bit op code and an eleven
    bit immediate, which supplies the extra bit for the next 11
    instructions.  More compact than a 32 bit instruction, and almost as
    "far reaching".  If you need more than 11 bits, even if you add a second
    MB instruction modifier 11 instructions later, you are still no worse
    off than an instruction modifier plus a 16 bit immediate.

    Of course, if you need more than one extra bit per instruction, then
    more "drastic" measures, such as your proposal, are needed.




    Ironically, this is closer to how 32-bit ops were originally intended to
    work in BJX2, and how they worked in BJX1 (where most of the 32-bit ops
    were basically prefixes on the existing 16-bit SuperH ops).

    Say:
    ZnmZ //typical layout of a 16-bit op, R0..R15
    8Ceo-ZnmZ //Op gains an extra register field, and R16..R31.

    Then, in the original form of BJX2:
    ZZnm
    F0eo-ZZnm

    For some ops, the 3rd register (Ro) would instead operate as a 5-bit immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.



    When I later added the Imm9 encodings, the encoding of the other ops was changed to be more consistent with this:
    F0nm-ZeoZ
    F2nm-Zeii

    This was originally designed as a possible successor ISA, but it seemed "better" to back-fold it into my existing ISA (effectively replacing the original encoding scheme in the process).

    This encoding was relatively stable, until Jumbo prefixes were added and
    shook things up a little more (and the more recent shakeup with XG2,
    which has effectively fragmented the ISA into two sub-variants with
    neither being a "clear winner", *).

    *: The previous Baseline encoding is better for code density (due to
    still having 16-bit ops), XG2 is better for performance (due to more orthogonality, such as the ability to use every register from every instruction, and adding a bit to the Immed/Displacement fields, or 3 in
    the case of plain branches).



    Had considered possible options for "Make XG2's encoding less dog
    chewed", but the issue is not so simple as simply shifting the bits
    around (shuffling the bits would just make it dog-chewed in other ways).


    So, existing encoding, expressed in bits, is roughly:
    NMOP-ZwZZ-nnnn-mmmm ZZZZ-Qnmo-oooo-ZZZZ

    And the possible revised form:
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ


    However, what I have thus far would effectively amount to nearly a full
    reboot of the encoding (which would be a huge pile of effort), so less
    likely to be "worth it" in the name of a slightly less chewed encoding
    scheme (and, hell, RISC-V is going along OK with its immediate fields
    being effectively confetti).

    Though, another option could be closer to a straight reshuffle:
    NMOP-ZwZZ-nnnn-mmmm YYYY-Qnmo-oooo-XXXX
    NMIP-ZwZZ-nnnn-mmmm YYYY-Qnmi-iiii-iiii
    To:
    PwZZ-ZQnn-nnnn-YYYY-mmmm-mmoo-oooo-XXXX
    PwZZ-ZQnn-nnnn-YYYY-mmmm-mmii-iiii-iiii

    So, the existing ISA listing could be mapped over mostly as-is, with the
    main changes (besides the bit-reshuffle) being in the immediate field.

    However:
    DDDP-0w00-nnnn-mmmm 1100-dddd-dddd-dddd
    To:
    Pw00-0ddd-dddd-YYYY-dddd-dddd-dddd-dddd

    Is gonna need some new relocs, ...

    OTOH, it would allow making the F8 block's encoding consistent with the
    rest of the ISA.



    But, recently I am left feeling uncertain if any of this is anything
    more than moot...

    Did recently make a little bit of progress towards having a GUI in
    TestKern, in that I now have a console window with a shell "sorta" able
    to run inside this console.

    Has partly opened the "Pandora's box" though of needing to deal
    with multitasking, re-entrance, and the possible need to use
    mutex locking (as-is, it was "barely working" in that I had to carefully
    avoid re-entrance in a few areas to keep the kernel from exploding; as
    none of this stuff has mutexes).

    Well, and then having to fix-up issues like making the scheduler not try
    to schedule the syscall-handler task and then promptly causing the "OS"
    to explode (for now, these are special cased; I may need to come up with
    a general way of flagging some tasks as "do not schedule", since they
    will exist as special-cases to handle syscalls or specifically as the
    target of inter-process VTable calls, as is the case with TKGDI, where
    the call itself will schedule the task). Where, in this case, the
    mechanism for inter-task control flow will take a form resembling that
    of COM objects (it is likely that TKRA-GL may need to be reworked into
    this form as well, *2).


    Also looking like I will need to rework how the shell works.
    Effectively, now, rather than the CLI running directly in the kernel, it
    needs to be a userland (or "superuserland", *) task communicating with
    the kernel via syscalls. So, the shell can no longer directly invoke the PE/COFF loader, but will now need to use a "CreateProcess" call (and
    then probably sleep-loop until the created process terminates).
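
    (Roughly the shape that gives the shell, sketched in C; all the tk_* names
    and signatures here are made up, since the actual TestKern calls are not
    shown:)

        #include <stdio.h>

        extern int  tk_create_process(const char *path, char **argv);  /* hypothetical, returns pid */
        extern int  tk_process_alive(int pid);                         /* hypothetical */
        extern void tk_sleep_ms(int ms);                               /* hypothetical */

        /* Shell "run a program" path: ask the kernel to create the process,
           then sleep-loop until it terminates. */
        static int shell_run(const char *path, char **argv)
        {
            int pid = tk_create_process(path, argv);
            if (pid < 0)
            {
                printf("shell: failed to start %s\n", path);
                return -1;
            }
            while (tk_process_alive(pid))
                tk_sleep_ms(10);
            return 0;
        }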

    *: Where a task is being run more like a userland task, but still
    running in supervisor mode (the syscall handler task and TKGDI backend
    running in this mode).

    Where, say:
    Thread: Logical thread of execution within some existing process;
    Process: Distinct collection of 1 or more threads within a shared
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other thread-like entities (such as call and method handlers), may be either thread-like
    or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the only
    valid exit point for this task being where it transfers control back to
    the caller and awaits the next syscall to arrive; and it is not valid
    for this task to try to syscall back into itself).


    As-is, I am running a lot of tasks in userland, but for now there is effectively no real memory protection in TestKern; the plan is to
    try to resolve this. This is itself work; needing to gradually weed-out programs accessing privileged resources; and in some system-level APIs
    needing to distinguish "Local" from "Global" memory ("malloc" will give
    local memory, whereas "tkgGlobalAlloc" will give global memory; the idea
    being for now that global memory will be identity mapped and accessible
    across process boundaries).
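
    (A sketch of what that split looks like to calling code; tkgGlobalAlloc is
    the name used above, while everything else here, including the free and send
    calls, is invented:)

        #include <stdlib.h>
        #include <string.h>

        extern void *tkgGlobalAlloc(size_t sz);   /* identity-mapped, crosses process boundaries */
        extern void  tkgGlobalFree(void *p);      /* assumed counterpart, hypothetical           */
        extern int   fooSendBufferToServer(void *buf, size_t sz);   /* hypothetical              */

        static int send_message(const char *text)
        {
            size_t n      = strlen(text) + 1;
            char  *local  = malloc(n);            /* local heap: fine within this process only   */
            char  *shared = tkgGlobalAlloc(n);    /* global: safe to hand to another task        */
            if (!local || !shared)
            {
                free(local);
                if (shared) tkgGlobalFree(shared);
                return -1;
            }
            strcpy(local, text);                  /* scratch work stays in local memory          */
            memcpy(shared, local, n);             /* only the global buffer crosses the boundary */
            int rc = fooSendBufferToServer(shared, n);
            tkgGlobalFree(shared);
            free(local);
            return rc;
        }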

    Doesn't "yet" matter, but easier to try to address this now than later.



    *2: For TKRA-GL, it generally needs to work with physically mapped
    memory and MMIO to access the rasterizer module, which means the backend
    parts will likely need to run either in "superuserland" or in "kernel land".

    Likely rework is to try to separate the OpenGL API front-end from some
    backend machinery, which will be a more narrowly focused interface
    mostly dealing with things like:
    Uploading textures and similar;
    Drawing vertex arrays.
    All the things like glEnable/glDisable, matrix-stack manipulations, etc,
    will need to be kept in the front-end (making a context switch every
    time the program used glEnable or glColor4f or similar would be an
    impractical level of overhead).
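
    (One possible shape for that narrow back-end interface, sketched in the same
    COM-like vtable style used for TKGDI below; the interface itself is invented:)

        #include <stdint.h>

        typedef struct glbk_context_s glbk_context;

        typedef struct {
            int (*UploadTexture)(glbk_context *ctx, int texid, int w, int h,
                                 const void *rgba);
            int (*DrawVertexArrays)(glbk_context *ctx, int prim, int nverts,
                                    const float *xyz, const float *st,
                                    const uint8_t *rgba,
                                    const float *mvp,      /* matrices resolved by the front end    */
                                    uint32_t state_bits);  /* glEnable/etc state, flattened at draw */
        } glbk_vtable;

        struct glbk_context_s { const glbk_vtable *vt; };

        /* Front end accumulates glEnable/glColor4f/matrix state locally and only
           crosses the task boundary when it actually draws. */
        static int frontend_flush_draw(glbk_context *ctx, int prim, int n,
                                       const float *xyz, const float *st,
                                       const uint8_t *rgba,
                                       const float *mvp, uint32_t state_bits)
        {
            return ctx->vt->DrawVertexArrays(ctx, prim, n, xyz, st, rgba, mvp, state_bits);
        }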


    Though, in Windows, the division point seems to be a little higher
    (closer to the level of the OpenGL API itself). To mimic the Windows
    model, I would effectively need two division points:

    A front-end interface whose purpose is mostly to wrap over a bunch of "GetProcAddress" funk (with some way to plug in an interface to provide
    the GetProcAddress backend). This isn't asking too much more, since one
    needs to provide all the GetProcAddress cruft either way.

    A division interface between the frontend part which needs to run
    directly in the userland task, and the backend part which deals with the "actually making stuff happen" parts.

    One could design a lower-level API for this latter part, but
    (ironically) it would probably end up sort of resembling some sort of
    weird OpenGL/Direct3D hybrid...

    Though, could still do like TKGDI and provide a C wrapper over the
    internal VTable calls.
    HRESULT fooDooTheThing()
    {
        fooContext *ctx;
        ctx=fooGetCurrentContext();
        return(ctx->vt->DooTheThing(ctx));
    }
    ...


    A lot of this stuff gets kind of annoying sometimes though...

    Like, one can't just "do the thing", they end up needing a bunch of
    layers and boilerplate getting from "the place where the thing needs to
    be done" to "the place where the thing can be done" (but, I guess, the
    other alternative being to effectively not have an OS at all).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Tue Nov 21 22:12:18 2023
    BGB wrote:

    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:


    For some ops, the 3rd register (Ro) would instead operate as a 5-bit immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.

    Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit register specifier into a 5-bit immediate of either positive or negative integer
    value. This makes::

    1<<n
    ~0<<n
    container.bitfield = 7;

    single instructions.


    Where, say:
    Thread: Logical thread of execution within some existing process;
             has a register file and a stack.
    Process: Distinct collection of 1 or more threads within a shared
             has a memory map, a heap, and a vector of threads.
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other thread-like entities (such as call and method handlers), may be either thread-like
    or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.

    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the only valid exit point for this task being where it transfers control back to
    the caller and awaits the next syscall to arrive; and it is not valid
    for this task to try to syscall back into itself).

    In My 66000, every <effective> SysCall goes deeper into the privilege hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
    Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Tue Nov 21 22:47:26 2023
    On 2023-11-21 5:12 p.m., MitchAlsup wrote:
    BGB wrote:

    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:


    For some ops, the 3rd register (Ro) would instead operate as a 5-bit
    immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.

    Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit
    register
    specifier into a 5-bit immediate of either positive or negative integer value. This makes::

        1<<n
       ~0<<n
        container.bitfield = 7;

    single instructions.

    Q+ CPU allows immediates of any length to be used in place of source
    operand register values via postfix instructions. Virtually all
    instructions may use immediates instead of registers. There are also
    quick immediate form instructions that have the second source operand as
    an immediate constant encoded directly in the instruction as this is the
    most common use.

    The postfix immediate instructions come in four lengths: 23-bit, 39-bit,
    71-bit, and 135-bit. Currently float values make use of only 32 or 64 bits
    out of the 39 and 71-bit formats. I have been pondering having the float immediates left aligned with additional trailing bits. These bits are
    zero for now.

    Postfixes are treated as part of the current instruction by the CPU.


    Where, say:
    Thread: Logical thread of execution within some existing process;
             has a register file and a stack.
    Process: Distinct collection of 1 or more threads within a shared
             has a memory map a heap and a vector of threads.
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other
    thread-like entities (such as call and method handlers), may be either
    thread-like or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.

    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the
    only valid exit point for this task being where it transfers control
    back to the caller and awaits the next syscall to arrive; and it is
    not valid for this task to try to syscall back into itself).

    In My 66000, every <effective> SysCall goes deeper into the privilege hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV, Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    Does it follow the same way for hardware interrupts? I think RISCV goes
    to the deepest level first, machine level, then redirects to lower
    levels as needed. I was planning on Q+ operating the same way.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Tue Nov 21 21:36:30 2023
    On 11/21/2023 4:12 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:


    For some ops, the 3rd register (Ro) would instead operate as a 5-bit
    immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.

    Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit
    register
    specifier into a 5-bit immediate of either positive or negative integer value. This makes::

        1<<n
       ~0<<n
        container.bitfield = 7;

    single instructions.


    Originally, the pattern depended on the 16-bit operation, IIRC:
    (Rm), Rn => (Rm, Disp5), Rn
    (Rm, R0), Rn => (Rm, Ro), Rn
    ALU Ops:
    OP Rm, Rn => OP Rm, Ro, Rn
    OP Rm, R0, Rn => OP Rm, Imm5u, Rn

    Initially, BJX2 started out in a similar camp to BJX1, but when it
    became obvious that the 16-bit and 32-bit encodings effectively needed
    separate encoders, there was no real point keeping up the concept of
    32-bit ops being prefix-extended 16-bit ops.


    Then some other analysis/testing showed that for "general case
    tradeoffs", it was better to have an ISA with primarily 32-bit encodings
    with a 16-bit subset, than one with primarily 16-bit encodings with
    32-bit extended forms (though, by this point, I had already settled on
    the general encoding scheme).

    The main practical consequence of this realization was that the ISA did
    not need to be able to operate entirely within the limits of the 16-bit encoding space (but, did need to be able to operate without any of the
    16-bit encodings).


    After more development, I now have:
    Imm5u/Disp5u, some ops (Baseline)
    Imm6s/Disp6s (XG2)
    Imm9u: Typical ALU ops
    Imm10u (XG2)
    Imm9n: A few ALU ops
    Imm10n (XG2)
    Disp9u: LD/ST ops
    Disp10s (XG2)
    TBD if Disp10u+Disp6s would have been better.
    Since negative displacements are still pretty rare.
    Might have been better to have larger positive displacements.
    Imm10{u/n}: Various 2RI ops
    Imm11{u/n} {XG2}
    Disp11s / Disp12s (XG2), Branch-Compare-Zero
    Effectively uses an opcode bit as the sign bit.
    Imm16u/Imm16n: Some 2RI ops.
    Disp20s: BRA/BSR
    Disp23s (XG2)
    Imm24{u/n}: LDIZ/LDIN ("MOV Imm25s, R0")

    However, they are only available in specific combinations.
    Imm9u: ADD, ADDS.L, ADDU.L, AND, OR, XOR, SH{A/L}D{L/Q}, MULS, MULU
    Imm9n: ADD, ADDS.L, ADDU.L

    Which does mean, say:
    y=x&(~7);
    Needs either to load a constant into a register, or use a jumbo prefix.


    The Disp9u/Disp10s encoding exists on all basic Load/Store ops, however "special" ops (like XMOV.x) only have Disp5u/Disp6s encodings (not a
    huge loss though).

    With a Jumbo-Imm prefix, many of the Disp/Imm cases expand to 33 bits
    (except Disp5 which only goes to 29 bits).



    Where, say:
    Thread: Logical thread of execution within some existing process;
             has a register file and a stack.
    Process: Distinct collection of 1 or more threads within a shared
             has a memory map a heap and a vector of threads.
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other
    thread-like entities (such as call and method handlers), may be either
    thread-like or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.


    Yeah.

    As-is, I have several major interrupt handlers:

    Fault: Something has gone wrong, current handling is to stall the CPU
    until reset (and/or terminate the emulator). Could in principle do other
    things.

    IRQ: Deals with timer, may potentially be used for preemptive task
    scheduling (code is in place, but this is not currently enabled). Does
    not currently perform any other "complex" actions (and the "practical"
    use of IRQ's remains limited in my case, due in large part to the
    limitations of interrupt handling).

    TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
    action if a "page fault" style event occurs (or something needs to be
    paged in/paged out from the swapfile).

    SYSCALL: Mostly initiates task switches and similar, and little else.


    Unlike x86, the design of the interrupt mechanisms means it isn't
    practical to hang the whole OS off of an interrupt handler. The closest
    option is mostly to use the interrupt handlers to trigger context
    switches (which is, ironically, slightly less of an issue, as many of
    the "hard" parts of a context switch are already performed for sake of
    dealing with the "rather minimalist" interrupt mechanism).


    Basically, in this design, it isn't possible to enter a new interrupt
    without first returning from the prior interrupt (at least not without
    f*ing the CPU state). And, as-is, interrupts can only operate in
    physically addressed mode.

    They also need to manually save and restore all the registers, since
    unlike either SuperH or RISC-V, BJX2 does not have any banked registers
    (apart from SP/SSP, which switch places when entering/leaving an ISR).

    Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
    8086 real-mode, it doesn't implicitly push anything to the stack (nor
    have an "interrupt vector table").


    So, the interrupt handling is basically a computed branch, which was
    about the cheapest mechanism I could come up with at the time.

    Did create a little bit of a puzzle initially as to how to get the CPU
    state saved off and restored with no free registers. Though, there are a
    few CR's which capture the CPU state at the time the ISR happens (these registers getting overwritten every time a new interrupt occurs).


    So, say:
    Interrupt entry:
      Copy low bits of SR into high bits of EXSR;
      Copy PC into SPC.
      Copy fault address into TEA;
      Swap SP and SSP (*1);
      Set CPU flags to Supervisor+ISR mode;
        CPU Mode bits now copied from high bits of VBR.
      Computed branch relative to VBR.
        Offset depends on interrupt category.
    Interrupt return (RTE):
      Copy EXSR bits back into SR;
      Unswap SP/SSP (*1);
      Branch to SPC.
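
    (The same sequence, modeled schematically in C; field widths, flag bit
    positions, and the per-category offset are not given above, so they are
    purely placeholders here:)

        #include <stdint.h>

        #define SR_SUPERVISOR (1ull << 62)   /* bit positions invented */
        #define SR_ISR        (1ull << 61)

        typedef struct { uint64_t SR, EXSR, PC, SPC, TEA, SP, SSP, VBR; } cpu_regs;

        static void interrupt_entry(cpu_regs *c, uint64_t fault_addr, uint64_t cat_offset)
        {
            uint64_t t;
            c->EXSR = (c->EXSR & 0xFFFFFFFFull) | ((c->SR & 0xFFFFFFFFull) << 32);
                                              /* low SR bits saved in high EXSR bits (widths illustrative) */
            c->SPC  = c->PC;                  /* return address                          */
            c->TEA  = fault_addr;             /* faulting address, if any                */
            t = c->SP; c->SP = c->SSP; c->SSP = t;   /* swap SP and SSP                  */
            c->SR  |= SR_SUPERVISOR | SR_ISR; /* mode bits also come from high VBR bits (not modeled) */
            c->PC   = (c->VBR & ~0xFFull) + cat_offset;  /* computed branch, VBR 256B-aligned */
        }

        static void interrupt_return(cpu_regs *c)
        {
            uint64_t t;
            c->SR = (c->SR & ~0xFFFFFFFFull) | (c->EXSR >> 32);   /* restore saved SR bits */
            t = c->SP; c->SP = c->SSP; c->SSP = t;                /* unswap SP/SSP         */
            c->PC = c->SPC;                                       /* branch to SPC         */
        }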


    *1: At the time, couldn't figure a good way to shave more logic off the mechanism. Though, now, the most obvious candidate would be to
    eliminate the implicit SP/SSP swapping (this part is currently handled
    in the instruction decoder).

    So, instead, the ISR entry point would do something like:
        MOV   SP, SSP
        MOV   0xDE00, SP    //Designated ISR stack SRAM
        MOV.Q R0, (SP, 0)
        MOV.Q R1, (SP, 8)
        ... Now save off everything else ...

    But, didn't really think of it at the time.


    There is already the trick of requiring VBR to be aligned (currently 64B
    in practice; formally 256B), mostly so as to allow the "address
    computation" to be done via bit-slicing.

    Not sure if many CPUs have a cheaper mechanism here...



    Note that in my case, generally the interrupt handlers are written in C,
    with the compiler managing all the ISR prolog/epilog stuff (mostly saving/restoring pretty much the entire CPU state to the ISR stack).

    Generally, the ISR's also need to deal with having a comparably small
    stack (with 0.75K already used for the saved CPU state).

    Where:
    0000..7FFF: Boot ROM
    8000..BFFF: (Optional) Extended Boot ROM
    C000..DFFF: Boot/ISR SRAM
    E000..FFFF: (Optional) Extended SRAM

    Generally, much of the work of the context switch is pulled off using
    "memcpy" calls (with the compiler providing a special "__arch_regsave"
    variable giving the address of the location it has dumped the CPU
    registers into; which in turn covers most of the core state that needs
    to be saved/restored for a process context switch).
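
    (A sketch in C of that memcpy-based switch; __arch_regsave is named above,
    while the task structure, the 768-byte size, and the function itself are
    assumptions for illustration:)

        #include <string.h>
        #include <stdint.h>

        #define REGSAVE_BYTES 768                 /* "0.75K already used for the saved CPU state" */

        extern uint8_t *__arch_regsave;           /* provided by the compiler (exact type assumed) */

        typedef struct task_s {
            uint8_t saved_regs[REGSAVE_BYTES];
            /* ... other per-task state ... */
        } task;

        /* Called from inside the ISR, after the prolog has dumped the registers. */
        static void context_switch(task *from, task *to)
        {
            memcpy(from->saved_regs, __arch_regsave, REGSAVE_BYTES);  /* save outgoing task */
            memcpy(__arch_regsave, to->saved_regs, REGSAVE_BYTES);    /* load incoming task */
            /* the ISR epilog then reloads registers from __arch_regsave,
               so the RTE resumes execution in 'to' */
        }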

    Though, I guess one other possibility would be if the compiler-generated
    ISR code assumed TBR to always be valid (and then copied the registers
    to a fixed location relative to TBR instead of the ISR stack), which
    could in-theory allow for faster context switching (by eliminating the
    need for the memcpy calls), but would be a bit more brittle (if TBR is
    invalid, stuff is going to break pretty hard as soon as an interrupt
    happens).

    Would likely need special compiler attributes for this (would not make
    sense for interrupts which do not, or are unlikely to, perform a context switch).


    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the
    only valid exit point for this task being where it transfers control
    back to the caller and awaits the next syscall to arrive; and it is
    not valid for this task to try to syscall back into itself).

    In My 66000, every <effective> SysCall goes deeper into the privilege hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV, Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    No way to handle a syscall recursively in my case, partly because of how
    the task works:
    It gets started at a certain location, and switches off at the point
    where it would receive a syscall request.

    So, sort of like:
    ... //initial task setup
    TK_Task_SyscallReturnToUser(task);
    while(1)
    {
        TK_Task_SyscallGetArgs(&task, &sobj, &umsg, &rptr, &args);
        //handle the syscall
        TK_Task_SyscallReturnToUser(task);
    }
    Whenever ReturnToUser returns, it expects there to be a syscall request
    for it to handle. This call effectively transfers control back to the
    caller task, with the syscall task ready to receive a new request.

    SyscallGetArgs basically invokes "arcane magic" to fetch the parameters
    for the task that performed the syscall (the dispatch mechanism stashes
    the parameters in a designated location in the syscall handler's task
    context).


    However, if the Syscall task itself tries to invoke yield, or otherwise triggers a context switch, then it will not be at the correct location
    to handle a syscall if one were to arrive (at which point, the OS explodes).

    Or, if it tries to perform a syscall, then the syscall attempt will
    return immediately (since it effectively performs a context which back
    to itself).


    Granted, it is possible that the SYSCALL dispatcher could be made to
    dispatch among one of multiple SYSCALL tasks, which could then handle up
    to N levels of recursion.

    On a multi-core system, each core would also need its own syscall tasks
    (well, and/or they operate round-robin, and the syscall is directed at whichever task is in the correct state to handle a request).

    There is a little flexibility here, at least in as far as pretty much
    the whole mechanism is managed in software in this case (apart from the
    ISR mechanism itself).



    Note that for inter-task method-calls, a similar mechanism is used to
    normal syscalls, except:
    A range of special syscall numbers is used as a VTable index;
    The object's VTable implicitly encodes the PID of the task to
    dispatch the request to.

    So, instead of waiting for syscalls, it waits for method calls, and then dispatches them as needed (locally) when they arrive.

    On the receiver end, there is a mechanism to compose the VTable
    interface, where the VTable is effectively composed of methods whose
    sole purpose is to invoke a syscall, passing the argument list and
    similar off to a handler, with the syscall number based on the method's location within the VTable.

    Then, the SYSCALL ISR sees this, and then fetches the corresponding task
    to dispatch to, ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Nov 22 18:38:00 2023
    BGB wrote:

    On 11/21/2023 4:12 PM, MitchAlsup wrote:
    BGB wrote:

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.


    Yeah.

    As-is, I have several major interrupt handlers:

    Fault: Something has gone wrong, current handling is to stall the CPU
    until reset (and/or terminate the emulator). Could in premise do other things.

    I call these checks:: a page fault is an unanticipated SysCall to the
    Guest OS page fault handler; whereas a check is something that should
    never happen but did (ECC repair fail): These trap to Real HV.

    IRQ: Deals with timer, may potentially be used for preemptive task
    scheduling (code is in place, but this is not currently enabled). Does
    not currently perform any other "complex" actions (and the "practical"
    use of IRQ's remains limited in my case, due in large part to the
    limitations of interrupt handling).

    Every My 66000 process has its own event table which combines exceptions, interrupts, SysCalls,... This means there is no table surgery when switching between Guest OS and Guest Hypervisor and Real Hypervisor.

    TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
    action if a "page fault" style event occurs (or something needs to be
    paged in/paged out from the swapfile).

    HW table walking.

    SYSCALL: Mostly initiates task switches and similar, and little else.

    Part of Event table.

    Unlike x86, the design of the interrupt mechanisms means it isn't
    practical to hang the whole OS off of an interrupt handler. The closest option is mostly to use the interrupt handlers to trigger context
    switches (which is, ironically, slightly less of an issue, as many of
    the "hard" parts of a context switch are already performed for sake of dealing with the "rather minimalist" interrupt mechanism).

    My 66000 can perform a context switch (user->user) in a single instruction.
    Old state goes to memory, new state comes from memory; by the time
    state has arrived, you are fetching instructions in the new context
    under the new context MMU tables and privileges and priorities.

    Basically, in this design, it isn't possible to enter a new interrupt
    without first returning from the prior interrupt (at least not without
    f*ing the CPU state). And, as-is, interrupts can only operate in
    physically addressed mode.

    They also need to manually save and restore all the registers, since
    unlike either SuperH or RISC-V, BJX2 does not have any banked registers (apart from SP/SSP, which switch places when entering/leaving an ISR).

    Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
    8086 real-mode, it doesn't implicitly push anything to the stack (nor
    have an "interrupt vector table").


    So, the interrupt handling is basically a computed branch; which was basically about the cheapest mechanism I could come up with at the time.

    Did create a little bit of a puzzle initially as to how to get the CPU
    state saved off and restored with no free registers. Though, there are a
    few CR's which capture the CPU state at the time the ISR happens (these registers getting overwritten every time a new interrupt occurs).

    Why not just treat the RF as a cache with a known address in physical memory. In MY 66000 that is what I do and then just push and pull 4 cache lines at a time.

    So, say:
    Interrupt entry:
    Copy low bits of SR into high bits of EXSR;
    Copy PC into SPC.
    Copy fault address into TEA;
    Swap SP and SSP (*1);
    Set CPU flags to Supervisor+ISR mode;
    CPU Mode bits now copied from high bits of VBR.
    Computed branch relative to VBR.
    Offset depends on interrupt category.
    Interrupt return (RTE):
    Copy EXSR bits back into SR;
    Unswap SP/SSP (*1);
    Branch to SPC.

    Interrupt Entry Point::
    // by this point all the old registers have been saved where they
    // are supposed to go, and the interrupt dispatcher registers are
    // already loaded up and ready to go, and the CPU is running at
    // whatever privilege level was specified.
    HR R1<-WHY
    LD IP,[IP,R1<<3,InterruptVectorTable] // Call through table
    RTI
    //
    InterruptHandler0:
    // do what is necessary
    // note this can all be written in C
    RET
    InterruptHandler1::



    *1: At the time, couldn't figure a good way to shave more logic off the
    mechanism. Though, now, the most obvious candidate would be to
    eliminate the implicit SP/SSP swapping (this part is currently handled
    in the instruction decoder).

    So, instead, the ISR entry point would do something like:
       MOV    SP, SSP
       MOV    0xDE00, SP  //Designated ISR stack SRAM
       MOV.Q  R0, (SP, 0)
       MOV.Q  R1, (SP, 8)
       ... Now save off everything else ...

    But, didn't really think of it at the time.


    There is already the trick of requiring VBR to be aligned (currently 64B
    in practice; formally 256B), mostly so as to allow the "address
    computation" to be done via bit-slicing.

    Not sure if many CPUs have a cheaper mechanism here...

    Treat the CPU state and the register state as cache lines and have
    HW shuffle them in and out. You can even start the 5 cache line reads
    before you start the CPU state writes; saving latency (which you cannot
    do using SW-only methods).

    Note that in my case, generally the interrupt handlers are written in C,
    with the compiler managing all the ISR prolog/epilog stuff (mostly saving/restoring pretty much the entire CPU state to the ISR stack).

    My 66000 compiler remains blissfully ignorant of ISR prologue and
    epilogue and it still works.

    Generally, the ISR's also need to deal with having a comparably small
    stack (with 0.75K already used for the saved CPU state).

    Where:
    0000..7FFF: Boot ROM
    8000..BFFF: (Optional) Extended Boot ROM
    C000..DFFF: Boot/ISR SRAM
    E000..FFFF: (Optional) Extended SRAM

    Generally, much of the work of the context switch is pulled off using "memcpy" calls (with the compiler providing a special "__arch_regsave" variable giving the address of the location it has dumped the CPU
    registers into; which in turn covers most of the core state that needs
    to be saved/restored for a process context switch).
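
    As a rough sketch (not the actual TestKern code) of what such a
    memcpy-based switch might look like, assuming the compiler-provided
    "__arch_regsave" pointer described above; the struct name and the size
    of the save area are made up for illustration:

      /* Hypothetical sketch of a memcpy-based context switch. The ISR
         prolog has already dumped the CPU registers at *__arch_regsave;
         field names and sizes here are illustrative only. */
      #include <string.h>

      extern unsigned char *__arch_regsave;   /* filled in by the ISR prolog */

      struct task_ctx {
          unsigned char regs[768];            /* ~0.75K of saved CPU state   */
      };

      void switch_task(struct task_ctx *cur, struct task_ctx *next)
      {
          /* save the interrupted task's registers out of the ISR save area */
          memcpy(cur->regs, __arch_regsave, sizeof(cur->regs));
          /* load the next task's registers into the ISR save area; the ISR
             epilog then reloads the CPU from it and RTEs into 'next' */
          memcpy(__arch_regsave, next->regs, sizeof(next->regs));
      }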

    Why not just make the HW push and pull cache lines.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Wed Nov 22 19:36:28 2023
    Robert Finch wrote:

    On 2023-11-21 5:12 p.m., MitchAlsup wrote:

    In My 66000, every <effective> SysCall goes deeper into the privilege
    hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
    Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    Does it follow the same way for hardware interrupts? I think RISCV goes
    to the deepest level first, machine level, then redirects to lower
    levels as needed. I was planning on Q+ operating the same way.

    It depends; there is the school of thought that you just deliver control to
    someone who can always deal with it (Machine level in RISC-V), and there
    is the other school of thought that some table should encode which level
    of the system control is delivered to. The former allows SW to control
    every step of the process, the latter gets rid of all the SW checking
    and simplifies the process of getting to and back from interrupt handlers
    (and their associated soft IRQs).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Nov 22 17:17:30 2023
    Why not just treat the RF as a cache with a known address in physical memory. In MY 66000 that is what I do and then just push and pull 4 cache lines at a time.

    Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
    it's not also a pun on the TI 9900.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Nov 22 23:58:19 2023
    Stefan Monnier wrote:

    Why not just treat the RF as a cache with a known address in physical memory.
    In MY 66000 that is what I do and then just push and pull 4 cache lines at a time.

    Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
    it's not also a pun on the TI 9900.

    In reverence to the CDC 6600, not derived from it.

    Exchange Jump on the CDC 6600 caused a context switch that took 16+10 processor cycles
    (after the scoreboard cleared.) And on the 6600, NOS was in the PPs and the CPUs
    were there to just crunch numbers.

    I have a hard real time version of My 66000 where the lower levels of the OS
    are in HW, and if you have fewer than 1024 threads running, you do not expend
    any (zero, 0, nada, zilch) cycles in the OS performing context switches or
    priority alterations. This system has the property that if an interrupt (or
    message) arrives to unblock a waiting thread that is of higher priority than
    any CPU in its affinity group of CPUs, then the lowest priority CPU in that
    group receives the higher priority thread (without an excursion through the
    OS (damaging cache state)).

    I have a Linux friendly version where context switch is a single instruction.
    When you write a context pointer, that entire context is now available to
    support whatever you want it to support. So, an unprivileged application can
    context switch to another unprivileged application by writing a single
    control register, leaving Guest OS, Guest HV and Real HV in their original
    configuration. Guest OS can context switch to a different Guest OS in a
    single instruction, and then the Guest OS receiving control needs to context
    switch to an application it wants to run--so 20-ish cycles to perform a
    Guest OS switch. (This now costs typical old architectures 10,000 cycles.)

    But nowhere does any thread receiving control have to execute any state or register saving or restoring... Just like Exchange Jump.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Wed Nov 22 21:50:30 2023
    On 11/22/2023 12:38 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/21/2023 4:12 PM, MitchAlsup wrote:
    BGB wrote:

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to
    context switch back to the task that made the request, or to yield
    to another task, ...).

    We call these things:: dispatchers.


    Yeah.

    As-is, I have several major interrupt handlers:

    Fault: Something has gone wrong, current handling is to stall the CPU
    until reset (and/or terminate the emulator). Could in premise do other
    things.

    I call these checks:: a page fault is an unanticipated SysCall to the
    Guest OS page fault handler; whereas a check is something that should
    never happen but did (ECC repair fail): These trap to Real HV.


    A lot of things here are things that could be handled, but are not
    currently handled:
    Invalid instructions;
    Access to invalid memory regions;
    Access to memory in a way which violates access protections;
    A branch to an invalid address;
    Code used the BREAK instruction or similar;
    Etc.

    Generally at present, if any of these happens, it means that something
    has gone badly enough that I want to stall immediately and probably
    debug it.

    In a "real" OS, if this happens in userland, one would typically turn
    this into "SEGFAULT" or similar.


    For the emulator, if a BREAK occurs in ISR mode (or any other fault
    happens in ISR mode), it causes the emulator to stop execution, dump a backtrace and registers, and then terminate. Otherwise, exiting the
    emulator normally will dump a bunch of profiling information (this part
    is not done if the emulator terminates due to a fault).

    Stalling the Verilog core causes it to dump the state of the pipeline and
    some other things via "$display" (potentially relevant for debugging), and
    allows seeing the crash PC on the 7-segment display on the Nexys A7.


    IRQ: Deals with timer, may potentially be used for preemptive task
    scheduling (code is in place, but this is not currently enabled). Does
    not currently perform any other "complex" actions (and the "practical"
    use of IRQ's remains limited in my case, due in large part to the
    limitations of interrupt handling).

    Every My 66000 process has its own event table which combines exceptions,
    interrupts, SysCalls, ... This means there is no table surgery when switching
    between Guest OS and Guest Hypervisor and Real Hypervisor.


    In my case, the VBR register is global (and set up during boot).

    Any per-process event dispatching would need to be handled in software.


    I didn't go with an x86-style IDT or similar partly because this would
    have been significantly more expensive (in terms of Verilog code and
    LUTs) than the existing mechanism. The role of an x86-style IDT could be
    faked in software though.
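
    A software-faked IDT could be little more than a table of handler
    pointers (global, or per-task) that the common ISR indexes by event
    category. A minimal sketch, with all names hypothetical and not taken
    from the actual TestKern code:

      /* Minimal sketch of faking IDT-style dispatch in software: the single
         hardware entry point looks up a handler in a table. */
      enum { EV_RESET, EV_FAULT, EV_IRQ, EV_TLBMISS, EV_SYSCALL, EV_MAX };

      typedef void (*event_handler_t)(void);

      static event_handler_t event_table[EV_MAX];   /* could also be per-task */

      void isr_common(int event_category)
      {
          event_handler_t h = event_table[event_category];
          if (h)
              h();                        /* dispatch entirely in software  */
          /* else: fall through to some default (e.g. stall/debug) behavior */
      }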


    So, VBR is sort of like:
    (63:48): Encodes CPU state to use on ISR entry;
    (47: 6): Encodes the ISR entry point.
    In practice only (28:6) are "actually usable".
    ( 5: 0): Must be Zero

    Where, low-order bits are replaced with an entry offset:
    00: RESET
    08: FAULT
    10: IRQ
    18: TLBMISS
    20: SYSCALL
    28: Reserved

    The 8 bytes of space gives enough room to encode a relative or absolute
    branch to the actual entry point (while not being so big as to be
    needlessly wasteful).
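
    As an illustration of how cheap this dispatch is with an aligned VBR,
    forming the entry address is just bit-slicing and concatenation; roughly
    along these lines (a C sketch of the behavior, not the actual Verilog;
    the exact mask is an assumption based on the field layout above):

      /* Sketch of the computed-branch target formation: keep VBR bits
         (47:6), drop the mode bits (63:48) and alignment bits (5:0), and
         OR in the 8-byte-spaced entry offset. */
      #include <stdint.h>

      uint64_t isr_entry_address(uint64_t vbr, unsigned category)
      {
          uint64_t base = vbr & 0x0000FFFFFFFFFFC0ull;  /* bits 47:6 only   */
          return base | ((uint64_t)category << 3);      /* 8-byte spacing   */
      }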

    During CPU reset, VBR is cleared to 0, and then control is transferred
    to 0, which branches to the ROM's entry point.

    The use of a computed branch was preferable to a "vector table" as the
    vector table would have required some mechanism for the CPU to perform a
    memory load to get the address. Computed branch was easier, since no
    special memory load is needed, just branch there, and assume this lands
    on a branch instruction which takes control where it needs to go.


    TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
    action if a "page fault" style event occurs (or something needs to be
    paged in/paged out from the swapfile).

    HW table walking.


    Yeah, no page-table hardware in my case.


    Had on/off considered an "Inverted Page-Table" like in IA-64, but this
    still seemed to be annoyingly expensive vs the "Throw a TLB-Miss
    Exception" route. Even if I eliminated the TLB-Miss logic, would then
    need to have Page-Fault logic, which doesn't really save anything there
    either.

    There is a designated register though for the page-table: TTB.

    With the considered inverted-page-table using a separate VIPT register,
    the idea being that VIPT would point to a region of, say, 4096x4x128b
    TLBE's (~256K), effectively functioning as a RAM-backed L3 TLB. If this
    table lacked the requested TLBE, this would still result in a TLB Miss
    fault.

    Note that the idea was still that trying to use 96-bit virtual address
    mode would require two TLBE's, effectively halving associativity. This
    in turn requires plain modulo-addressing as hashing can create a "bad situation" where a 2-way TLB will get stuck in an infinite loop (but
    this infinite loop scenario is narrowly averted with modulo addressing).

    Granted, 4-way is still better as it seems to result in a comparably
    lower TLB miss rate.

    It is still possible though to XOR the TLBE's index with a bit-pattern
    derived from the ASID, to slightly reduce the cost of context switches
    in some cases (if multiple address spaces were being used).
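
    A sketch of the index computation being described (plain power-of-two
    modulo indexing, with an optional ASID-derived XOR); the set count, page
    shift, and hash pattern here are made up for illustration:

      /* Sketch of TLB set-index selection; sizes/bit positions illustrative. */
      #include <stdint.h>

      #define TLB_SETS    256u            /* hypothetical, power of two      */
      #define PAGE_SHIFT  14              /* 16K pages, for illustration     */

      unsigned tlb_index(uint64_t vaddr, uint16_t asid, int xor_with_asid)
      {
          unsigned idx = (unsigned)(vaddr >> PAGE_SHIFT) & (TLB_SETS - 1);
          if (xor_with_asid)
              idx ^= (asid ^ (asid >> 8)) & (TLB_SETS - 1);
          return idx;
      }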


    Note that the L1 I$ and D$ can get along reasonably well with an
    optional 32-entry 1-way "Micro-TLB".


    SYSCALL: Mostly initiates task switches and similar, and little else.

    Part of Event table.


    All software in my case.


    Unlike x86, the design of the interrupt mechanisms means it isn't
    practical to hang the whole OS off of an interrupt handler. The
    closest option is mostly to use the interrupt handlers to trigger
    context switches (which is, ironically, slightly less of an issue, as
    many of the "hard" parts of a context switch are already performed for
    sake of dealing with the "rather minimalist" interrupt mechanism).

    My 66000 can perform a context switch (user->user) in a single instruction.
    Old state goes to memory, new state comes from memory; by the time
    state has arrived, you are fetching instructions in the new context
    under the new context MMU tables and privileges and priorities.


    Yeah, but that is not exactly minimalist in terms of the hardware.

    Granted, burning around 1 kilocycle of overhead per syscall isn't ideal either...


    Eg:
    Save registers to ISR stack;
    Copy registers to User context;
    Copy handler-task registers to ISR stack;
    Reload registers from ISR stack;
    Handle the syscall;
    Save registers to ISR stack;
    Copy registers to Syscall context;
    Copy User registers to ISR stack;
    Reload registers from ISR stack.


    Does mean that one needs to be economical with syscalls (say, doing
    "printf" a whole line at a time, rather than individual characters, ...).

    And, did create incentive to allow getting the microsecond-clock value
    and hardware RNG values from CPUID rather than needing a syscall (say,
    don't want to burn 20us to check the microsecond counter, ...).
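
    For the "printf a whole line at a time" point above, a trivial
    line-buffering layer is enough to amortize the syscall cost; a sketch,
    where "__sys_write" is a hypothetical stand-in for whatever the actual
    syscall wrapper is:

      /* Only cross into the kernel once per line (or full buffer), rather
         than once per character. */
      #include <stddef.h>

      extern void __sys_write(const char *buf, size_t len);  /* hypothetical */

      static char   out_buf[256];
      static size_t out_len;

      void buf_putc(char c)
      {
          out_buf[out_len++] = c;
          if (c == '\n' || out_len == sizeof(out_buf)) {
              __sys_write(out_buf, out_len);   /* one syscall per line/chunk */
              out_len = 0;
          }
      }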


    If the "memcpy's" could be eliminated, this could roughly halve the cost
    of doing a syscall.


    One other option would be to do like RISC-V's privileged spec and have
    multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    Worth the cost? Dunno.


    Not too much different to modern Windows, where slow syscalls are still
    fairly common (and despite the slowness of the mechanism, it seems like
    BJX2 syscalls still manage to be around an order of magnitude faster than Windows syscalls in terms of clock-cycle cost...).


    Well, and the seeming absurdity of WaitForSingleObject() on a mutex
    generally taking upwards of 1 million clock-cycles IIRC in past
    experiments (when the mutex isn't already locked; and, if it is
    locked... yeah...).

    You could lock a mutex... or you could render an entire frame in Doom,
    then checksum the frame image, and use the checksum as a hash key. In a
    roughly similar time-scale.


    Luckily, at least, the CriticalSection objects were not absurdly slow...




    Basically, in this design, it isn't possible to enter a new interrupt
    without first returning from the prior interrupt (at least not without
    f*ing the CPU state). And, as-is, interrupts can only operate in
    physically addressed mode.

    They also need to manually save and restore all the registers, since
    unlike either SuperH or RISC-V, BJX2 does not have any banked
    registers (apart from SP/SSP, which switch places when
    entering/leaving an ISR).

    Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
    8086 real-mode, it doesn't implicitly push anything to the stack (nor
    have an "interrupt vector table").


    So, the interrupt handling is basically a computed branch; which was
    basically about the cheapest mechanism I could come up with at the time.

    Did create a little bit of a puzzle initially as to how to get the CPU
    state saved off and restored with no free registers. Though, there are
    a few CR's which capture the CPU state at the time the ISR happens
    (these registers getting overwritten every time a new interrupt occurs).

    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache lines
    at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.


    Though, it could make sense if there were a mechanism for a context switch
    to dump the whole register file to Block-RAM, and some mechanism to access
    this RAM via an MMIO interface.

    Pros/cons, seems like each possibility would also come with drawbacks:
    As-is: Slowness due to needing to save/reload everything;
    RISC-V: Expensive regfile, only works for limited cases;
    MMIO Backed + RV-like: Faster U<->S, but slower task switching.
    RAM Backed: Cache coherence becomes a critical feature.



    The RISC-V like approach makes sense if one assumes:
    There is a user process;
    There is a kernel running under it;
    We want to call from the user process into the kernel.

    Doesn't make so much sense, say, for:
    User Process A calls a VTable entry which calls into User Process B;
    Service A uses a VTable to call into the VFS;
    ...

    Say, where one is making use of horizontal context switches for control
    flow between logical tasks. Which would still remain fairly expensive
    under a RISC-V like model.

    One could have enough register banks for N logical tasks, but supporting
    4 or 8 copies of the register file is going to cost more than 2 or 3.


    Granted, possibly, handling system calls via using a mechanism along the
    lines of a horizontal context switch, is a bit unusual...

    But, ironically, this sort of ended up seeming like the most
    straightforward approach in my case.


    So, say:
       Interrupt entry:
         Copy low bits of SR into high bits of EXSR;
         Copy PC into SPC.
         Copy fault address into TEA;
         Swap SP and SSP (*1);
         Set CPU flags to Supervisor+ISR mode;
           CPU Mode bits now copied from high bits of VBR.
         Computed branch relative to VBR.
           Offset depends on interrupt category.
       Interrupt return (RTE):
         Copy EXSR bits back into SR;
         Unswap SP/SSP (*1);
         Branch to SPC.

        Interrupt Entry Point::
          // by this point all the old registers have been saved where they
          // are supposed to go, and the interrupt dispatcher registers are
          // already loaded up and ready to go, and the CPU is running at
          // whatever privilege level was specified.
          HR   R1<-WHY
          LD   IP,[IP,R1<<3,InterruptVectorTable] // Call through table
          RTI
    //
    InterruptHandler0:
          // do what is necessary
          // note this can all be written in C
          RET
    InterruptHandler1::


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
    Branch from VBR-table to ISR entry point;
    Get R0 and R1 saved onto the stack;
    Get some of the CRs saved off (we need R0 and R1 free here);
    Get the rest of the GPRs saved onto the stack;
    Call into the main part of the ISR handler (using normal C ABI);
    Restore most of the GPRs;
    Restore most of the CRs;
    Restore R0 and R1;
    Do an RTE.


    If I were to make the ISR mechanism assume that TBR was valid:
    Branch from VBR-table to ISR entry point;
    Get R0/R1/R8/R9 saved onto the stack;
    Load the address of the register-save area from the current TBR;
    Save CRs and GPRs to register save area
    Copy over the values saved onto the stack.
    Call into the main part of the ISR handler (using normal C ABI);
    Restore everything from the potentially new TBR;
    ...

    Pros:
    Could speed up syscalls and task switches;
    No hardware-level changes needed.

    Cons:
    Now the compiler would be hard-coded for TestKern's TBR layout (this
    stuff would need to be baked into the ABI, *).


    *: This structure being comparable to the TEB in Windows (and also holds
    the location to find things like TLS variables and similar).

    It differs slightly from the Windows TEB though:
    The main part is Read-Only in Userland;
    Holds a pointer to a Kernel-Only part;
    This part holds the saved registers.
    Holds another pointer to a User Modifiable part
    This part holds the TLS variables and some execution-state stuff.
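
    A rough C sketch of the split just described; the struct and field names
    here are hypothetical, not the actual layout:

      /* Hypothetical sketch of the TBR-referenced task block: a read-only
         main part pointing at a kernel-only register-save area and a
         user-writable part holding TLS. */
      struct task_kern_part {              /* kernel-only part               */
          unsigned long long regsave[96];  /* ~0.75K of saved registers      */
      };

      struct task_user_part {              /* user-modifiable part           */
          void *tls_base;                  /* TLS variables, exec state, ... */
      };

      struct task_info {                   /* main part, read-only in userland */
          struct task_kern_part *kern;
          struct task_user_part *user;
      };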


    Likely, in C land, might look something like:
    __interrupt __declspec(isr_regsave_tbr) void __isr_syscall(void)
    {
    ...
    }

    With the "__declspec(isr_regsave_tbr)" signaling to BGBCC that it should
    save registers directly into the TBR's register-save area rather than
    onto the ISR stack.

    Should be workable at least under the assumption that no one is going to
    try to invoke a syscall without a valid TBR.




    *1: At the time, couldn't figure a good way to shave more logic off
    the mechanism. Though, now, the most obvious candidate would be to
    eliminate the implicit SP/SSP swapping (this part is currently handled
    in the instruction decoder).

    So, instead, the ISR entry point would do something like:
       MOV    SP, SSP
       MOV    0xDE00, SP  //Designated ISR stack SRAM
       MOV.Q  R0, (SP, 0)
       MOV.Q  R1, (SP, 8)
       ... Now save off everything else ...

    But, didn't really think of it at the time.


    There is already the trick of requiring VBR to be aligned (currently
    64B in practice; formally 256B), mostly so as to allow the "address
    computation" to be done via bit-slicing.

    Not sure if many CPUs have a cheaper mechanism here...

    Treat the CPU state and the register state as cache lines and have
    HW shuffle them in and out. You can even start the 5 cache line reads
    before you start the CPU state writes; saving latency (which you cannot
    do using SW-only methods).


    I meant hardware-side cost.

    But, yeah, software-side could be a fair bit faster...


    Note that in my case, generally the interrupt handlers are written in
    C, with the compiler managing all the ISR prolog/epilog stuff (mostly
    saving/restoring pretty much the entire CPU state to the ISR stack).

    My 66000 compiler remains blissfully ignorant of ISR prologue and
    epilogue and it still works.

    Generally, the ISR's also need to deal with having a comparably small
    stack (with 0.75K already used for the saved CPU state).

    Where:
       0000..7FFF: Boot ROM
       8000..BFFF: (Optional) Extended Boot ROM
       C000..DFFF: Boot/ISR SRAM
       E000..FFFF: (Optional) Extended SRAM

    Generally, much of the work of the context switch is pulled off using
    "memcpy" calls (with the compiler providing a special "__arch_regsave"
    variable giving the address of the location it has dumped the CPU
    registers into; which in turn covers most of the core state that needs
    to be saved/restored for a process context switch).

    Why not just make the HW push and pull cache lines.


    My current prediction is that the mechanism for doing this would make
    the register file significantly more expensive, along with making for
    more serious problems related to memory coherence if the CPU tries to
    touch any of this (unlike the RAM-backed VRAM, I can't hand-wave this,
    if things don't go perfectly, stuff is gonna explode).


    Granted, going "true multicore" is likely to require addressing the
    cache coherence issues somehow (likely needing to manually invoke cache
    flushes to deal with multithreaded code isn't really going to fly).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Nov 23 16:53:04 2023
    BGB wrote:

    On 11/22/2023 12:38 PM, MitchAlsup wrote:
    BGB wrote:

    Yeah, but that is not exactly minimalist in terms of the hardware.

    Granted, burning around 1 kilocycle of overhead per syscall isn't ideal either...


    Eg:
    Save registers to ISR stack;
    Copy registers to User context;
    Copy handler-task registers to ISR stack;
    Reload registers from ISR stack;
    Handle the syscall;
    Save registers to ISR stack;
    Copy registers to Syscall context;
    Copy User registers to ISR stack;
    Reload registers from ISR stack.


    Does mean that one needs to be economical with syscalls (say, doing
    "printf" a whole line at a time, rather than individual characters, ...).

    Not at all--I have reduced SysCalls to just a bit slower than an actual CALL,
    say around 10 cycles. Use them as often as you like.

    And, did create incentive to allow getting the microsecond-clock value
    and hardware RNG values from CPUID rather than needing a syscall (say,
    don't want to burn 20us to check the microsecond counter, ...).


    If the "memcpy's" could be eliminated, this could roughly halve the cost
    of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.

    One other option would be to do like RISC-V's privileged spec and have multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache lines; There is a 5th cache line that contains all the other PSW stuff.

    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are still fairly common (and despite the slowness of the mechanism, it seems like
    BJX2 syscalls still manage to be around an order of magnitude faster than Windows syscalls in terms of clock-cycle cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache lines
    at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
    of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.
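
    Spelling out the arithmetic (assuming 64-bit registers and 64-byte lines,
    which matches the "block of 4 cache lines" figure): 32 regs x 8 B = 256 B
    = 4 lines per file, plus the one extra line of PSW state, so 5 lines of
    traffic per switch; four banked copies would instead mean 4 x 4 = 16 lines
    of state resident in the core.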

    Though, it could make sense if there were a mechanism for a context switch
    to dump the whole register file to Block-RAM, and some mechanism to access
    this RAM via an MMIO interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.

    Pros/cons, seems like each possibility would also come with drawbacks:
    As-is: Slowness due to needing to save/reload everything;
    RISC-V: Expensive regfile, only works for limited cases;
    MMIO Backed + RV-like: Faster U<->S, but slower task switching.
    RAM Backed: Cache coherence becomes a critical feature.



    The RISC-V like approach makes sense if one assumes:
    There is a user process;
    There is a kernel running under it;
    We want to call from the user process into the kernel.

    So if you are running under a Real OS you don't need 2 sets of RFs in my
    model.

    Doesn't make so much sense, say, for:
    User Process A calls a VTable entry which calls into User Process B;
    Service A uses a VTable to call into the VFS;
    ...

    Say, where one is making use of horizontal context switches for control
    flow between logical tasks. Which would still remain fairly expensive
    under a RISC-V like model.

    Yes, but PTHREADing can be done without privilege and in a single instruction.

    One could have enough register banks for N logical tasks, but supporting
    4 or 8 copies of the register file is going to cost more than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
    Branch from VBR-table to ISR entry point;
    Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??

    Get some of the CRs saved off (we need R0 and R1 free here);
    Get the rest of the GPRs saved onto the stack;
    Call into the main part of the ISR handler (using normal C ABI);
    Restore most of the GPRs;
    Restore most of the CRs;
    Restore R0 and R1;
    Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
    Branch from VBR-table to ISR entry point;
    Call into the main part of the ISR handler (using normal C ABI);
    Do an RTE.

    See what it saves ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Thu Nov 23 19:17:14 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-21 5:12 p.m., MitchAlsup wrote:

    In My 66000, every <effective> SysCall goes deeper into the privilege
    hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV, Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    Does it follow the same way for hardware interrupts? I think RISCV goes
    to the deepest level first, machine level, then redirects to lower
    levels as needed. I was planning on Q+ operating the same way.

    It depends; there is the school of thought that you just deliver control to
    someone who can always deal with it (Machine level in RISC-V), and there
    is the other school of thought that some table should encode which level
    of the system control is delivered to. The former allows SW to control
    every step of the process, the latter gets rid of all the SW checking
    and simplifies the process of getting to and back from interrupt handlers
    (and their associated soft IRQs).

    ARMv8 allows the interrupt and fast interrupt (IRQ, FIQ) signals to be delivered to the EL1 (operating system) ring unless system registers at
    higher (more privileged) exception levels trap the signal. EL3 (firmware) level is the most privileged level and generally 'owns' the FIQ signal,
    while the IRQ signal is owned by EL1 (bare metal OS) or EL2 (hypervisor).

    The destination exception level of each signal is controlled by
    bits in system registers (SCR_EL3 to direct them to EL3, HCR_EL2 to
    direct them to EL2).

    Interrupts can be assigned to one of two groups - group 0 which is
    always delivered as an FIQ and group 1 which is delivered as an IRQ.

    Group zero interrupts are considered "secure" interrupts and only
    secure accesses can modify the configuration of such interrupts.

    Group one interrupts can be either non-secure or secure depending on
    the security state of the target exception level (secure or non-secure).

    The higher priority half of the interrupt priority (8 bits) is considered
    a secure range, the rest non-secure, thus secure interrupts will always have higher priority than non-secure interrupts.
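
    In other words, with an 8-bit priority field (0 numerically highest), the
    split described above amounts to something like the following check;
    illustrative only, not an actual GIC register interface:

      /* Upper-priority half 0x00..0x7F = secure range, 0x80..0xFF = non-secure. */
      #include <stdbool.h>
      #include <stdint.h>

      static bool prio_is_secure_range(uint8_t prio)
      {
          return prio < 0x80;   /* lower value = higher priority */
      }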

    There is no software "checking" required.

    Exception return (i.e. context switch) loads the PSR from SPSR_ELx and
    the PC from ELR_ELx[*] and that's the entirety of the software visible state handled by the hardware. Each exception level has its own page table
    root registers (TTBR0_ELx, TTBR1_ELx for each half of the VA space), so
    there is nothing for software to reload. Hardware manages the TLB entries which are tagged with both security state and exception level.

    [*] Both are system registers (flops, not ram)

    [**] The secure flag (!SCR_EL3[NS]) acts like an 'invisible'
    address bit at bit N (where N is the number of bits of supported
    physical address). This provides two completely distinct N-bit
    address spaces - one secure and one non-secure with SCR_EL3[NS]
    controlling which space is used by accesses. NS only applies
    to EL 0 - 2, EL3 is always considered secure. N is typically 48,
    but can be up to 52 in the current versions of the architecture.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Thu Nov 23 21:08:45 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:


    I have a Linux friendly version where context switch is a single instruction.


    The Burroughs B3500 had a single such instruction, called
    Branch Reinstate (BRE).

    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV};
    each privilege level has its own {IP, RF, Root Pointer, CSP, Exception
    {Enabled, Raised}}, and a few more things, contained in 5 contiguous cache
    lines.

    The 4 privilege levels each have a pointer to those 5 cache lines. By
    writing the control register (HR instruction) one can change the control
    point for each level (of course you have to have appropriate permission)--
    but I decided that a user should have the ability to context switch to
    another user without needing OS intervention--thus pthreads do not need
    an excursion through the Guest OS to switch threads under the same memory
    map {but do when crossing processes}.

    Thus, all 4 privileges are always resident in the privilege hierarchy
    at the cost of 4 DoubleWord registers instead of at the cost of 4 RFs.
    With these levels all resident simultaneously, no table surgery is needed
    to switch levels {Root pointers, MTRR,...} and no RF save/restore is
    needed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Thu Nov 23 20:46:38 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:


    I have a Linux friendly version where context switch is a single instruction.


    The Burroughs B3500 had a single such instruction, called
    Branch Reinstate (BRE).

    The task context (base register, limit register, accumulator, comparison
    and overflow flags) were stored in small region at absolute address 60
    and BRE would restore that state (and interrupts would save it).
    Index registers were mapped to base-relative addresses 8, 16 and 24
    (8 digits each).

    The V-Series did a complete revamp of the processor architecture to
    support larger memory sizes (both per task and systemwide) and
    SMP. A segmentation scheme was adopted (for backward compatibility)
    and seven additional base-limit pairs were added to support direct
    access to 8 segments at any time (called an environment). There
    could be up to 1,000,000 environments per task, each with up to
    8 active memory areas (and 92 inactive memory areas accessible to
    three special instructions for data movement and comparison).

    The instruction was renamed Branch Reinstate Virtual (BRV) and would
    read the task table entry and load all the relevent state, including
    loading the active environment table into the processor base-limit
    registers. BRV accessed a table in memory, indexed by task number,
    that stored all the state of the task (200 digits worth).

    At the same time, we added SMP support including an inter-cpu
    communication instruction (my invention) similar to the
    mechanism adopted a few years later when Intel added SMP
    support for P5.

    We also added hardware mutex and condition variable instructions;
    the "LOK" instruction would atomically acquire the mutex, if
    available, or interrupt to a microkernel scheduler if unavailable.
    "UNLK" would interrupt if a higher priority task was waiting
    for the lock. There were CAUS and WAIT instructions that
    offered capabilities similar to posix condition variables.

    Each defined lock had a canonical lock level (a 4 digit
    number) and the hardware would fail a lock request where
    the new lock canonical lock number is less than the current
    lock owned by the task (if any). Unlock enforced the
    reverse. This prevented any A-B deadlock situations from
    occurring, although with many locks in a large subsystem (e.g
    the MCP OS) it was tricky sometimes to assign lock numbers.
    This also implicitly encouraged programmers to minimize
    the critical section and avoid nested locking where possible.
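
    A sketch of the canonical-lock-level rule being described (just the
    acquire-side check, ignoring the interrupt-to-microkernel path; names are
    hypothetical, and the real B3500/V-Series check was in hardware):

      /* A task may only acquire locks in non-decreasing canonical-level
         order, which rules out A-B deadlock cycles. */
      #include <stdbool.h>

      struct task {
          int cur_lock_level;   /* level of the lock currently owned, -1 if none */
      };

      bool lock_acquire_allowed(const struct task *t, int new_lock_level)
      {
          /* fail the request if the new lock's canonical number is less
             than that of the lock the task already owns (if any) */
          return new_lock_level >= t->cur_lock_level;
      }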

    The microkernel only handled scheduling and interrupts, all
    MCP code ran in the context of either the task making the
    request, or in an 'independent runner' (a kernel thread)
    dispatched from the microkernel. I/O interrupts were dispatched
    to two different independent runners, one for normal interrupts
    and one for real-time interrupts. Real-time interrupts were
    used for document sorters (e.g. MICR reader/sorters processing checks/cheques/utility bills, etc) in order to be able to
    select the destination pocket for each document in the
    time interval from the read station to the pocket-select
    station (at 2500 documents per minute - 42 per second,
    one document every 24 milliseconds). We supported ten
    active sorters per host. Even had one host installed
    on an L-1011 with reader/sorters that processed
    checks on coast-to-coast overnight flights.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to MitchAlsup on Thu Nov 23 17:13:03 2023
    On 11/23/23 4:08 PM, MitchAlsup wrote:
    [snip]
    The 4 privilege levels, each, have a pointer to those 5 cache
    lines. By writing the control register (HR instruction) one
    can change the control point for each level (of course you
    have to have appropriate permission-- but I decided that a
    user should have the ability to context switch to another
    user without needing OS intervention--thus pthreads do not
    need an excursion through the Guest OS to switch threads
    under the same memory map {but do when crossing processes}.

    My 66000 also has Port Holes, which seem to offer some
    cross-protection-domain access.

    While not significantly helpful, I also wonder if privilege
    reducing operations could be lower cost by not involving the
    OS. This would require the OS to store the allowed privilege
    elsewhere, but this might be done anyway. It would also have
    little use (I suspect) and still require OS involvement to
    restore privilege. There might be some cases where privilege
    is only needed in an initialization stage, but that seems
    likely to be rare.

    Writing to the accessed and dirty bits of a PTE would also
    seem to be something that could, in theory, be allowed to a
    user-level process. Clearing the dirty bit could be dangerous
    if stale data was from another protection domain. Clearing
    the accessed bit would seem to only "strongly hint" that the
    page be victimized earlier; setting the dirty bit would not
    be different than a "silent store" [not useful it seems since
    a load/store instruction pair could accomplish the same] and
    setting the accessed bit would seem the same as performing a
    non-caching load to any location in the page acting as a
    "keep me" hint [probably not useful]. Even with this little
    thought, allowing these PTE changes seems not worthwhile.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Thu Nov 23 15:53:47 2023
    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:

    On 11/22/2023 12:38 PM, MitchAlsup wrote:
    BGB wrote:

    Yeah, but that is not exactly minimalist in terms of the hardware.

    Granted, burning around 1 kilocycle of overhead per syscall isn't
    ideal either...


    Eg:
       Save registers to ISR stack;
       Copy registers to User context;
       Copy handler-task registers to ISR stack;
       Reload registers from ISR stack;
       Handle the syscall;
       Save registers to ISR stack;
       Copy registers to Syscall context;
       Copy User registers to ISR stack;
       Reload registers from ISR stack.


    Does mean that one needs to be economical with syscalls (say, doing
    "printf" a whole line at a time, rather than individual characters, ...).

    Not at all--I have reduced SysCalls to just a bit slower than an actual CALL, say around 10 cycles. Use them as often as you like.


    OK.

    Well, they aren't very fast in my case, in any case.


    And, did create incentive to allow getting the microsecond-clock value
    and hardware RNG values from CPUID rather than needing a syscall (say,
    don't want to burn 20us to check the microsecond counter, ...).


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.
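
    For a rough sense of scale (assuming those figures): ~300 MB/s at 50 MHz
    is about 6 bytes per cycle, so if the copies total around 1.5 kB (the
    figure estimated later for the thread-switch case), the memcpy work alone
    is on the order of 250 cycles, before any of the other prolog/epilog work.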


    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and have
    multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache lines; There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, where there are User/Supervisor/Machine sets, with the mode affecting which of these is visible.

    Obvious drawback in my case is that this would effectively increase the
    number of internal GPRs from 64 to 192 (and, at that point, may as well
    go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial solution
    that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
    SH-4 banked out R0..R7 when entering an interrupt;
    The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of
    interrupt. Almost like they designed their interrupt mechanism around a particular chunk of ASM code or something. In my case, I kept a similar
    idea, but just used a fixed 8-byte spacing, with the idea of these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
    have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch, but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    I think at one point, I had considered having tasks have both User and Supervisor state (with two stacks and two copies of all the registers),
    but ended up not going this way (and instead giving the syscalls their own designated task context; which also saves on per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude" may
    have actually been a little conservative; seems like Windows syscalls
    may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take millions
    of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
    of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.
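
    As a rough worked figure (assuming 64 GPRs and a dozen-odd CRs at 8 bytes
    each): that is around 600 bytes of state, so even at the best case of 128
    bits (16 bytes) per cycle, a hardware save or restore would still be on
    the order of 40 cycles each way, ignoring any cache misses.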


    One bit of trickery would be, "what if" the Boot SRAM region were inside
    the L1 cache rather than out on the ringbus?...

    But, then one would have the cost of keeping 8K of SRAM close to the CPU
    core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any case...).

    Though keeping it tied to a specific CPU core (and effectively processor
    local) would avoid the ugly "what if" scenario of two CPU cores trying
    to service an interrupt at the same time and potentially stepping on
    each others' stacks. The main tradeoff vs putting the stacks in DRAM is
    mostly that DRAM may have (comparably more expensive) L2 misses.


    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code from
    the ISR stack or similar.


    Though, it could make sense if there were a mechanism for a context switch
    to dump the whole register file to Block-RAM, and some mechanism to access
    this RAM via an MMIO interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this route (baking parts of it into the C ABI is at least "slightly" less evil).

    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.


    I guess I can probably safely rule out MMIO under the basis that context switching via moving registers via MMIO would be slower than the current mechanism (of using a series of Load/Store instructions).


    Pros/cons, seems like each possibility would also come with drawbacks:
       As-is: Slowness due to needing to save/reload everything;
       RISC-V: Expensive regfile, only works for limited cases;
       MMIO Backed + RV-like: Faster U<->S, but slower task switching.
       RAM Backed: Cache coherence becomes a critical feature.



    The RISC-V like approach makes sense if one assumes:
       There is a user process;
       There is a kernel running under it;
       We want to call from the user process into the kernel.

    So if you are running under a Real OS you don't need 2 sets of RFs in my model.


    OK.


    Whether or not my "OS" is "Real" is still a bit debatable.
    From what I can tell, it is sort of loosely in Win 3.x territory (at best).


    As-in, can have multiple tasks and task switching, memory protection is
    rather lacking, and still using cooperative scheduling (preemptive has
    been experimented with, but at the moment is prone to cause stuff to
    explode; I will need to "sort stuff out a bit more" and add things like
    mutex locks around various things before this point).


    Main obvious difference is:
      while(cond)
      {
        thrd_yield();
        cond=some_check();
      }
    Is OK, but:
      while(cond)
        cond=some_check();

    May potentially lock up the OS if it gets stuck in an infinite loop.


    In my current "GUI experiments", its stability is an almost comedic
    level of badness (to what extent things work at all).


    But, then again, Win3.x in DOSBox is not exactly "rock solid" either, so
    even as primitive as it is, it seems "almost within reach". Like, "It
    may work, it may cause the video driver to corrupt itself (leading to a
    screen of indecipherable garbage or similar), or the Windows install
    might just decide to corrupt its files badly enough that one has to
    reinstall it to make it work again, ...".


    Though, ironically, I am still left making some uses of 16 color BMP
    images and CRAM and similar. Though, slightly atypical, in that I am
    using CRAM as a still image format, and hacked things so that both
    formats can support transparency.

    Say: 16-color BMP: The "High Intensity Magenta" color can be used as a transparent color if needed. For 8-bit CRAM, a 256-color palette is
    used, with one of the colors (0x80 in this case) being used as a
    transparent color.

    Note that "actual Windows" can't load these CRAM BMP's (but, also can't
    load a few of the "should work" formats either; like 2-bpp images or the
    older BITMAPCOREHEADER format).

    Then again, one could argue, maybe it doesn't make much sense for modern programs to be able to load formats that haven't seen much use since the
    days of CGA and Windows 1.x ?...



    Doesn't make so much sense, say, for:
       User Process A calls a VTable entry which calls into User Process B;
       Service A uses a VTable to call into the VFS;
       ...

    Say, where one is making use of horizontal context switches for
    control flow between logical tasks. Which would still remain fairly
    expensive under a RISC-V like model.

    Yes, but PTHREADing can be done without privilege and in a single instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler
    changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the moment,
    I am using a fairly simplistic round-robin scheduling strategy, so the scheduler mostly starts at a given PID, and looks for the next PID that
    holds a valid/running task (wrapping back to PID 1 if it hits the end,
    and stopping the search if it gets back to the original PID).


    The high-level threading model wasn't based on pthreads in my case, but
    rather C11 threads (and had implemented a lot of the "threads.h" stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads were
    a better fit.


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering the registers in the instruction decoder).

    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
    SP is mapped into R15 in the GPR space;
    SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.


    If I were eliminate this implicit register-swap mechanism, then the ISR
    entry would likely need to reload a constant address each time. Though,
    this change would also break binary compatibility with my existing code.

    But, in theory, eliminating the register swap could allow demoting SP to
    being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode is expensive.


    Though, some of my more recent design ideas would have gone over to an
    ordering slightly more like RISC-V, say:
    R0: ZR or PC (ALU or MEM)
    R1: LR or TBR (ALU or MEM)
    R2: SP
    R3: GP (GBR)
    R4 -R15: Scratch
    R16-R31: Callee Save
    R32-R47: Scratch
    R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    Though, if one assumes R4..R63 are GPRs, this would allow both this ISA
    and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC to
    be able to follow RISC-V's C ABI rules would be a non-trivial level of
    effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
    ALU or similar: ZR and LR (Zero and Link Register)
    Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a Read-Only register (allowing userland to modify TBR would effectively also allow
    userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder based
    on CPU mode isn't entirely free in terms of LUT cost or timing latency
    (even if it only applies to a subset of the register space).

    Note that for RV decoding:
    X0..X31 -> R0 ..R31 (more or less)
    F0..F31 -> R32..R63
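
    So the decode-side mapping is basically just an offset applied to the FPR
    numbers; illustrative sketch only:

      /* Map RISC-V architectural register numbers onto the single 64-entry
         internal register file described above. */
      static inline int rv_to_internal_reg(int rnum, int is_fpr)
      {
          return (is_fpr ? 32 : 0) + (rnum & 31);  /* X0..X31 -> R0..R31,
                                                       F0..F31 -> R32..R63 */
      }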

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some deviation
    in some areas (will it care that the high 32 bits of a Binary32 register
    don't hold NaN? Will it care about the extra funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost, and clock-cycle savings?...


    As-is, I can't come up with much that is both:
    Fairly cheap to implement in hardware;
    Would save a lot of clock-cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).


    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Say:
    BJX2: 64 GPRs, ~ 14 CRs in use.
    Some of the CRs defined (like the SMT set) don't currently exist.
    TEAH is specific to Addr96 mode;
    VIPT doesn't currently exist
    Will only exist if/when inverted page tables are added.
    STTB exists but isn't currently being used
    Was intended for supervisor-mode page tables;
    But, N/A if Supervisor Mode is reached via a task switch...

    RISC-V: 3x ( 32 GPRs + 32 FPRs), 3x a bunch of CSRs.
    So, theoretically, 192 registers, plus a bunch more CSRs.
    Nevermind that the 'V' extension would add more registers.
    Would we also need 3 copies of all the Vector registers, ... ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Nov 23 23:30:50 2023
    BGB wrote:

    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.

    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and have
    multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache lines; There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, there are User/Supervisor/Machine sets, with the mode affecting which of these is visible.

    Obvious drawback in my case is that this would effectively increase the number of internal GPRs from 64 to 192 (and, at that point, may as well
    go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    Decode is not the problem, sensing 1:256 is a big problem, in practice
    even SRAMs only have 32-pairs of cells on a bit line using exotic timed
    sense amps.
    {{Decode is almost NEVER the logic delay problem:: ½ is situation recognition, the other ½ is fan-out buffering--driving the lines into the decoder is more gates of delay than determining if a given select line should be asserted.}}

    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial solution
    that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
    SH-4 banked out R0..R7 when entering an interrupt;
    The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of interrupt. Almost like they designed their interrupt mechanism around a particular chunk of ASM code or something. In my case, I kept a similar
    idea, but just used a fixed 8-byte spacing, with the idea of these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
    have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch,

    but why ?? the probability that control returns from a given IST to its
    softIRQ is less than ½ in a loaded system.

    but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    I think at one point, I had considered having tasks have both User and Supervisor state (with two stacks and two copies of all the registers),
    but ended up not going this way (and instead giving the syscalls their own designated task context, which also saves on per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude" may
    have actually been a little conservative; seems like Windows syscalls
    may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take millions
    of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
    of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    6R6W RFs are as big as one can practically build. You can get as much
    Read BW by duplication, but you only have "so much" Write BW (even when
    you know each write is to a different register).

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.

    6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.

    One bit of trickery would be, "what if" the Boot SRAM region were inside
    the L1 cache rather than out on the ringbus?...

    2 things::
    a) By giving threadstate an address you gain the ability to load the
    initial RF image from ROM as the CPU comes out of reset--it comes out
    with a complete RF, a complete thread.header, mapping tables, privilege
    and priority.
    b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
    state (no underlying DRAM address available) so you have ~1MB to play around with until you find DRAM, configure, initialize, and put it in the free pool.)
    So, here, you HAVE "enough" storage to program BOOT activities in a HLL
    (of your choice).

    But, then one would have the cost of keeping 8K of SRAM close to the CPU
    core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any case...).

    Is the Icache and Dcache not close enough ?? If not then add L2 !!

    Though keeping it tied to a specific CPU core (and effectively processor local) would avoid the ugly "what if" scenario of two CPU cores trying
    to service an interrupt at the same time and potentially stepping on
    each others' stacks. The main tradeoff vs putting the stacks in DRAM is mostly that DRAM may have (comparably more expensive) L2 misses.

    The interrupt (re)mapping table takes care of this prior to the CPU being bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
    table associated with the "Originating" thread. (IO/-MMU). That interrupt
    is logged into the table and if enabled its priority is used to determine
    which set of CPUs should be bothered, the affinity mask of the "Originating" thread is used to qualify which CPU from the priority set, and one of these
    is selected. The selected CPU is tapped on the shoulder, and sends a get-Interrupt request to the Interrupt table logic which sends back the priority and number of a pending interrupt. If the CPU is still at lower priority
    than the returning interrupt, the CPU <at this point> stops running code
    from the old thread and begins running code on the new thread.
    {{During the sending of the interrupt to the CPU and the receipt of the claim-Interrupt message, that interrupt will not get handed to any other
    CPU}} So, the CPU continues to run instructions while the CPUs contend
    for and claim unique interrupts. There are 512 unique interrupts at each of
    64 priority levels, and each process can have its own Interrupt Table.
    These tables need no maintenance except when interrupts are created and destroyed.}}

    HV, Guest HV, Guest OS each have their own unique interrupt tables;
    Although it could be arranged such that all could use the same table.

    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code from
    the ISR stack or similar.


    Though, could make sense if one has a mechanism where a context switch
    could have a mechanism to dump the whole register file to Block-RAM,
    and some sort of mechanism to access this RAM via an MMIO interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this route (baking parts of it into the C ABI is at least "slightly" less evil).

    My mechanism is taking that struct task.....s (at least the part HW
    needs to understand) and associating each one into a table that points
    at DRAM. Now, when you want this thread to run, you load up the pointer
    set the e-bit (enabled) and write it into the current header at its
    privilege level. Poof--all 5 cache lines of state from the currently
    running thread goes back to where it permanent home in DRAM is, and
    the new thread fetches 5 cache lines of state of the new thread.
    a) you can start the reads before you start the writes
    b) you can start the writes anytime you have outbound access to "the bus"
    c) the writes can be no later than the ½ cycle before the reads get written. Which is a lot faster than you can do in SW with LDs and STs.
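    To make the "4 cache lines of registers plus a 5th of PSW state"
    picture concrete, here is a rough C mock-up, assuming 64-byte lines and
    32 64-bit registers; the field names are invented for illustration and
    are not taken from any MY 66000 document:

        #include <stdint.h>

        struct thread_state {
            uint64_t gpr[32];      /* 32 x 8 bytes = 256 bytes = 4 cache lines */
            uint64_t ip;           /* 5th line: PSW-like control state         */
            uint64_t psw;
            uint64_t root_ptr;     /* e.g. a mapping-table pointer             */
            uint64_t pad[5];       /* fill the 5th line out to 64 bytes        */
        };
        /* 5 x 64 = 320 bytes of state per thread */
        _Static_assert(sizeof(struct thread_state) == 5 * 64, "5 cache lines");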

    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.

    I config-space mapped all my CRs, so you get an unlimited number of them.

    I guess I can probably safely rule out MMIO, on the basis that context switching by moving registers via MMIO would be slower than the current mechanism (of using a series of Load/Store instructions).
    .................
    Yes, but PTHREADing can be done without privilege and in a single
    instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    In my case it is about MemoryLatency+5 cycles.

    Yes, thread switch is a 1-way function--which is the reason you can
    allow a user to preempt himself and allow a compatriot to run in his
    place.....

    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    My Real Time version of MY 66000 does 10-ish cycle context switch
    (as seen at the CPU) but here a hunk of HW has gathered up those 5 cache
    lines and sent them to the targeted CPU and all the CPU has to do is push
    out the old state (5 cache lines). So the data was heading towards the CPU before the CPU even knew it wanted that data !!

    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the moment,
    I am using a fairly simplistic round-robin scheduling strategy, so the scheduler mostly starts at a given PID, and looks for the next PID that
    holds a valid/running task (wrapping back to PID 1 if it hits the end,
    and stopping the search if it gets back to the original PID).
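    A minimal sketch of that round-robin search (assuming a
    task_is_runnable() check that stands in for whatever the real scheduler
    tests, and with PID 0 reserved):

        extern int task_is_runnable(int pid);   /* hypothetical predicate */

        int find_next_pid(int cur_pid, int max_pid)
        {
            int pid = cur_pid;
            do {
                pid = (pid >= max_pid) ? 1 : pid + 1;  /* wrap back to PID 1 */
                if (task_is_runnable(pid))
                    return pid;
            } while (pid != cur_pid);
            return cur_pid;   /* nothing else runnable, keep the current task */
        }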


    The high-level threading model wasn't based on pthreads in my case, but rather C11 threads (and had implemented a lot of the "threads.h" stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads were
    a better fit.
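    For reference, the C11 <threads.h> interface mentioned here is small
    enough that a pthread-style wrapper is mostly a thin shim over it; a
    minimal usage example:

        #include <threads.h>
        #include <stdio.h>

        static int worker(void *arg)
        {
            printf("hello from thread %d\n", *(int *)arg);
            return 0;
        }

        int main(void)
        {
            thrd_t t;
            int id = 1, res;
            if (thrd_create(&t, worker, &id) != thrd_success)
                return 1;
            thrd_join(t, &res);    /* join and collect the return value */
            return res;
        }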


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering the registers in the instruction decoder).

    So, in effect, you actually have 33 registers with only 32 visible at
    any instant. I am just so glad not to have gone down that rabbit hole
    this time......

    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
    SP is mapped into R15 in the GPR space;
    SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.
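    Conceptually (not actual decoder logic), the SP/SSP swap amounts to
    something like the following; the SSP index used here is an illustrative
    slot in the CR space, not the real one:

        #define REG_SP   15      /* architectural R15 (SP in the GPR space)     */
        #define REG_SSP  0x6F    /* made-up physical index somewhere in CR space */

        static inline int phys_reg(int arch_reg, int in_isr)
        {
            /* While in an ISR, references to SP are redirected to SSP. */
            if (arch_reg == REG_SP && in_isr)
                return REG_SSP;
            return arch_reg;
        }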


    If I were to eliminate this implicit register-swap mechanism, then the ISR
    entry would likely need to reload a constant address each time. Though,
    this change would also break binary compatibility with my existing code.

    But, in theory, eliminating the register swap could allow demoting SP to being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode is expensive.


    Though, some of my more recent design ideas would have gone over to an ordering slightly more like RISC-V, say:
    R0: ZR or PC (ALU or MEM)
    R1: LR or TBR (ALU or MEM)
    R2: SP
    R3: GP (GBR)
    R4 -R15: Scratch
    R16-R31: Callee Save
    R32-R47: Scratch
    R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    R0:: GPR, Return Address, proxy for IP, proxy for 0
    R1..R9 Arguments and results passed in registers
    R10..R15 Temporary Registers (scratch)
    R16..R29 Callee Save
    R30 FP when in use, Callee Save
    R31 SP

    Though, if one assumes R4..R63 are GPRs, this would allow both this ISA
    and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC to
    be able to follow RISC-V's C ABI rules would be a non-trivial level of effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
    ALU or similar: ZR and LR (Zero and Link Register)
    Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a Read-Only register (allowing userland to modify TBR would effectively also allow userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder based
    on CPU mode isn't entirely free in terms of LUT cost or timing latency
    (even if it only applies to a subset of the register space).

    Note that for RV decoding:
    X0..X31 -> R0 ..R31 (more or less)
    F0..F31 -> R32..R63

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some deviation
    in some areas (will it care that the high 32 bits of a Binary32 register don't hold NaN? Will it care about the extra funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost,

    the HW cost has already been purchased by the state machine that
    writes out 5-cache lines and waits for 5-cache lines to arrive.

    and clock-cycle savings?...
    The reads can arrive before you start the writes, you can go so far
    as to organize your pipeline so the read data being written pushes
    out the write data that needs to return to memory, making the timing
    brain dead easy to achieve.


    As-is, I can't come up with much that is both:
    Fairly cheap to implement in hardware;
    Would save a lot of clock cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).

    Yet, you seem to be buying insurance as if you might need to head in that direction.

    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Thu Nov 23 21:36:41 2023
    On 2023-11-23 6:30 p.m., MitchAlsup wrote:
    BGB wrote:

    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.

    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and
    have multiple copies of the register file (and likely instructions
    for accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache
    lines;
    There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, there are
    User/Supervisor/Machine sets, with the mode affecting which of these
    is visible.

    Obvious drawback in my case is that this would effectively increase
    the number of internal GPRs from 64 to 192 (and, at that point, may as
    well go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    Decode is not the problem, sensing 1:256 is a big problem, in practice
    even SRAMs only have 32-pairs of cells on a bit line using exotic timed
    sense amps.
    {{Decode is almost NEVER the logic delay problem:: ½ is situation recognition,
    the other ½ is fan-out buffering--driving the lines into the decoder is
    more
    gates of delay than determining if a given select line should be
    asserted.}}

    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial
    solution that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
       SH-4 banked out R0..R7 when entering an interrupt;
       The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of
    interrupt. Almost like they designed their interrupt mechanism around
    a particular chunk of ASM code or something. In my case, I kept a
    similar idea, but just used a fixed 8-byte spacing, with the idea of
    these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a
    dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
    which would have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch,

    but why ?? the probability that control returns from a given IST to its softIRQ is less than ½ in a loaded system.

                                                    but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    I think at one point, I had considered having tasks have both User and
    Supervisor state (with two stacks and two copies of all the
    registers), but ended up not going this way (and instead giving the
    syscalls their own designated task context, which also saves on
    per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle
    cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude"
    may have actually been a little conservative; seems like Windows
    syscalls may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take
    millions of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in
    physical memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    6R6W RFs are as big as one can practically build. You can get as much
    Read BW by duplication, but you only have "so much" Write BW (even when
    you know each write is to a different register).

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.

    6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.

    One bit of trickery would be, "what if" the Boot SRAM region were
    inside the L1 cache rather than out on the ringbus?...

    2 things::
    a) By giving threadstate an address you gain the ability to load the
    initial RF image from ROM as the CPU comes out of reset--it comes out
    with a complete RF, a complete thread.header, mapping tables, privilege
    and priority.
    b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
    state (no underlying DRAM address available) so you have ~1MB to play around with until you find DRAM, configure, initialize, and put it in the free pool.)
    So, here, you HAVE "enough" storage to program BOOT activities in a HLL
    (of your choice).

    But, then one would have the cost of keeping 8K of SRAM close to the
    CPU core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any
    case...).

    Is the Icache and Dcache not close enough ?? If not then add L2 !!

    Though keeping it tied to a specific CPU core (and effectively
    processor local) would avoid the ugly "what if" scenario of two CPU
    cores trying to service an interrupt at the same time and potentially
    stepping on each others' stacks. The main tradeoff vs putting the
    stacks in DRAM is mostly that DRAM may have (comparably more
    expensive) L2 misses.

    The interrupt (re)mapping table takes care of this prior to the CPU being bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
    table associated with the "Originating" thread. (IO/-MMU). That interrupt
    is logged into the table and if enabled its priority is used to determine which set of CPUs should be bothered, the affinity mask of the "Originating"
    thread is used to qualify which CPU from the priority set, and one of these is selected. The selected CPU is tapped on the shoulder, and sends a get-Interrupt request to the Interrupt table logic which sends back the priority
    and number of a pending interrupt. If the CPU is still at lower priority
    than the returning interrupt, the CPU <at this point> stops running code
    from the old thread and begins running code on the new thread.
    {{During the sending of the interrupt to the CPU and the receipt of the claim-Interrupt message, that interrupt will not get handed to any other CPU}} So, the CPU continues to run instructions while the CPUs contend
    for and claim unique interrupts. There are 512 unique interrupts at each of
    64 priority levels, and each process can have its own Interrupt Table.
    These tables need no maintenance except when interrupts are created and destroyed.}}

    HV, Guest HV, Guest OS each have their own unique interrupt tables;
    Although it could be arranged such that all could use the same table.

    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code
    from the ISR stack or similar.


    Though, could make sense if one has a mechanism where a context
    switch could have a mechanism to dump the whole register file to
    Block-RAM, and some sort of mechanism to access this RAM via an MMIO
    interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this
    route (baking parts of it into the C ABI is at least "slightly" less
    evil).

    My mechanism is taking that struct task.....s (at least the part HW
    needs to understand) and associating each one into a table that points
    at DRAM. Now, when you want this thread to run, you load up the pointer
    set the e-bit (enabled) and write it into the current header at its
    privilege level. Poof--all 5 cache lines of state from the currently
    running thread goes back to where its permanent home in DRAM is, and
    the new thread fetches 5 cache lines of state of the new thread.
    a) you can start the reads before you start the writes
    b) you can start the writes anytime you have outbound access to "the bus"
    c) the writes can be no later than the ½ cycle before the reads get written. Which is a lot faster than you can do in SW with LDs and STs.

    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.

    I config-space mapped all my CRs, so you get an unlimited number of them.

    I guess I can probably safely rule out MMIO, on the basis that
    context switching by moving registers via MMIO would be slower than
    the current mechanism (of using a series of Load/Store instructions).
    .................
    Yes, but PTHREADing can be done without privilege and in a single
    instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    In my case it is about MemoryLatency+5 cycles.
    Yes, thread switch is a 1-way function--which is the reason you can
    allow a user to preempt himself and allow a compatriot to run in his place.....

    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    My Real Time version of MY 66000 does 10-ish cycle context switch
    (as seen at the CPU) but here a hunk of HW has gathered up those 5 cache lines and sent them to the targeted CPU and all the CPU has to do is push
    out the old state (5 cache lines). So the data was heading towards the
    CPU before the CPU even knew it wanted that data !!

    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler
    changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the
    moment, I am using a fairly simplistic round-robin scheduling
    strategy, so the scheduler mostly starts at a given PID, and looks for
    the next PID that holds a valid/running task (wrapping back to PID 1
    if it hits the end, and stopping the search if it gets back to the
    original PID).


    The high-level threading model wasn't based on pthreads in my case,
    but rather C11 threads (and had implemented a lot of the "threads.h"
    stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads
    were a better fit.


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering
    the registers in the instruction decoder).

    So, in effect, you actually have 33 registers with only 32 visible at
    any instant. I am just so glad not to have gone down that rabbit hole
    this time......

    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
       SP is mapped into R15 in the GPR space;
       SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.


    If I were to eliminate this implicit register-swap mechanism, then the
    ISR entry would likely need to reload a constant address each time.
    Though, this change would also break binary compatibility with my
    existing code.

    But, in theory, eliminating the register swap could allow demoting SP
    to being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode is expensive.


    Though, some of my more recent design ideas would have gone over to an
    ordering slightly more like RISC-V, say:
       R0: ZR or PC  (ALU or MEM)
       R1: LR or TBR (ALU or MEM)
       R2: SP
       R3: GP (GBR)
       R4 -R15: Scratch
       R16-R31: Callee Save
       R32-R47: Scratch
       R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    R0::     GPR, Return Address, proxy for IP, proxy for 0
    R1..R9   Arguments and results passed in registers
    R10..R15 Temporary Registers (scratch)
    R16..R29 Callee Save
    R30      FP when in use, Callee Save
    R31      SP

    Though, if one assumes R4..R63 are GPRs, this would allow both this
    ISA and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC
    to be able to follow RISC-V's C ABI rules would be a non-trivial level
    of effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
       ALU or similar: ZR and LR (Zero and Link Register)
       Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a
    Read-Only register (allowing userland to modify TBR would effectively
    also allow userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder
    based on CPU mode isn't entirely free in terms of LUT cost or timing
    latency (even if it only applies to a subset of the register space).

    Note that for RV decoding:
       X0..X31 -> R0 ..R31 (more or less)
       F0..F31 -> R32..R63

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some
    deviation in some areas (will it care that the high 32 bits of a
    Binary32 register don't hold NaN? Will it care about the extra
    funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost,

    the HW cost has already been purchased by the state machine that writes
    out 5-cache lines and waits for 5-cache lines to arrive.

    and clock-cycle savings?...
    The reads can arrive before you start the writes, you can go so far as
    to organize your pipeline so the read data being written pushes
    out the write data that needs to return to memory, making the timing
    brain dead easy to achieve.


    As-is, I can't come up with much that is both:
       Fairly cheap to implement in hardware;
    Would save a lot of clock cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).

    Yet, you seem to be buying insurance as if you might need to head in that direction.

    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    My 68000 CPU core had a couple of task switching instructions added to
    it. I made a dedicated task switch RAM wide enough to load or store all
    the 68k registers in a single clock. Total task switch time was about
    four clocks IIRC. The interrupt vector table was set up to be able to automatically task switch on interrupt. The RAM had storage for up to
    512 tasks, but it was dedicated inside the CPU core rather than storing
    task information in the memory system.

    Q+ has a 64 register file, so it would take eight or nine cache lines to
    store the context. Q+ register file is 4w18r ATM. Getting from the
    register file to or from a cache line is a challenge. To access groups
    of eight registers at once would mean adding or using eight register
    file ports. The register file has only four write ports so only ½ of a
    cache line could be written to the file in a clock cycle. It is
    appealing to handle multiple registers per clock. Read/write ports are dedicated to specific function units, so making use of them for task
    switching may involve additional logic. I called the CSR to store the
    task state address the TS CSR.

    As I understand it normally RISCV does not use multiple register files,
    it has only a single file. There may be implementations out there that
    do make use of multiple files, but I think the standard is set up to get
    by with a single file.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Fri Nov 24 03:11:17 2023
    Robert Finch wrote:

    On 2023-11-23 6:30 p.m., MitchAlsup wrote:
    BGB wrote:



    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    My 68000 CPU core had a couple of task switching instructions added to
    it. I made a dedicated task switch RAM wide enough to load or store all
    the 68k registers in a single clock. Total task switch time was about
    four clocks IIRC. The interrupt vector table was set up to be able to automatically task switch on interrupt. The RAM had storage for up to
    512 tasks, but it was dedicated inside the CPU core rather than storing
    task information in the memory system.

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Q+ has a 64 register file, so it would take eight or nine cache lines to store the context. Q+ register file is 4w18r ATM. Getting from the
    register file to or from a cache line is a challenge. To access groups
    of eight registers at once would mean adding or using eight register
    file ports. The register file has only four write ports so only ½ of a
    cache line could be written to the file in a clock cycle. It is
    appealing to handle multiple registers per clock. Read/write ports are dedicated to specific function units, so making use of them for task switching may involve additional logic. I called the CSR to store the
    task state address the TS CSR.

    4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
    Yet you chose 18. Why ?

    This is above and beyond the "typical" operand consumption of a RISC ISA.
    Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
    12R allowing 1 FU to consume 3-registers and 1 FU having only 1-operand
    (or forwarding). What are you using the other 5-operands for ??

    As I understand it normally RISCV does not use multiple register files,

    RISC-V has a 32 entry GPR and a 32 entry FPR.

    it has only a single file. There may be implementations out there that
    do make use of multiple files, but I think the standard is setup to get
    by with a single file.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Thu Nov 23 23:37:54 2023
    On 2023-11-23 10:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-23 6:30 p.m., MitchAlsup wrote:
    BGB wrote:



    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    My 68000 CPU core had a couple of task switching instructions added to
    it. I made a dedicated task switch RAM wide enough to load or store
    all the 68k registers in a single clock. Total task switch time was
    about four clocks IIRC. The interrupt vector table was set up to be
    able to automatically task switch on interrupt. The RAM had storage
    for up to 512 tasks, but it was dedicated inside the CPU core rather
    than storing task information in the memory system.

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Q+ has a 64 register file, so it would take eight or nine cache lines
    to store the context. Q+ register file is 4w18r ATM. Getting from the
    register file to or from a cache line is a challenge. To access groups
    of eight registers at once would mean adding or using eight register
    file ports. The register file has only four write ports so only ½ of a
    cache line could be written to the file in a clock cycle. It is
    appealing to handle multiple registers per clock. Read/write ports are
    dedicated to specific function units, so making use of them for task
    switching may involve additional logic. I called the CSR to store the
    task state address the TS CSR.

    4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
    Yet you chose 18. Why ?
    This is above and beyond the "typical" operand consumption of a RISC ISA. Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
    12R allowing 1 FU to consume 3-registers and 1 FU having only 1-operand
    (or forwarding). What are you using the other 5-operands for ??

    As I understand it normally RISCV does not use multiple register files,

    RISC-V has a 32 entry GPR and a 32 entry FPR.

    it has only a single file. There may be implementations out there that
    do make use of multiple files, but I think the standard is setup to
    get by with a single file.

    I have 4w1r replicated 18 times. That is enough read ports to supply
    three operands each to six functional units. All six functional units
    may be scheduled at the same time. I have thought of trying to use fewer
    read ports by prioritizing the ports as it is unlikely that all ports
    would be needed at the same time. The current design is simple, but not resource efficient. Six function units are ALU0, ALU1, FPU, FCU, LOAD,
    STORE. The FCU really only needs two source operands.

    There is no forwarding in the design (yet). I have read this cost about
    10% in performance. I think this may be made up for by a smaller design
    that can operate at a higher fmax. I have found in the past that
    forwarding muxes appear on the critical timing path. I have seen another
    design eliminating forwarding. It made the difference between operating
    at 50 MHz or 60 MHz+. 20% gain in fmax. I think this may be an aspect of
    an FPGA implementation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 24 00:44:04 2023
    On 11/23/2023 5:30 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.

    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and
    have multiple copies of the register file (and likely instructions
    for accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache
    lines;
    There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, there are
    User/Supervisor/Machine sets, with the mode affecting which of these
    is visible.

    Obvious drawback in my case is that this would effectively increase
    the number of internal GPRs from 64 to 192 (and, at that point, may as
    well go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    Decode is not the problem, sensing 1:256 is a big problem, in practice
    even SRAMs only have 32-pairs of cells on a bit line using exotic timed
    sense amps.
    {{Decode is almost NEVER the logic delay problem:: ½ is situation recognition,
    the other ½ is fan-out buffering--driving the lines into the decoder is
    more
    gates of delay than determining if a given select line should be
    asserted.}}


    I had noted that there is a noticeable LUT cost difference between 32
    and 64 GPRs, which seems to be somewhat bigger than the difference
    expected from going from 5b/3b LUTRAMs to 6b/2b LUTRAMs.

    Like, adding a bit to the internal register ID fields (6b to 7b)
    propagated cost across the whole pipeline.


    The alternative would be to handle the register banking in the register
    file, using the CPU mode to select between the possible register banks.

    However, if still using LUTRAMs, the increase in register file size
    would likely increase the number of LUTs by roughly 5x.


    A theoretical estimate for the core number of LUTRAMs and "array support
    LUTs":
    32 GPRs: 396
    64 GPRs: 576
    256 GPRs: 2880

    This is ignoring the LUTs going into things like register forwarding, etc.

    Based on past experience, I suspect the actual cost difference to be a
    bit larger (given, say, the difference between a 32 GPR and 64 GPR configuration is notably larger than 180 LUTs).



    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial
    solution that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
       SH-4 banked out R0..R7 when entering an interrupt;
       The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of
    interrupt. Almost like they designed their interrupt mechanism around
    a particular chunk of ASM code or something. In my case, I kept a
    similar idea, but just used a fixed 8-byte spacing, with the idea of
    these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a
    dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
    which would have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch,

    but why ?? the probability that control returns from a given IST to its softIRQ is less than ½ in a loaded system.


    One might want to jump to save the cost of 2 context switches, but the
    hair this would involve didn't seem worth it.

    It would also result in a few other issues:
    System calls would not be interruptible;
    System calls could not reschedule the caller.
    Effectively, this would hinder things like "usleep()" or "yield()".

    Seemed better to go the route I did.


                                                    but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    Stay away from anything you see in x86 except in using it a moniker
    to avoid.


    Yeah, not really losing much by not having a TSS...
    But, Intel probably thought it was a good idea...


    I think at one point, I had considered having tasks have both User and
    Supervisor state (with two stacks and two copies of all the
    registers), but ended up not going this way (and instead giving the
    syscalls their own designated task context, which also saves on
    per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle
    cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude"
    may have actually been a little conservative; seems like Windows
    syscalls may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take
    millions of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in
    physical memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    6R6W RFs are as big as one can practically build. You can get as much
    Read BW by duplication, but you only have "so much" Write BW (even when
    you know each write is to a different register).

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.

    6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.


    If I were to use all the ports currently available, this would be:
    384 bits out, 192 bits in, per cycle.

    MOV.X can move 128-bits per cycle.

    My L1 cache can currently deal with 128 bits, but going bigger would
    pose issues. Biggest I could theoretically go at present would be
    256-bits with a mandatory 256-bit alignment.

    Anything beyond this would likely require significantly redesigning the
    L1 cache, and possibly also needing to modify the ringbus to do bigger transfers (a 256-bit front-end interface buys little if the operation is dominated by L1 misses).


    And, as-is, a fair chunk of the cost is L1 misses, and bigger transfers
    won't fix this.

    128-bits per cycle works, I can do this from software, ...


    But, yeah, I can still get a theoretical 50% reduction by
    saving/restoring registers directly into the TBR register-save area,
    rather than using "memcpy()" to do so...


    One bit of trickery would be, "what if" the Boot SRAM region were
    inside the L1 cache rather than out on the ringbus?...

    2 things::
    a) By giving threadstate an address you gain the ability to load the
    initial RF image from ROM as the CPU comes out of reset--it comes out
    with a complete RF, a complete thread.header, mapping tables, privilege
    and priority.
    b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
    state (no underlying DRAM address available) so you have ~1MB to play around with until you find DRAM, configure, initialize, and put it in the free pool.)
    So, here, you HAVE "enough" storage to program BOOT activities in a HLL
    (of your choice).


    My Boot ROM is already mostly written in C...

    Well, apart from the sanity checks, which are written in ASM.

    I did cut some corners to save space, for example, the Boot ROM's FAT
    driver is Read-Only and lacks support for LFNs, ... Mostly since for
    finding "bootload.sys" in the root directory, I don't need anything
    beyond 8.3 filenames, ...
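    Since FAT stores the short name as 11 upper-case, space-padded bytes,
    the lookup for "bootload.sys" really only needs a memcmp; a minimal
    sketch (not the actual Boot ROM code):

        #include <string.h>

        /* Match a raw 11-byte 8.3 directory-entry name against BOOTLOAD.SYS. */
        static int is_bootload_sys(const unsigned char *dirent_name)
        {
            return memcmp(dirent_name, "BOOTLOADSYS", 11) == 0;
        }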


    But, then one would have the cost of keeping 8K of SRAM close to the
    CPU core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any
    case...).

    Is the Icache and Dcache not close enough ?? If not then add L2 !!


    Accessing the Boot SRAM is around the same latency as accessing the L2
    cache, but has the advantage that it can never have an L2 miss.


    Though keeping it tied to a specific CPU core (and effectively
    processor local) would avoid the ugly "what if" scenario of two CPU
    cores trying to service an interrupt at the same time and potentially
    stepping on each others' stacks. The main tradeoff vs putting the
    stacks in DRAM is mostly that DRAM may have (comparably more
    expensive) L2 misses.


    Realized a simpler solution to the above issue (without needing to significantly redesign stuff):
    Make the SRAM area bigger and then subdivide it for each CPU core.

    Say:
    Core 1 gets a stack at 0x0000DF00, Core 2 at 0x0000FF00
    Or:
    Core 1 at 0x0000CF80
    Core 2 at 0x0000DF80
    Core 3 at 0x0000EF80
    Core 4 at 0x0000FF80

    And then hope none of the ISR's overflows their assigned stack space.
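    With the example addresses above, the per-core carve-up is just an
    offset computation; a sketch, using the figures from the text (4 KB of
    SRAM per core, cores numbered from 0 here):

        #include <stdint.h>

        #define SRAM_STACK_TOP0  0x0000CF80u   /* core 0 ("Core 1" above)  */
        #define SRAM_STACK_STEP  0x00001000u   /* 4 KB of SRAM per core    */

        static inline uint32_t isr_stack_top(int core_id)
        {
            /* core 0 -> 0xCF80, core 1 -> 0xDF80, core 2 -> 0xEF80, ... */
            return SRAM_STACK_TOP0 + (uint32_t)core_id * SRAM_STACK_STEP;
        }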


    The interrupt (re)mapping table takes care of this prior to the CPU being bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
    table associated with the "Originating" thread. (IO/-MMU). That interrupt
    is logged into the table and if enabled its priority is used to determine which set of CPUs should be bothered, the affinity mask of the "Originating"
    thread is used to qualify which CPU from the priority set, and one of these is selected. The selected CPU is tapped on the shoulder, and sends a get-Interrupt request to the Interrupt table logic which sends back the priority
    and number of a pending interrupt. If the CPU is still at lower priority
    than the returning interrupt, the CPU <at this point> stops running code
    from the old thread and begins running code on the new thread.
    {{During the sending of the interrupt to the CPU and the receipt of the claim-Interrupt message, that interrupt will not get handed to any other CPU}} So, the CPU continues to run instructions while the CPUs contend
    for and claim unique interrupts. There are 512 unique interrupts at each of
    64 priority levels, and each process can have its own Interrupt Table.
    These tables need no maintenance except when interrupts are created and destroyed.}}

    HV, Guest HV, and Guest OS each have their own unique interrupt tables,
    although it could be arranged such that all use the same table.


    Hmm, my boot-time state is more like:
    SR: Initialized to Supervisor mode and BJX2 Baseline;
    PC: Set to 0;
    MMCR: Set to 0;
    VBR: Set to 0;
    Most everything else: Potentially random garbage.

    When the RESET signal is asserted, some logic on the RingBus also
    effectively flushes anything on the bus to 0, since on the FPGA, it
    tends to start up in a state where the bus is filled with garbage (all
    the FF's and LUTRAMs tend to start up containing garbage, but curiously
    all of the BRAM's seem to be cleared to 0).

    Interestingly, my Verilog simulations artificially inject some amount of
    random garbage for testing purposes (otherwise, Verilator seems to start
    up with everything cleared to 0, unlike the real FPGA).


    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code
    from the ISR stack or similar.


    Though, could make sense if one has a mechanism where a context
    switch could have a mechanism to dump the whole register file to
    Block-RAM, and some sort of mechanism to access this RAM via an MMIO
    interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this
    route (baking parts of it into the C ABI is at least "slightly" less
    evil).

    My mechanism is taking that struct task.....s (at least the part HW
    needs to understand) and associating each one with a table entry that
    points at DRAM. Now, when you want this thread to run, you load up the
    pointer, set the e-bit (enabled), and write it into the current header at
    its privilege level. Poof--all 5 cache lines of state from the currently
    running thread go back to where their permanent home in DRAM is, and
    5 cache lines of state for the new thread are fetched.
    a) you can start the reads before you start the writes
    b) you can start the writes anytime you have outbound access to "the bus"
    c) the writes can be no later than the ½ cycle before the reads get
    written. This is a lot faster than you can do in SW with LDs and STs.


    I ended up simplifying this problem slightly:
    A previously reserved pointer at offset 0x0020 was repurposed as a
    designated pointer to the register-save area.

    This effectively turns most of both the TKPE_TaskInfo_s and
    TKPE_TaskInfoKern_s structures into "don't care" as far as the ISR
    prolog/epilog is concerned.

    Or, IOW: "(TBR, 0x20) holds a 64-bit pointer to the register save area".
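
    In C terms, a minimal sketch of what the ISR prolog relies on (only the
    0x20 offset is from the above; the other field names are made up):

      #include <stdint.h>

      struct tkpe_taskinfo_hdr {
          uint64_t  reserved0[4];     /* 0x00..0x1F: don't-care to the ISR       */
          uint64_t *reg_save_area;    /* 0x20: pointer to the register save area */
          /* ... rest of TKPE_TaskInfo_s, also don't-care here ... */
      };

      static inline uint64_t *isr_reg_save(const struct tkpe_taskinfo_hdr *tbr)
      {
          return tbr->reg_save_area;  /* "(TBR, 0x20)" */
      }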


    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.

    I config-space mapped all my CRs, so you get an unlimited number of them.


    I have encoding space for 64 in theory (in XG2), 32 in Baseline.

    In practice, it is a little more limited, as the register ID space is
    also used for SPRs and special internal-use registers (like ZR, IMM, ...).

    Say:
    00..3F: GPRs
    40..5F: SPRs and special-use
    60..7F: CRs.
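
    Or, as a trivial C sketch of that split (the helper name is made up):

      enum regclass { REG_GPR, REG_SPR, REG_CR };

      static inline enum regclass classify_reg(unsigned id)  /* id = 0x00..0x7F */
      {
          if (id < 0x40) return REG_GPR;   /* 00..3F: GPRs                 */
          if (id < 0x60) return REG_SPR;   /* 40..5F: SPRs and special-use */
          return REG_CR;                   /* 60..7F: CRs                  */
      }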

    The bigger issue, though, is that (unlike GPRs) the CRs are implemented
    using FFs rather than LUTRAM, and thus every CR is relatively expensive.


    A bunch of potential CR assignments were burnt on the SMT feature, which
    never materialized (and probably won't; mostly as I realized that the
    original considered strategy for trying to implement SMT would have
    ended up likely being more expensive than having two logical processor
    cores).

    About the only resources that would really make sense to share SMT style
    ATM would likely be the FPU and SIMD unit. It would likely also make
    more sense to have two mostly-independent pipelines, rather than a
    single extra-wide pipeline with logically co-issued threads.

    But, if I revisit the idea, it would likely end up looking more like a
    pair of semi-conjoined cores behaving as-if they were two independent cores.

    But, not yet reclaimed the register numbers.


    I guess I can probably safely rule out MMIO on the basis that
    context switching via moving registers via MMIO would be slower than
    the current mechanism (of using a series of Load/Store instructions).
    .................
    Yes, but PTHREADing can be done without privilege and in a single
    instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    In my case it is about MemoryLatency+5 cycles.
    Yes, thread switch is a 1-way function--which is the reason you can
    allow a user to preempt himself and allow a compatriot to run in his place.....


    OK.


    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    My Real Time version of My 66000 does a 10-ish cycle context switch
    (as seen at the CPU), but here a hunk of HW has gathered up those 5 cache
    lines and sent them to the targeted CPU, and all the CPU has to do is push
    out the old state (5 cache lines). So the data was heading towards the
    CPU before the CPU even knew it wanted that data !!


    Hmm.

    In my case, it seems more like stuff is going to get caught up in a
    series of L1 misses.

    Though, rapid-fire syscalls would at least have the advantage that their
    data is more likely to already be in-cache.


    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler
    changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the
    moment, I am using a fairly simplistic round-robin scheduling
    strategy, so the scheduler mostly starts at a given PID, and looks for
    the next PID that holds a valid/running task (wrapping back to PID 1
    if it hits the end, and stopping the search if it gets back to the
    original PID).
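
    A minimal C sketch of that search (the PID limit and the "runnable" test
    are hypothetical placeholders):

      #define MAX_PID 256                      /* hypothetical upper bound */

      extern int task_is_runnable(int pid);    /* hypothetical predicate   */

      static int sched_next_pid(int cur_pid)
      {
          int pid = cur_pid;
          do {
              pid++;
              if (pid >= MAX_PID)
                  pid = 1;                     /* wrap back to PID 1 */
              if (task_is_runnable(pid))
                  return pid;
          } while (pid != cur_pid);
          return cur_pid;                      /* nothing else runnable */
      }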


    The high-level threading model wasn't based on pthreads in my case,
    but rather C11 threads (and had implemented a lot of the "threads.h"
    stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads
    were a better fit.


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering
    the registers in the instruction decoder).

    So, in effect, you actually have 33 registers with only 32 visible at
    any instant. I am just so glad not to have gone down that rabbit hole
    this time......


    More like 65 with 64 visible at any given time.

    Early on, R0 and R1 had also swapped places with doppelganger
    counterparts, in a similar way, but I eliminated this early on.


    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
       SP is mapped into R15 in the GPR space;
       SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.
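
    In C terms, the decoder-side renaming amounts to something like this
    (the internal index value for SSP is made up):

      enum { REGIX_SSP = 64 };   /* hypothetical internal storage index for SSP */

      /* In ISR mode, architectural R15 selects SSP's storage instead of SP's. */
      static inline unsigned map_r15(unsigned regid, int in_isr)
      {
          if (regid == 15 && in_isr)
              return REGIX_SSP;
          return regid;
      }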


    If I were eliminate this implicit register-swap mechanism, then the
    ISR entry would likely need to reload a constant address each time.
    Though, this change would also break binary compatibility with my
    existing code.

    But, in theory, eliminating the register swap could allow demoting SP
    to being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode are expensive.


    Though, some of my more recent design ideas would have gone over to an
    ordering slightly more like RISC-V, say:
       R0: ZR or PC  (ALU or MEM)
       R1: LR or TBR (ALU or MEM)
       R2: SP
       R3: GP (GBR)
       R4 -R15: Scratch
       R16-R31: Callee Save
       R32-R47: Scratch
       R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    R0::     GPR, Return Address, proxy for IP, proxy for 0
    R1..R9   Arguments and results passed in registers
    R10..R15 Temporary Registers (scratch)
    R16..R29 Callee Save
    R30      FP when in use, Callee Save
    R31      SP



    As-is, it is more like:
    R0: DLR or PC
    R1: DHR or GBR
    R2/R3: Scratch / Return
    R4..R7: Scratch / Arg0..Arg3
    R8..R14: Callee Save
    R15: SP
    R16..R19: Scratch
    R20..R23: Scratch / Arg4..Arg7
    R24..R31: Callee Save
    R32..R35: Scratch
    R36..R39: Scratch / Arg8..Arg11 (Opt)
    R40..R47: Callee Save
    R48..R51: Scratch
    R52..R55: Scratch / Arg12..Arg15 (Opt)
    R56..R63: Callee Save

    Which was effectively taking the general pattern for R0..R15, and then essentially repeating it 4 times.


    With RISC-V using partial remapping:
    X0: ZR
    X1: LR
    X2: SP
    X3: GBR
    X4: TBR (Read Only in Usermode)
    X5: DHR
    X6..X13: R6..R13
    X14: R2
    X15: R3
    X16..X31: R16..R31


    XG2RV uses RISC-V's register space, with a slightly tweaked version of
    XG2's encoding scheme.

    Initial plan was for XG2RV to use RISC-V's ABI, which could in theory
    allow thunk-free cross-ISA function calls, but... Getting BGBCC to
    support RISC-V's ABI would be a pain, and otherwise there is no real
    plausible way at the moment to link XG2 code and RISC-V code into a
    single binary, rendering the whole idea "kinda moot".


    Though, if one assumes R4..R63 are GPRs, this would allow both this
    ISA and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC
    to be able to follow RISC-V's C ABI rules would be a non-trivial level
    of effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
       ALU or similar: ZR and LR (Zero and Link Register)
       Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a
    Read-Only register (allowing userland to modify TBR would effectively
    also allow userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder
    based on CPU mode isn't entirely free in terms of LUT cost or timing
    latency (even if it only applies to a subset of the register space).

    Note that for RV decoding:
       X0..X31 -> R0 ..R31 (more or less)
       F0..F31 -> R32..R63

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some
    deviation in some areas (will it care that the high 32 bits of a
    Binary32 register don't hold NaN? Will it care about the extra
    funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost,

    the HW cost has already been purchased by the state machine that writes
    out 5 cache lines and waits for 5 cache lines to arrive.


    I don't have anything like this either...

    Miss handling is more like:
    L1 Cache sees that request has missed;
    Signal a pipeline stall;
    Throw requests onto the ringbus;
    Wait for responses to arrive;
    Execution continues when "all is good".

    State is mostly controlled with state flags, say:
    Has A sent a Store request;
    Has A sent a Load request;
    Has A gotten a response for a Store request;
    Has A gotten a response for a Load request;
    Has B sent a Store request;
    Has B sent a Load request;
    Has B gotten a response for a Store request;
    Has B gotten a response for a Load request;
    ...
    With an if/else tree dealing with the various cases.

    The ordering of the if/else tree will determine which order requests are
    sent, say:
    Store A
    Store B
    Load A
    Load B

    And checks to keep the pipeline stalled if a request has been sent but
    the corresponding response has not yet arrived.
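
    As a rough C model of that flag-driven logic (the real logic is Verilog;
    the hook names and the two-way A/B split here are illustrative):

      /* One call per "cycle"; returns nonzero while the pipeline should stall. */
      struct miss_state {
          unsigned need_st_a:1, need_ld_a:1, need_st_b:1, need_ld_b:1;
          unsigned sent_st_a:1, sent_ld_a:1, sent_st_b:1, sent_ld_b:1;
          unsigned done_st_a:1, done_ld_a:1, done_st_b:1, done_ld_b:1;
      };

      extern void bus_send_store(int way);   /* hypothetical ringbus hooks */
      extern void bus_send_load(int way);

      static int miss_step(struct miss_state *s)
      {
          /* if/else ordering fixes the request order: Store A, Store B, Load A, Load B */
          if (s->need_st_a && !s->sent_st_a)      { bus_send_store(0); s->sent_st_a = 1; }
          else if (s->need_st_b && !s->sent_st_b) { bus_send_store(1); s->sent_st_b = 1; }
          else if (s->need_ld_a && !s->sent_ld_a) { bus_send_load(0);  s->sent_ld_a = 1; }
          else if (s->need_ld_b && !s->sent_ld_b) { bus_send_load(1);  s->sent_ld_b = 1; }

          /* stay stalled while any sent request still lacks a response */
          return (s->sent_st_a && !s->done_st_a) || (s->sent_ld_a && !s->done_ld_a) ||
                 (s->sent_st_b && !s->done_st_b) || (s->sent_ld_b && !s->done_ld_b);
      }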


    and clock-cycle savings?...
    The reads can arrive before you start the writes, you can go so far as
    to organize your pipeline so the read data being written pushes
    out the write data that needs to return to memory, making the timing
    brain-dead easy to achieve.


    Wait, so, like pipelining requests/responses to external RAM?...

    In my case, all L1<->L2 communication is effectively synchronous and
    there isn't really any overlap between separate memory accesses (a given
    access will need to finish before any new memory access can begin).

    Getting too fancy here could raise issues, as the bus design introduces
    a certain level of "chaos" (responses will often not arrive in the same
    order the requests were sent, at the whim of L2 hit/miss).


    I have noted that I can have a RAM-backed framebuffer and a rasterizer
    module without significantly affecting memory bandwidth for the main CPU
    core (the different entities using the bus being mostly invisible to
    each other).


    Well, except when trying to switch the display module into 800x600 72Hz 256-color or hi-color mode or similar (then the screen turns to garbage
    and memory performance seemingly goes to crap).

    Seemingly, about the highest it can manage relatively OK is 640x480 60Hz 256-color.

    Previously, 800x600 worked OK at 36Hz, but this is rather non-standard.

    The 256-color mode works OK-ish, but:
    Still don't have a particularly good "generic" 256-color palette.

    Say, examples of the palette can be seen here:
    https://twitter.com/cr88192/status/1727574073566257232


    Also the process of remapping everything from RGB555 to 256-color seems
    to be kinda slow.

    With a sort of "no great option":
    15-bit lookup tables are too big, and will result in excessive L1 cache
    misses.

    Bit-twiddling RGB555 into GBR533 involves a lot of bit twiddling and gives
    slightly worse-looking results, but has faster start-up times (it is a lot
    faster to regenerate an 11-bit lookup table than a 15-bit lookup table).

    Where, say, RGB555 to GBR533 is, say:
    ((v&0x3FC)<<1)|((v>>12)&7)
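
    As a C sketch (assuming RGB555 with R in bits 14:10, G in 9:5, B in 4:0,
    and a hypothetical 2048-entry palette-index table regenerated at startup):

      #include <stdint.h>

      static uint8_t pal_lut_gbr533[2048];     /* 2^11 entries */

      static inline uint8_t rgb555_to_pal(uint16_t v)
      {
          /* Keep G[4:0] and B[4:2], then append R[4:2]: an 11-bit GBR533 index. */
          uint16_t idx = (uint16_t)(((v & 0x3FC) << 1) | ((v >> 12) & 7));
          return pal_lut_gbr533[idx];
      }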


    Generally, things related to redrawing the windows and refreshing the
    screen tend to dominate CPU usage in this case.

    As can be noted, both Doom and Hexen are stuck at "rather crap" framerates
    of around 8 fps.

    Palette used for the Doom image:
    16 shades of 16 colors;
    0z: Grayscale
    1z-6z: High Saturation
    9z-Ez: Low Saturation
    7z/8z/Fz: Off-White
    Newer palette:
    13 shades of 18 colors + RGBI colors;
    0z: Grayscale
    1z-6z: High Saturation
    9z-Ez: Low Saturation
    7z/8z/Fz: Off-White
    Vertically:
    z0: RGBI
    z1: Orange / Amber
    z2: Sky Blue

    In a prior variant, there was an olive-green axis, but I left it out of
    this one. The RGBI colors were because otherwise these colors could not
    be recreated faithfully (eg: for console text or 16-color bitmaps).
    Orange and Sky Blue slightly improve color-fidelity, and the loss of
    various "almost but not quite black" colors was not a huge loss.

    However, 0x70 and 0x80 were redundant, so 0x80 was repurposed as a
    transparent color. F0 and 0F are "not quite" redundant:
    F0 is 7FFF, 0F is 7BDE.



    As-is, I can't come up with much that is both:
       Fairly cheap to implement in hardware;
       Able to save a lot of clock cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).

    Yet, you seem to be buying insurance as if you might need to head in that direction.


    Yeah, but I don't really want to go there either...


    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Whereas that special State is only a dozen register <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    OK.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to MitchAlsup on Fri Nov 24 08:19:34 2023
    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Yes, x86 messes up even these aspects. VLE does not have to be
    byte granular or use multiple prefixes in variable order. Hardware
    context save/restore does not have to be limited to extended
    state. A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation, even
    with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, split GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit compute
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect you could write a multi-volume treatise on x86 about hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to Paul A. Clayton on Fri Nov 24 09:43:23 2023
    On 2023-11-24 8:19 a.m., Paul A. Clayton wrote:
    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Yes, x86 messes up even these aspects. VLE does not have to be
    byte granular or use multiple prefixes in variable order. Hardware
    context save/restore does not have to be limited to extended
    state. A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation, even
    with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, spit GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit computer
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect you could write a multi-volume treatise on x86 about hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    There is a lot of value in having a unique architecture. The x86 has had
    a lot of things bolted on to it. It has adapted over time. Being able to
    see how things have changed is valuable. I suspect just about any
    architecture adapted over a 40- or 50-year period would look not so
    appealing. I happen to like the segmented approach, not necessarily
    because it is a good way to do things, but it was certainly interesting
    and challenging. An interesting, challenging, and somewhat mysterious architecture may be more appealing than the best organized, most
    performant, energy efficient one. There is a trade-off between ‘the
    best’ and the ‘human factor’. I can imagine that there might be treaties limiting computer performance somewhere. Just how fast of a CPU is legal?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Nov 24 13:41:26 2023
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Fri Nov 24 18:24:00 2023
    Paul A. Clayton wrote:

    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    It is these things which allow my architecture to only need 70%
    of the instructions RISC-V needs.

    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Having 1 flat set of registers (any register can hold any result or
    operand) is a My 66000 requirement. The only things I took from x86-64 are
    the [base+index<<scale+displacement] memory addressing model and
    the 2-level MMU, and even here I used the I/O MMU version rather than
    the processor version.

    Yes, x86 messes up even these aspects.
    VLE does not have to be byte granular or use multiple prefixes in variable order.
    VLE does not need prefixes of any kind.
    Hardware context save/restore does not have to be limited to extended state.
    HW S/R is most useful when it deals with ALL the state.
    A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation,
    One can synthesize SIMD and Vector saving 90% of the OpCode space
    even with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, spit GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit computer
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect this came out of already having to implement HW for
    the IC (Insert Character) instruction from System/360 time.

    I suspect you could write a multi-volume treatise on x86 about hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    It would be more beneficial to the world just to build an architecture
    without any of those flaws--just to show them how it's done.

    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    Captain Obvious would be proud.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Fri Nov 24 22:21:53 2023
    Stefan Monnier wrote:

    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    This seems to mimic RISC-V set of levels but done/named differently.

    The Guest OS and Guest HV levels are done in such a way that you can have
    a stack of Guest OSs of any depth and a stack of Guest HVs of any depth;
    although the HW only supports 4 levels, HW with SW intervention supports
    any number of levels.

    In particular: Guest OS manages faults from Application, Guest HV manages faults from Guest OS {Which makes it possible to recover from page faults
    in the "sticky" places of interrupt and exception handling}, Real HV
    manages faults from Guest HV.



    Application accesses only 63 bits of virtual address space. If the
    application makes an access with the HOB of the virtual address set, the
    access takes a fault.

    Guest OS can reach down into Application by accessing with the HOB clear (0)
    or access its own VAS with the HOB set (1).

    Guest HV can reach down into Guest OS by accessing with the HOB set (1)
    or access its own VAS with the HOB clear(0).

    Real HV can reach down into Guest HV by accessing with the HOB clear (0)
    or access its own VAS with the HOB set (1).
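
    A minimal C sketch of that HOB rule as described (the encoding and names
    are illustrative only):

      #include <stdint.h>

      enum level { APP, GUEST_OS, GUEST_HV, REAL_HV };

      /* 1 = access reaches down one level, 0 = accessor's own space, -1 = fault. */
      static inline int hob_target(enum level lvl, uint64_t va)
      {
          int hob = (int)(va >> 63);
          if (lvl == APP)
              return hob ? -1 : 0;        /* application faults when HOB is set */
          /* Guest OS reaches down on HOB=0, Guest HV on HOB=1, Real HV on HOB=0 */
          int down_hob = (lvl == GUEST_HV) ? 1 : 0;
          return (hob == down_hob) ? 1 : 0;
      }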


    Assuming we are running with a HV::
    Application accesses use 2-level paging through Application Mapping Tables.
    Guest OS accesses to Application use 2-level paging through Application
    Mapping Tables; accesses to its own Guest OS space use 2-level paging
    through Guest OS Tables.
    Guest HV accesses to Guest OS use 2-level paging through Guest OS Tables,
    and accesses to Guest HV use 1-level paging through Guest HV Tables.
    Real HV accesses to Guest HV use 1-level paging through Guest HV Tables,
    and accesses to Real HV use 1-level paging through Real HV Tables.

    --------------

    When a 2-level mapping creates an UnCacheable, MMI/O, ROM, or config space
    access, this intermediate address space determines the memory order. So,
    Guest OS can make a process's address space sequentially consistent by
    making all the PTEs use MMI/O space accesses. The second level of
    translation will, then, translate that access back to <say> cacheable DRAM
    to be performed. Likewise, should the second level of translation produce
    an access other than cacheable DRAM, memory order is determined by the
    stronger method of both translations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Sat Nov 25 00:01:00 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differentiate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Fri Nov 24 20:57:03 2023
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV} >>
    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    Not entirely sure how multi-level virtualization works with page-tables,
    but works "somehow".


    Then again, it is possible that doing everything in software could lead
    to people working in inner levels being jealous of those working in the
    outer levels for being closer to the hardware (and thus presumably
    having lower performance overheads).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 24 20:49:57 2023
    On 11/24/2023 12:24 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    It is these things which allows my architecture to only need 70%
    of the instructions RISC-V needs.


    In some of my tests, the total number of executed instructions tends to
    be less than RISC-V as well.


    Best I can tell, the main things that save instructions are mostly:
    Register-indexed load/store (~ 30% of Ld/St);
    MOV.X (~ 12% of Ld/St);
    Jumbo prefixes (~ 6%).


    Though, apparently, someone posted something recently showing RV64 and
    ARM64 to be much closer than expected, which is curious. The main
    instructions that seem to have "the most bang for the buck" are ones
    that ARM64 has equivalents of.

    More testing may be needed.



    In other news:
    Did add the compiler support to eliminate the "memcpy()" step from the
    task-switching (by having the prolog/epilog save/restore registers
    directly from the task context).

    Should roughly halve syscall overhead, along with shaving 480 bytes off
    the stack frame (some of the GPRs and CRs still need to be shuffled via
    the stack, so it only saves 480 bytes rather than 640).


    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Having 1 set of flat (any register can do any result or operand) is
    a My 66000 requirement, The only things I took from x86-64 is
    the [base+index<<scale+displacement] memory addressing model, and
    the 2-level MMU, even here I used the I/O MMU version rather than the processor version.


    Yeah. Flat registers are good.

    Internally, [base+index*scale] or [base+disp*scale] can do "most of it"...


    Having [base+index*scale+disp] could do a little more, but seems to be
    somewhat rarer. I had experimented with such an encoding, but it didn't
    seem like it saw enough use-cases to justify the cost of its existence.

    Granted, it might be more useful if it could be encoded like:
    JumboImm+JumboOp+LdSt
    With a 33 bit displacement (rather than a 9/11 bit displacement), as
    this could potentially allow using it to address global arrays.

    IOW, potentially allowing:
    FEdd-dddd-FFw0-0Vdd-F0nm-0eoZ MOV.x (Rm, Ro*Sc, Disp33s), Rn
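
    For example, the kind of access that encoding would aim at (hypothetical C,
    just to show the shape of the address computation):

      static int big_table[1 << 20];   /* hypothetical global array */

      int fetch(int i)
      {
          /* conceptually one instruction: base + index*4 + 33-bit displacement */
          return big_table[i];
      }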


    Yes, x86 messes up even these aspects. VLE does not have to be byte
    granular or use multiple prefixes in variable order.
    VLE does not need prefixes of any kind.
    Hardware context save/restore does not have to be limited to extended
    state.
    HW S/R is most useful when it deals with ALL the state.
    A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation,
    One can synthesize SIMD and Vector saving 90% of the OpCode space
    even with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, spit GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit computer
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect this came out of already having to implement HW for IC (insert Character) instruction from System 360 time.

    Seems sane.


    I suspect you could write a multi-volume treatise on x86 about
    hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    It would be more beneficial to the world just to build an architecture without any of those flaws--just to show them how its done.


    People can probably debate what is ideal.


    There seem to be people around who see RISC-V as the model of perfection.

    I disagree: some things seem to be corner-cutting in areas where
    doing so is a foot-gun, other areas are needlessly expensive, and
    some things in the reaches of "extensions land" are just kinda absurd.

    In some ways, it is (as I see it) better to define some things and leave
    them as optional, rather than define little, and leave everyone else to
    make an incoherent mess of things.


    Then again, there are likely disagreements as to what sorts of features
    seem meaningful, wasteful, or a needless extravagance.


    Granted, it does seem like x86 probably needs to be retired at some point...


    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    Captain Obvious would be proud.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sat Nov 25 16:55:30 2023
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV} >>>
    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3. >>
    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).

    Both intel and AMD use a block of memory to record guest state
    and have instructions to enter and leave VM mode (e.g. vmenter);
    ARM stores guest state in system registers - less overhead
    when switching from guest to host or guest to guest.


    Not entirely sure how multi-level virtualization works with page-tables,
    but works "somehow".

    But not well, nor performant.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sat Nov 25 12:31:36 2023
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3. >>>
    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode
    (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Granted, there is still the annoyance that the OS's tend to deal with page-tables, and one needs to translate to inverted page tables, which typically have a finite associativity (such as 4 or 8 way).

    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).


    Granted, full soft TLB isn't ideal for performance either (in general),
    my workaround was mostly making the TLB big enough that the average-case
    miss rate is kept fairly low (well, and for now, putting the whole OS in
    one big address space).

    But, multiple address spaces is sort of the whole point of VMs, so...

    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).


    Both intel and AMD use a block of memory to record guest state
    and have instructions to enter and leave VM mode (e.g. vmenter);
    ARM stores guest state in system registers - less overhead
    when switching from guest to host or guest to guest.


    OK.



    Not entirely sure how multi-level virtualization works with page-tables,
    but works "somehow".

    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Nov 25 19:27:04 2023
    BGB wrote:

    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode >>> (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Think of it like this:: Privilege inversion::

    If HV is performing table walks on behalf of Guest OS, HV is having to
    rummage through Guest OS tables and then rummage through HV's own tables.
    Here, having HV rummage through Guest OS tables is more than a hassle:
    nothing in HV should directly touch anything in Guest OS unless Guest
    OS grants access (and not implicitly, as it is herein).

    What you REALLY want is for Guest OS to manage its own tables and HV to
    manage its own tables. Thereby, no particular piece of SW is capable of
    operating at the lowest privilege of {Guest OS, HV}; it can be one or the
    other.

    The above holds for any kind of tables, nested, inverted, nested inverted,
    ..

    Granted, there is still the annoyance that the OS's tend to deal with page-tables, and one needs to translate to inverted page tables, which typically have a finite associativity (such as 4 or 8 way).

    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).


    Granted, full soft TLB isn't ideal for performance either (in general),
    my workaround was mostly making the TLB big enough that the average-case
    miss rate is kept fairly low (well, and for now, putting the whole OS in
    one big address space).

    But, multiple address spaces is sort of the whole point of VMs, so...

    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).


    Both intel and AMD use a block of memory to record guest state
    and have instructions to enter and leave VM mode (e.g. vmenter);
    ARM stores guest state in system registers - less overhead
    when switching from guest to host or guest to guest.


    OK.



    Not entirely sure how multi-level virtualization works with page-tables, >>> but works "somehow".

    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sat Nov 25 19:28:50 2023
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode >>> (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Let's look at a hardware example of a nested page table walk,
    using the AMD nested page table feature as a guide. The AMD
    version uses the same PTE format as the non-virtualized page
    tables (which reduces the amount of kernel code required to
    manage the page tables) unlike Intel's EPT.

    Assuming 4k-byte pages in both the primary and nested page tables,
    a page table walk must make 22 memory accesses to satisfy a
    VA to PA translation, versus only four in a non-virtualized
    table walk. This can be reduced to 11 if you have the luxury
    of using 1GB mappings in the nested page table.

    Performing all those accesses in a kernel fault handler would
    consume a great deal more time than a hardware table walker will (particularly if the hardware table walkers can cache the intermediate results
    of the higher-level blocks in the walk in the walk hardware).

    The downsides of IPT pretty much preclude their use in most
    modern operating systems where shared memory between processes
    is common (explicitly -or- implicitly (such as VDSO on linux));
    some of the goals listed as benefits for IPT (e.g. easier whole
    process swapping) are made irrelevant by modern operating
    systems that don't do that. There's a rather incoherent
    description of IPT at geeksforgeeks - I'd not recommend it
    as a useful resource.


    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).

    If you're taking an interrupt to resolve guest TLB misses,
    performance is clearly not a high priority.



    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. Thats an advantage of the exception level scheme, where
    each level has its own set of control registers.

    However, a shortcoming of the initial implementation was that if the
    hypervisor was type II, the hypervisor needed to have a special
    privileged guest to run standard user-mode code[*]. So they
    added (in V8.1) the virtual host extensions (VHE), which allowed
    the hypervisor exception level (EL2) to directly dispatch
    user-mode code to EL0 (with the normal traps from usermode
    to the OS directed to the hypervisor instead of a guest OS). This
    lets the hypervisor (e.g. KVM) act both as a hypervisor and
    a guest OS without the context switches required to support a
    privileged guest.

    [*] And also to provide VFIO support for non-SRIOV hardware devices.


    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    Before AMD added NPT (Nested Page Tables), the hypervisor needed to
    be able to recognize and trap any accesses from the guest OS to
    its own page tables and update the real page tables accordingly.
    To do that, they had several options:
    1) Paravirtualization (i.e. all guest page table ops call the
    hypervisor rather than changing the page tables directly);
    Xen did this.
    2) Write-protecting the page tables and trapping any writes in
    the hypervisor. Difficult to do since the page tables in
    common OS are not allocated contiguously and they are updated
    using normal loads and stores (the HV does know them, however,
    as it can trap writes to CR3 and from there can write-protect
    the entire table in the real page tables).
    3) Binary patch the guest operating system. This was the approach used
    by VMware before AMD introduced NPT.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sat Nov 25 22:10:45 2023
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things
    like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).



    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).

    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.

    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise. Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.
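
    Illustratively, a tagged entry and its hit test might look something like
    this (field names are placeholders):

      #include <stdbool.h>
      #include <stdint.h>

      struct tlb_entry {
          uint64_t vpn, ppn;      /* virtual / physical page numbers */
          uint16_t asid, vmid;
          bool     global;        /* set via the page table: skip ASID/VMID match */
          bool     valid;
      };

      static bool tlb_hit(const struct tlb_entry *e, uint64_t vpn,
                          uint16_t asid, uint16_t vmid)
      {
          return e->valid && e->vpn == vpn &&
                 (e->global || (e->asid == asid && e->vmid == vmid));
      }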


    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
    full TLB flush on context-switch would suck pretty bad).

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtual address space is shared by all processes - there's no reason
    that those entries need to be flushed on context-switch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sat Nov 25 15:39:53 2023
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differentiate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of
    multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Let's look at a hardware example of a nested page table walk,
    using the AMD nested page table feature as a guide. The AMD
    version uses the same PTE format as the non-virtualized page
    tables (which reduces the amount of kernel code required to
    manage the page tables) unlike Intel's EPT.

    Assuming 4k-byte pages in both the primary and nested page tables,
    a page table walk must make 22 memory accesses to satisfy a
    VA to PA translation, versus only four in a non-virtualized
    table walk. This can be reduced to 11 if you have the luxury
    of using 1GB mappings in the nested page table.
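
    A hedged sketch of why the count multiplies (not AMD's actual walker;
    4-level tables assumed, no walk caching): every guest-physical address
    produced during the guest's walk, including the guest's own table
    pointers, must itself be pushed through the host's nested tables
    before it can be read:

      #include <stdint.h>

      #define LEVELS 4

      extern uint64_t read_phys(uint64_t pa);        /* one real memory access */

      static uint64_t index_at(uint64_t a, int lvl)  /* 9 VA bits per level    */
      {
          return (a >> (12 + 9 * (LEVELS - 1 - lvl))) & 0x1FF;
      }

      /* Ordinary walk of the host (nested) tables: LEVELS accesses.          */
      static uint64_t host_walk(uint64_t ncr3, uint64_t gpa)
      {
          uint64_t table = ncr3;
          for (int lvl = 0; lvl < LEVELS; lvl++)
              table = read_phys(table + 8 * index_at(gpa, lvl)) & ~0xFFFull;
          return table | (gpa & 0xFFF);
      }

      uint64_t nested_translate(uint64_t gcr3, uint64_t ncr3, uint64_t gva)
      {
          uint64_t gtable = gcr3;
          for (int lvl = 0; lvl < LEVELS; lvl++) {
              /* the guest's table pointer is guest-physical: nested walk first */
              uint64_t htable = host_walk(ncr3, gtable);
              uint64_t gpte   = read_phys(htable + 8 * index_at(gva, lvl));
              gtable = gpte & ~0xFFFull;
          }
          /* finally translate the guest-physical address of the data itself   */
          return host_walk(ncr3, gtable | (gva & 0xFFF));
      }

    Counting the accesses in the loops shows how the total lands in the
    twenties rather than four, and why caching the intermediate host-walk
    results pays off so well.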

    Performing all those accesses in a kernel fault handler would
    consume a great deal more time than a hardware table walker will (particularly
    if the hardware table walkers can cache the intermediate results
    of the higher-level blocks in the walk in the walk hardware).


    OK.

    The downsides of IPT pretty much preclude their use in most
    modern operating systems where shared memory between processes
    is common (explicitly -or- implicitly (such as VDSO on linux));
    some of the goals listed as benefits for IPT (e.g. easier whole
    process swapping) are made irrelevant by modern operating
    systems that don't do that. There's a rather incoherent
    description of IPT at geeksforgeeks - I'd not recommend it
    as a useful resource.


    I was thinking of an IPT where one basically keeps stuff from all of the currently running processes in a shared IPT, mostly treating it like a big memory-backed form of the TLB.

    Though, sharing is a concern:
    If you hash entries based on ASID, then there are fewer collisions, but
    no sharing;
    Sharing requires addressing to effectively be plain modulo within the
    areas that can be shared.


    Initially, I had assumed non-hashed modulo indexing, but this does mean
    a potentially higher collision rate if different ASIDs have different
    pages in the same overlapping address ranges.

    Something like 8-way associativity would be "better" here at reducing
    this issue, but more expensive to deal with in hardware than 4-way.
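
    A hedged sketch of that "big memory-backed TLB" flavor of IPT (sizes
    and field widths are illustrative, not the actual BJX2 format); plain
    modulo indexing keeps entries shareable across ASIDs, at the price of
    collisions when different ASIDs map the same address range:

      #include <stdint.h>
      #include <stddef.h>

      #define IPT_SETS 4096               /* total entries = IPT_SETS * IPT_WAYS */
      #define IPT_WAYS 4

      struct ipt_entry {
          uint64_t vpn;                   /* virtual page number                  */
          uint64_t pfn;                   /* physical frame number + permissions  */
          uint16_t asid;                  /* address-space tag                    */
          uint8_t  valid;
      };

      static struct ipt_entry ipt[IPT_SETS][IPT_WAYS];

      /* Plain modulo indexing: shareable across ASIDs; hashing in the ASID
       * would reduce collisions but would also prevent sharing entries.     */
      static size_t ipt_index(uint64_t vpn) { return vpn % IPT_SETS; }

      /* Returns the matching entry, or NULL on a miss (which would fall
       * back to the real page-table walk, or to the guest when nested).     */
      struct ipt_entry *ipt_lookup(uint16_t asid, uint64_t vpn)
      {
          struct ipt_entry *set = ipt[ipt_index(vpn)];
          for (int w = 0; w < IPT_WAYS; w++)
              if (set[w].valid && set[w].vpn == vpn && set[w].asid == asid)
                  return &set[w];
          return NULL;
      }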



    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking
    abstraction and faking the use of hardware page walking for the guest OS's).

    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things
    like handling the timer interrupt, etc...

    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
    full TLB flush on context-switch would suck pretty bad).



    But, doing traditional "every process gets its own address space" takes
    a hit here (no good option other than to limit the task-switching
    frequency, but this may become obvious to the user if the task switching
    is too slow).

    So, for something like a 50MHz core, this might mean, say, allowing a
    process to run for up to 250ms before the preemptive task-switch
    mechanism kicks in. But, 250ms is slow enough to become obvious to a
    user (or, at least, much more so than, say, 100ms).


    Though, probably still better than a purely cooperative scheduler, where
    a process failing to call "thrd_yeild()" effectively locks up the whole
    rest of the system (in my GUI experiments, this effect results in, say,
    Doom effectively locking up the whole GUI until it running the game
    proper, where it then starts calling "thrd_yeild()").

    Though, might make sense to consider also being able to forcibly yield
    threads on system calls and/or in some other C library calls.

    Though, in these cases, will likely still need to start adding mutex
    locks in some areas.




    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. That's an advantage of the exception level scheme, where
    each level has its own set of control registers.

    However, a shortcoming of the initial implementation was that if the
    hypervisor was type II, the hypervisor needed to have a special
    privileged guest to run standard user-mode code[*]. So they
    added (in V8.1) the virtual host extensions (VHE) which allowed
    the hypervisor exception level (EL2) to directly dispatch
    user-mode code to EL0 (with the normal traps from usermode
    to the OS directed to the hypervisor instead of a guest OS). This
    let the hypervisor (e.g. KVM) act as both a hypervisor and
    a guest OS without the context switches required to support a
    privileged guest.

    [*] And also to provide VFIO support for non-SRIOV hardware devices.


    OK.


    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how
    virtualization worked on x86...

    Before AMD added NPT (Nested Page Tables), the hypervisor needed to
    be able to recognize and trap any accesses from the guest OS to
    its own page tables and update the real page tables accordingly.
    To do that, they had several options:
    1) Paravirtualization (i.e. all guest page table ops call the
    hypervisor rather than changing the page tables directly);
    Xen did this.
    2) Write-protecting the page tables and trapping any writes in
    the hypervisor. Difficult to do since the page tables in
    common OS are not allocated contiguously and they are updated
    using normal loads and stores (the HV does know them, however,
    as it can trap writes to CR3 and from there can write-protect
    the entire table in the real page tables).
    3) Binary patch the guest operating system. This was the approach used
    by VMware before AMD introduced NPT.


    OK.

    FWIW:
    One feature of my VUGID/ACLID scheme, is that it is possible to have
    memory be Read/Write to one task, and Read-Only to another task (with a
    trap if they try to write to it), without needing to use separate
    mappings (and thus, both tasks can share the same TLBE's; but will get different access depending on who accesses it).

    Though, I don't expect this scheme would see much adoption in mainline
    OS's, nor likely much adoption in targets based around hardware
    page-table walkers...





    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 01:50:39 2023
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differentiate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Let's look at a hardware example of a nested page table walk,
    using the AMD nested page table feature as a guide. The AMD
    version uses the same PTE format as the non-virtualized page
    tables (which reduces the amount of kernel code required to
    manage the page tables) unlike Intel's EPT.

    Assuming 4k-byte pages in both the primary and nested page tables,
    a page table walk must make 22 memory accesses to satisfy a
    VA to PA translation, versus only four in a non-virtualized
    table walk. This can be reduced to 11 if you have the luxury
    of using 1GB mappings in the nested page table.

    20 of those 22 accesses are subject to caching of various flavors.

    Performing all those accesses in a kernel fault handler would
    consume a great deal more time than a hardware table walker will (particularly
    if the hardware table walkers can cache the intermediate results
    of the higher-level blocks in the walk in the walk hardware).

    The downsides of IPT pretty much preclude their use in most
    modern operating systems where shared memory between processes
    is common (explicitly -or- implicitly (such as VDSO on linux));
    some of the goals listed as benefits for IPT (e.g. easier whole
    process swapping) are made irrelevant by modern operating
    systems that don't do that. There's a rather incoherent
    description of IPT at geeksforgeeks - I'd not recommend it
    as a useful resource.

    If you want to run any form of *nix you must design the center of
    control at/in the CPU[s].....for better or worse.

    Would mean that multi-level interrupt handling would still be needed whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).

    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.



    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. That's an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    However, a shortcoming of the initial implementation was that if the
    hypervisor was type II, the hypervisor needed to have a special
    privileged guest to run standard user-mode code[*]. So they
    added (in V8.1) the virtual host extensions (VHE) which allowed
    the hypervisor exception level (EL2) to directly dispatch
    user-mode code to EL0 (with the normal traps from usermode
    to the OS directed to the hypervisor instead of a guest OS). This
    let the hypervisor (e.g. KVM) act as both a hypervisor and
    a guest OS without the context switches required to support a
    privileged guest.

    [*] And also to provide VFIO support for non-SRIOV hardware devices.


    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    Before AMD added NPT (Nested Page Tables), the hypervisor needed to
    be able to recognize and trap any accesses from the guest OS to
    its own page tables and update the real page tables accordingly.
    To do that, they had several options:
    1) Paravirtualization (i.e. all guest page table ops call the
    hypervisor rather than changing the page tables directly);
    Xen did this.
    2) Write-protecting the page tables and trapping any writes in
    the hypervisor. Difficult to do since the page tables in
    common OS are not allocated contiguously and they are updated
    using normal loads and stores (the HV does know them, however,
    as it can trap writes to CR3 and from there can write-protect
    the entire table in the real page tables).
    3) Binary patch the guest operating system. This was the approach used
    by VMware before AMD introduced NPT.

    Nested Page Tables are the best solution (Fewest SW instructions of
    overhead and total cycles of latency) we currently know of.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 26 16:01:30 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. That's an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    <snip>

    Nested Page Tables are the best solution (Fewest SW instructions of
    overhead and total cycles of latency) we currently know of.

    Indeed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 19:28:13 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    Seems like one might need a mechanism to remap the VM from real CR's to a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. Thats an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    Nice to know someone already blazed the trail.

    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    My 66000 has a HR (Header Register) instruction to access one
    register at a time, but a MM (memory to memory move) instruction
    can be used to swap the entire core-stack {HV-level context switch.}
    MM to a MMI/O space is guaranteed to be ATOMIC across the entire
    transfer.

    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    <snip>

    Nested Page Tables are the best solution (Fewest SW instructions of overhead and total cycles of latency) we currently know of.

    Indeed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sun Nov 26 15:17:06 2023
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things
    like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    Much of the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    At least, excluding something like using B-Tree based page tables...

    It could be made faster, but would likely require doing the TLB miss
    handler in ASM and only saving/restoring the minimum number of registers
    (well, at least until we detect that there will be a page-fault, which
    would still require falling back to a "more comprehensive" handler).


    Any L1 miss penalties from the page-walk itself would likely also apply
    to a hardware page-walker.



    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).


    I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
    With a 16K page size (about 16MB of total coverage), this is basically enough to keep roughly something
    the size of the working set of Doom entirely in the TLB.


    In my past experiments, 16K seemed to be the local optimum for the
    programs tested:
    4K and 8K resulted in higher miss rates;
    32K and 64K resulted in more "internal fragmentation" without much
    reduction in miss rate.


    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.


    This is why my TLB is 4-way set-associative.

    An 8-way TLB would be a lot more expensive, and a fully-associative TLB
    (of nearly any non-trivial size) would be effectively implausible.


    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise. Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.


    Yeah.

    I guess a factor here is mostly defining rules to both allow for and
    control the scope of global pages.

    In my case:
    The TTB register defines an ASID in the high order bits;
    The TLBE also has an ASID;
    The ASID is split into two parts (6 and 10 bits).
    In the ASID, 0 designates global pages,
    but these are broken into "groups",
    so typically a global page is only shared within a given group.

    I am thinking the 6.10 split may have given too many bits to the group,
    and 4.12 or 2.14 might have been better.

    As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
    would not (but would see global pages in ASID 0400).

    So, say, in the current scheme:
    ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
    global address space.


    Where, say, if during a TLB Miss, if a page is marked global, it can be
    put into one of these ASIDs rather than the main ASID of the current
    process (if not in an ASID range which disallows global pages).

    The size of the group will have an effect on miss rate in cases where
    there are a lot of active PIDs though.
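
    A hedged sketch of that match rule (the 6/10 split and the examples
    follow the description above; everything else is illustrative):

      #include <stdint.h>
      #include <stdbool.h>

      #define ASID_GROUP_SHIFT 10
      #define ASID_ID_MASK     ((1u << ASID_GROUP_SHIFT) - 1)  /* low 10 bits */

      /* Does a TLB entry tagged 'tlbe_asid' hit for the current 'cur_asid'? */
      static bool asid_matches(uint16_t tlbe_asid, uint16_t cur_asid)
      {
          if (tlbe_asid == cur_asid)
              return true;                   /* exact per-process match        */

          /* ID 0 within a group marks that group's global pages, visible
           * to every ASID in the same group.                                 */
          if ((tlbe_asid & ASID_ID_MASK) == 0 &&
              (tlbe_asid >> ASID_GROUP_SHIFT) == (cur_asid >> ASID_GROUP_SHIFT))
              return true;

          return false;
      }

      /* Per the examples above: asid_matches(0x0000, 0x03DE) is true,
       * asid_matches(0x0000, 0x045F) is false, and
       * asid_matches(0x0400, 0x045F) is true.                                */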



    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
    full TLB flush on context-switch would suck pretty bad).

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half of the virtual address space is shared by all processes - there's no reason that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global
    pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 26 22:38:05 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    Seems like one might need a mechanism to remap the VM from real CR's to a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. Thats an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    Nice to know someone already blazed the trail.

    Note that a handful of system registers, when accessed
    using the MRS/MSR instructions are self-synchronizing
    with-respect to other state. This, architecturally,
    does _not_ hold when accessed via MMIO.


    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    My 66000 has a HR (Header Register) instruction to access one
    register at a time, but a MM (memory to memory move) instruction
    can be used to swap the entire core-stack {HV-level context switch.}
    MM to a MMI/O space is guaranteed to be ATOMIC across the entire
    transfer.

    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    <snip>

    Nested Page Tables are the best solution (Fewest SW instructions of overhead and total cycles of latency) we currently know of.

    Indeed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 26 22:41:37 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:




    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    My 66000 has a HR (Header Register) instruction to access one
    register at a time, but a MM (memory to memory move) instruction
    can be used to swap the entire core-stack {HV-level context switch.}
    MM to a MMI/O space is guaranteed to be ATOMIC across the entire
    transfer.

    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    Yes, we do that (useful on chips that can also be a PCIe endpoint).

    Even AMD does that with the memory controllers, SMI, I2C/I3C
    etc. appearing as PCI endpoints.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sun Nov 26 22:46:55 2023
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    You'll need to provide more than an assertion for that.


    Much of the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    Then you've a poorly written handler. Note that a hardware table
    walker doesn't need to save any registers.

    <snip>

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtual address space is shared by all processes - there's no reason

    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global pages.

    Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
    last decade.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 23:10:36 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    Nice to know someone already blazed the trail.

    Note that a handful of system registers, when accessed
    using the MRS/MSR instructions are self-synchronizing
    with-respect to other state. This, architecturally,
    does _not_ hold when accessed via MMIO.

    My 66000 architecture specification indicates that when a CPU control
    register is written, the CPU performs as if it saved all
    current state, allowed the write to transpire, and then
    reloaded all the state.

    The "as if" qualifier allows an implementation to take less cycles
    when it recognizes certain situations.

    But, this is one of those things that falls out "for free"* when the
    HW knows how to perform context switches as if thread-state were in
    memory. {{(*) nothing is ever free, but if you have HW context
    switches there are a lot of other things that can be made "as if"}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 23:13:07 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    Yes, we do that (useful on chips that can also be a PCIe endpoint).

    Even AMD does that with the memory controllers, SMI, I2C/I3C
    etc. appearing as PCI endpoints.


    I look at it like this: you are going to need the ability to reach
    into the innermost areas of the chip and look at what is going on.
    The easiest means to get here, today, is via PCIe--JTAG is not that
    useful when there are 1T bits you might want to look at.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 23:29:35 2023
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things >>>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    You'll need to provide more than an assertion for that.

    Service a TLB miss with an L2 TLB is about 6 cycles on my 1-wide machine. Walking the page tables may be as few as 1 access or as many as 24 to
    L2 cache (adding in whatever cache miss latency transpires). With reasonable Table Walk Caching, we may average 30-cycles {Hardware table walk} So,
    at one end we have 6-cycles and at the other we have 24 serially dependent
    L2 misses:: but averaging around 30-cycles.

    Service a timer interrupt:: 10-cycles waiting for thread-state to arrive,
    Cache miss waiting for instructions for the ISR dispatcher, 3 instructions to transfer control to the ISR handler. Another cache miss waiting for instructions.
    At this point the handler needs to tell the timer it has been serviced, and optionally to send it a count of the next time it should go off. Schedule
    a DPC/softIRQ, unwind the handler/dispatcher stack, and return from dispatcher only to end up at DPC/softIRQ.

    I can't see this taking less than 100 cycles.......and vastly more if SW is burdened with doing the save and restore after finding registers to use
    while shuffling data to some stack.


    Much of the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    Then you've a poorly written handler. Note that a hardware table
    walker doesn't need to save any registers.

    Neither does the My 66000 ISR dispatcher. By the time control arrives,
    old thread state has been returned (at least conceptually) to memory
    and the CPU has its new thread state loaded {including IP, Root Pointer, ISRasid, ISR SP, ISR FP if desired, and pointers to things the ISR may
    want quick access to when it receives control}--all reentrantly.

    So, I contend it is not the writing of the ISR handler, it is the architecture which causes the ISR handler to have such a big prologue and epilogue.

    <snip>

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtual address space is shared by all processes - there's no reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global pages.

    Well-done ASIDs prevent the need for TLB flushing except when kicking a thread out of the ASID bucket-list.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to BGB on Sun Nov 26 19:08:09 2023
    On 2023-11-26 4:17 p.m., BGB wrote:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things >>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work.   Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired).   And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    Much of the work in the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    At least, excluding something like using B-Tree based page tables...

    It could be made faster, but would likely require doing the TLB miss
    handler in ASM and only saving/restoring the minimum number of registers (well, at least until we detect that there will be a page-fault, which
    would still require falling back to a "more comprehensive" handler).


    Any L1 miss penalties from the page-walk itself would likely also apply
    to a hardware page-walker.

    A hardware table walker strikes me as not being a large component.
    Although untested yet, the Q+ table walker is only about 1,200 LUTs or
    1% of the FPGA. Given the small size I think it is worth it to have the
    table walker in hardware. It is hard to beat hardware timing wise when
    it does not need to save / restore registers.



    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).


    I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
    With a 16K page size, this is basically enough to keep roughly something
    the size of the working set of Doom entirely in the TLB.


    In my past experiments, 16K seemed to be the local optimum for the
    programs tested:
    4K and 8K resulted in higher miss rates;
    32K and 64K resulted in more "internal fragmentation" without much reduction in miss rate.


    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.


    This is why my TLB is 4-way set-associative.

    An 8-way TLB would be a lot more expensive, and a fully-associative TLB
    (of nearly any non-trivial size) would be effectively implausible.


    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise.   Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.


    Yeah.

    I guess a factor here is mostly defining rules to both allow for and
    control the scope of global pages.

    In my case:
      The TTB register defines an ASID in the high order bits;
      The TLBE also has an ASID;
      The ASID is split into two parts (6 and 10 bits).
        In the ASID, 0 designates global pages
        But they are broken into "groups"
        So typically a global page is only shared within a given group.

    I am thinking the 6.10 split may have given too many bits to the group,
    and 4.12 or 2.14 might have been better.

    As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
    would not (but would see global pages in ASID 0400).

    So, say, in the current scheme:
      ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
    global address space.


    Where, say, if during a TLB Miss, if a page is marked global, it can be
    put into one of these ASIDs rather than the main ASID of the current
    process (if not in an ASID range which disallows global pages).

    The size of the group will have an effect on miss rate in cases where
    there are a lot of active PIDs though.



    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a >>> full TLB flush on context-switch would suck pretty bad).

    That's unnecessarily harsh.   Consider that on Intel/AMD/ARM the
    kernel half
    of the virtual address space is shared by all processes - there's no
    reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global pages.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Robert Finch on Sun Nov 26 21:06:05 2023
    On 11/26/2023 6:08 PM, Robert Finch wrote:
    On 2023-11-26 4:17 p.m., BGB wrote:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other
    things
    like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work.   Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired).   And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to
    service a TLB miss...

    Much of the work in the time spent in the latter is saving/restoring
    the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    At least, excluding something like using B-Tree based page tables...

    It could be made faster, but would likely require doing the TLB miss
    handler in ASM and only saving/restoring the minimum number of
    registers (well, at least until we detect that there will be a
    page-fault, which would still require falling back to a "more
    comprehensive" handler).


    Any L1 miss penalties from the page-walk itself would likely also
    apply to a hardware page-walker.

    A hardware table walker strikes me as not being a large component.
    Although untested yet, the Q+ table walker is only about 1,200 LUTs or
    1% of the FPGA. Given the small size I think it is worth it to have the
    table walker in hardware. It is hard to beat hardware timing wise when
    it does not need to save / restore registers.


    Possible, though, until TLB Miss exceeds ~ 1% or so, it isn't really a
    huge priority either.

    In my current cases, it is generally less than 0.1% of the CPU time, so
    not yet a huge priority.

    Vs, say:
    ~ 1% for the 1kHz timer interrupt
    ~ 0.6% for syscall (down from around 1.2%).

    The optimization I had used for syscalls is mostly N/A for the timer
    interrupt though.


    Had considered inverted page tables as possible as well, but, making
    this faster isn't (yet) a terribly high priority.




    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).


    I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
    With a 16K page size, this is basically enough to keep roughly
    something the size of the working set of Doom entirely in the TLB.


    In my past experiments, 16K seemed to be the local optimum for the
    programs tested:
    4K and 8K resulted in higher miss rates;
    32K and 64K resulted in more "internal fragmentation" without much
    reduction in miss rate.


    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.


    This is why my TLB is 4-way set-associative.

    An 8-way TLB would be a lot more expensive, and a fully-associative
    TLB (of nearly any non-trivial size) would be effectively implausible.


    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise.   Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.


    Yeah.

    I guess a factor here is mostly defining rules to both allow for and
    control the scope of global pages.

    In my case:
       The TTB register defines an ASID in the high order bits;
       The TLBE also has an ASID;
       The ASID is split into two parts (6 and 10 bits).
         In the ASID, 0 designates global pages
         But they are broken into "groups"
         So typically a global page is only shared within a given group.

    I am thinking the 6.10 split may have given too many bits to the
    group, and 4.12 or 2.14 might have been better.

    As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
    would not (but would see global pages in ASID 0400).

    So, say, in the current scheme:
       ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
    global address space.


    Where, say, if during a TLB Miss, if a page is marked global, it can
    be put into one of these ASIDs rather than the main ASID of the
    current process (if not in an ASID range which disallows global pages).

    The size of the group will have an effect on miss rate in cases where
    there are a lot of active PIDs though.



    Big TLB + strategic sharing and ASIDs can help here at least
    (whereas, a
    full TLB flush on context-switch would suck pretty bad).

    That's unnecessaryly harsh.   Consider that on Intel/AMD/ARM the
    kernel half
    of the virtrual address space is shared by all processes - there's no
    reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for
    global pages.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sun Nov 26 20:54:01 2023
    On 11/26/2023 4:46 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things >>>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    You'll need to provide more than an assertion for that.


    If the interrupt's save/restore prolog/epilog by itself burns ~ 500+
    cycles, then the time needed to do a few memory loads, some bit
    twiddling, and an LDTLB, mostly disappears in the noise...


    Granted, it cost more cycles to walk the page-table, compose, and load
    the TLBE, than it does to increment a counter variable, but...

    Nearly all the "expensive parts" will happen similarly in both cases.


    I could get along OK using a B-Tree as a page-table; despite the considerable cost difference between a simple 3-level page table walk
    and a B-Tree walk, this "merely doubled" the average cost of the TLB
    Miss handler...


    Both cases could be faster, but it would likely require writing the ISR handlers in ASM (and not saving/restoring all of the registers).

    And the potential savings are smaller:
    The TLB miss handler may also need to deal with ACL Miss and needs to be
    able to dispatch a Page Fault event;
    The IRQ Miss handler, meanwhile, may need to deal with other types of
    hardware events beyond just timer interrupts (though, at present, the
    timer is the only thing that generates an interrupt, pretty much
    everything else at present is polling IO).




    Much of the work in the time spent in the latter is saving/restoring the
    relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    Then you've a poorly written handler. Note that a hardware table
    walker doesn't need to save any registers.


    Most of this logic is auto-generated by my C compiler.

    __interrupt void __isr_interrupt(void)
    {
    }

    By itself, this is going to save/restore all of the registers and burn roughly
    500 cycles in the process...


    Though, I had considered possibly adding a "__interrupt_min" keyword,
    which would try to minimize the number of registers saved/restored, but
    would not allow the ISR to implement a context switch...

    But, the latter restriction would make it "almost useless", as the main
    two interrupts where it might be useful (the IRQ and TLB Miss handlers),
    would also be naturally excluded as both may need to implement context switches.


    Did end up adding:
    __interrupt_tbrsave void __isr_syscall(void)
    {
    }

    Where "__interrupt_tbrsave" does at least optimize things in the case
    where we *know* we are going to do a context switch.

    In this case, it allows eliminating a few calls:
    isrsave = __arch_isrsave;
    memcpy(
        taskern->ctx_regsave,
        isrsave,
        __ARCH_SIZEOF_REGSAVE__);
    memcpy(
        isrsave,
        taskern2->ctx_regsave,
        __ARCH_SIZEOF_REGSAVE__);

    Which generally ended up burning another ~ 500 clock cycles.



    Note that at 50MHz, one would end up needing to invoke an ISR around
    1000 times per second (at roughly 500 cycles each) to hit 1%.

    Though, with syscalls, it was a little worse. But the new interrupt type
    has helped some.

    Now, syscalls are just behind the timer interrupt (which is at around 1%
    of the CPU, getting triggered at around 1000 times per second).


    The TLB Miss ISR is < 0.1% of the time, mostly by averaging under 100
    TLB misses per second.



    Though, some of this does mean that, despite the BJX2 core running at
    around 3x the clock-speed of an MSP430, I can't run the clock with a
    32kHz timer interrupt without effectively eating the CPU.

    So, this is one area where it seems like the MSP430 has an advantage...


    <snip>

    That's unnecessaryly harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtrual address space is shared by all processes - there's no reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global
    pages.

    Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
    last decade.

    It seems to have added "something" to support global pages, but doesn't
    appear to use an ASID.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Mon Nov 27 15:10:25 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 4:46 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz >>>>> CPU), then the cost of the TLB miss handling is on par with other things >>>>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service >>> a TLB miss...

    You'll need to provide more than an assertion for that.


    If

    Ah, speculation. Got it.


    the interrupt's save/restore prolog/epilog by itself burns ~ 500+
    cycles, then the time needed to do a few memory loads, some bit
    twiddling, and an LDTLB, mostly disappears in the noise...

    Again, if.


    Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
    last decade.

    It seems to have added "something" to support global pages, but doesn't appear to use an ASID.

    They've had global pages since they introduced paging on the i386, IIRC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Nov 29 17:15:00 2023
    On Sun, 12 Nov 2023 20:55:27 +0000, Quadibloc wrote:

    I had tried, with all sorts of ingenious compromises of register spaces
    and the like, to fit all the capabilities I wanted into the opcode space
    of a single version of the instruction set, eliminating the need for
    blocks which contained instructions belonging to alternate versions of
    the instruction set.

    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    At first, when I mulled over this, I came up with multiple ideas to
    address it, each one crazier than the last.

    Seeing, therefore, that this was a difficult nut to crack, and not
    wanting to go down in another wrong direction... instead, I found a way
    to go that seemed to me to be reasonably sensible.

    Go back to uncompromised 32-bit instructions, even though that means
    there are no 16-bit instructions.

    Then, bring back short instructions - effectively 17 bits long - so as
    to have room for full register specifications. This means an alternative block format where 16, 32, 48, 64... bit instructions are all possible.

    *But* because of the room 17-bit short instructions take up in the
    header, the 32-bit instructions are the same regular format as in the
    other case. Not some kind of 33-bit or 35-bit instruction with a new set
    of instruction formats.

    I have now modified the 17-bit shift instructions in the diagram, so that
    they can also apply to all 32 integer registers, and I have corrected the opcodes on the page

    http://www.quadibloc.com/arch/cw0101.htm

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Robert Finch on Fri Dec 1 07:48:45 2023
    On Thu, 30 Nov 2023 11:22:55 -0500, Robert Finch wrote:

    Having a look at the Concertina II ISA. I like the idea of
    pseudo-immediates. All the immediates could be moved to one end of the
    block and then skipped over during instruction fetch.

    That is the general idea, with one minor correction.

    The benefit of pseudo-immediates, like that of ordinary immediates,
    are that they're already available, because they were brought into the
    CPU by instruction fetch.

    They get skipped over by the _next_ step, instruction decode.

    Why a block structure? The goal is to have a situation where
    instruction decode is largely done in parallel for the whole
    block.

    The first step is - is there a header? If not, decode all eight
    32-bit instructions in the block in parallel.

    If so, process the header, and that will directly and immediately
    reveal where every instruction in the block begins, so again the
    next step has all the instructions being decoded in parallel.
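
    As a concreteness aid, a minimal sketch in C of that two-step decode.
    The 256-bit block of eight 32-bit slots comes from the description
    above, while the header marker and the start-position mask are
    assumptions invented for the sketch, not the real Concertina II header
    layout:

    #include <stdint.h>
    #include <stdbool.h>

    #define WORDS_PER_BLOCK 8

    struct block { uint32_t w[WORDS_PER_BLOCK]; };

    static bool has_header(const struct block *b)
    {
        return (b->w[0] >> 28) == 0xF;            /* assumed marker       */
    }

    static unsigned start_mask(const struct block *b)
    {
        return b->w[0] & 0xFF;                    /* assumed header field */
    }

    /* Step 1: header or not?  Step 2: with every start position known up
     * front, all the decoders can be kicked off in the same cycle; the
     * loop below just stands in for that parallel step. */
    int decode_block(const struct block *b)
    {
        unsigned starts = has_header(b) ? start_mask(b) : 0xFFu;
        int issued = 0;

        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            if (starts & (1u << i))
                issued++;                         /* decode_insn(b->w[i]) */
        return issued;
    }

    int main(void)
    {
        struct block plain = { .w = { 0x12345678 } };   /* no header marker */
        return decode_block(&plain) == 8 ? 0 : 1;
    }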

    The header allows the length that immediates would add to instructions
    to be in the pseudo-immediates instead, avoiding another potential
    complication to instruction decoding.

    In addition, having headers means that the instruction set can be
    expanded or made flexible without it being possible to change the
    mode of the CPU to cause it to read existing instruction code the
    wrong way. Any modifications to how instructions are to be interpreted
    are right there in the block header, so malware that can't alter
    code can't work around that by changing how it is to be read.

    Among the features the headers allow to be added are VLIW features,
    such as instruction predication and explicitly indicating which
    instructions can execute in parallel. This allows high-performance
    but lightweight (non-OoO) implementations if desired.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Dec 1 18:37:17 2023
    Quadibloc wrote:

    On Thu, 30 Nov 2023 11:22:55 -0500, Robert Finch wrote:

    Having a look at the ConcertiaII ISA. I like the idea of
    pseudo-immediates. All the immediates could be moved to one end of the
    block and then skipped over during instruction fetch.

    That is the general idea, with one minor correction.

    The benefit of pseudo-immediates, like that of ordinary immediates,
    are that they're already available, because they were brought into the
    CPU by instruction fetch.

    They get skipped over by the _next_ step, instruction decode.

    Why a block structure? The goal is to have a situation where
    instruction decode is largely done in parallel for the whole
    block.

    What if you had the advantages of the block header without the
    cost of the block header ??

    The first step is - is there a header? If not, decode all eight
    32-bit instructions in the block in parallel.

    Why not decode assuming there is a block header and also decode as
    if there were not a block header. Then you can multiplex (choose)
    later which one prevails. This puts the choice at at least 4 gates
    of delay into the decode cycle.

    If so, process the header, and that will directly and immediately
    reveal where every instruction in the block begins, so again the
    next step has all the instructions being decoded in parallel.

    You then have to route the instructions to the decoders. Are your
    decoders expensive enough in a wide implementation that this matters?
    The alternative is to have a no-header decoder running in parallel
    with a header decoder and choose which to use.

    The header allows the length that immediates would add to instructions
    to be in the pseudo-immediates instead, avoiding another potential complication to instruction decoding.

    In addition, having headers means that the instruction set can be
    expanded or made flexible without it being possible to change the
    mode of the CPU to cause it to read existing instruction code the
    wrong way. Any modifications to how instructions are to be interpreted
    are right there in the block header, so malware that can't alter
    code can't work around that by changing how it is to be read.

    You MAY be able to alter the headers later in the architecture's life,
    but ultimately you sacrifice forward compatibility.

    Among the features the headers allow to be added are VLIW features,
    Why would you want this ??
    such as instruction predication and explicitly indicating which
    instructions can execute in parallel.
    HW does not seem to have much trouble doing this already.
    This allows high-performance
    but lightweight (non-OoO) implementations if desired.
    Have any GBnOoO machines been successful ?

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Dec 1 21:58:46 2023
    BGB wrote:

    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    I have been thinking about this for a while::

    It seems to me that if one wants a robust system, the HyperVisor must
    support various serviced-HyperVisors. This second (less privileged HV)
    is, in essence, a HV that can crash without allowing the whole system
    to crash {just like virtual machines can crash and take their applications
    with them.}

    Secondly:: Running an ISR at HV level is a privilege inversion issue,
    the HV has to look at data structures maintained by a (not necessarily trustable) Guest OS--possibly corrupting the HV itself.

    So while a 3 level system gives you most of what you want in a modern
    system, it still has its own problems--that can be solved with a 4th.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Dec 1 21:15:59 2023
    On Wed, 29 Nov 2023 17:15:00 +0000, Quadibloc wrote:

    I have now modified the 17-bit shift instructions in the diagram, so
    that they can also apply to all 32 integer registers, and I have
    corrected the opcodes on the page

    http://www.quadibloc.com/arch/cw0101.htm

    And now I have completed the process of getting back to where I was before,
    by adding in the page

    http://www.quadibloc.com/arch/cw0102.htm

    which describes the instructions longer than 32 bits.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Dec 1 22:10:39 2023
    On Fri, 01 Dec 2023 18:37:17 +0000, MitchAlsup wrote:

    Why not decode assuming there is a block header and also decode as if
    there were not a block header. Then you can multiplex (choose) later
    which one prevails. This puts the choice at at least 4 gates of delay
    into the decode cycle.

    You are quite correct that this is a possible technique to speed up an implementation of the architecture, at the cost of using extra electricity
    to do work that will be thrown away later.

    I described things in terms of a naive implementation to make the concepts easier to understand.

    (quoting me)
    If so, process the header, and that will directly and immediately
    reveal where every instruction in the block begins, so again the next
    step has all the instructions being decoded in parallel.

    You then have to route the instructions to the decoders. Are your
    decoders expensive enough in a wide implementation that this matters?
    The alternative is to have a no-header decoder running in parallel with
    a header decoder and choose which to use.

    I wasn't thinking of routing instructions to decoders. Instead, the
    decoders simply sit behind the physical positions in the block where
    an instruction could begin, and the header (or the absence of a header)
    tells them to start decoding. Or, in the type of fast implementation
    you describe, to continue with decoding.

    You MAY be able to alter the headers later in the architecture's life,
    but ultimately you sacrifice forward compatibility.

    As long as I can avoid sacrificing *backwards* compatibility.

    Among the features the headers allow to be added are VLIW features,

    Why would you want this ??

    So the architecture could be used for very cheap embedded systems,
    in addition to heavyweight desktops and servers.


    This allows high-performance but lightweight (non-OoO) implementations
    if desired.

    Have any GBnOoO machines been successful ?

    Ah, you don't mean out-of-order 68000 machines. Of which there was only
    one, the 68050. You mean "great big not out of order" machines. Of which
    there were none, the design being, no doubt, so outrageous as to not
    even deserve the chance to fail, since it would have no chance to succeed.

    That's a very valid point, but any ISA for a "great big" machine can have
    a subset which no longer requires a "great big" machine.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Dec 1 23:12:14 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:

    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    I have been thinking about this for a while::

    It seems to me that if one wants a robust system, the HyperVisor must
    support various serviced-HyperVisors. This second (less privileged HV)
    is, in essence, a HV that can crash without allowing the whole system
    to crash {just like virtual machines can crash and take their applications with them.}

    Generally there must be a privilege level more privileged than
    hypervisor, which controls the hardware - particularly if one
    intends to 'schedule' multiple independent (not nested) hypervisors.

    Then there is a requirement in the cloud for a nested hypervisor; this
    can be done with a paravirtualized hypervisor, at some performance
    cost, or with a true hardware supported nesting capability.


    Secondly:: Running an ISR at HV level is a privilege inversion issue,
    the HV has to look at data structures maintained by a (not necessarily trustable) Guest OS--possibly corrupting the HV itself.

    Modern interrupt virtualization mechanisms (e.g. ARMv8 GICv4.1)
    handle guest interrupts completely in the hardware, with no
    hypervisor intervention involved in the most common cases
    (e.g. software generated interprocessor interrupts, virtual
    timer interrupts, message signaled interrupts, et alia).


    So while a 3 level system gives you most of what you want in a modern
    system, it still has its own problems--that can be solved with a 4th.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Dec 2 02:15:35 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:

    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    I have been thinking about this for a while::

    It seems to me that if one wants a robust system, the HyperVisor must support various serviced-HyperVisors. This second (less privileged HV)
    is, in essence, a HV that can crash without allowing the whole system
    to crash {just like virtual machines can crash and take their applications with them.}

    Generally there must be a privilege level more privileged than
    hypervisor, which controls the hardware - particularly if one
    intends to 'schedule' multiple independent (not nested) hypervisors.

    So, call my HV System Manage Mode and call my Guest HV the HyperVisor.

    Then there is a requirement in the cloud for a nested hypervisor; this
    can be done with a paravirtualized hypervisor, at some performance
    cost, or with a true hardware supported nesting capability.


    Secondly:: Running an ISR at HV level is a privilege inversion issue,
    the HV has to look at data structures maintained by a (not necessarily trustable) Guest OS--possibly corrupting the HV itself.

    Modern interrupt virtualization mechanisms (e.g. ARMv8 GICv4.1)
    handle guest interrupts completely in the hardware, with no
    hypervisor intervention involved in the most common cases
    (e.g. software generated interprocessor interrupts, virtual
    timer interrupts, message signaled interrupts, et alia).

    My 66000 has interrupt tables similar to RISC-V (in that you can have
    as many tables as you want, and any table can interrupt to any priority.)

    Unlike RISC-V, My 66000's LLC has a little machine which operates the
    tables, so devices raise an interrupt by sending a message to the
    little machine, which sets a bit in the table. When enabled,
    a core operating at a lower priority snarfs the table update
    and requests the highest priority pending and enabled interrupt
    (a getInterrupt "bus transaction"). When the response arrives and
    the core is still operating at a lower priority level, the core responds
    with a claimInterrupt (or a putInterrupt) "bus transaction" and
    only at this point stops running the old context code and context
    switches to the ISR dispatcher.

    Cores send IPIs by using the little machine.....
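
    Written out as a self-contained C sketch for readability: the
    getInterrupt and claimInterrupt transaction names are the ones used
    above, but the table layout, the priority model, and every function
    signature here are invented for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define NSRC 64
    typedef struct {
        uint64_t pending;          /* one bit per interrupt source */
        uint64_t enabled;
        int      priority[NSRC];   /* priority of each source      */
    } irq_table_t;

    /* A device (or another core) raises an interrupt by sending a message
     * to the "little machine", which just sets a bit in the table. */
    static void little_machine_raise(irq_table_t *t, int src)
    {
        t->pending |= 1ull << src;
    }

    /* getInterrupt "bus transaction": highest-priority pending+enabled source. */
    static int get_interrupt(const irq_table_t *t)
    {
        int best = -1;
        for (int s = 0; s < NSRC; s++)
            if ((t->pending & t->enabled) >> s & 1)
                if (best < 0 || t->priority[s] > t->priority[best])
                    best = s;
        return best;
    }

    /* Core side: only if it is still running at lower priority than the
     * offered source does it claim it and switch to the ISR dispatcher;
     * otherwise the old context keeps running. */
    static int core_take_interrupt(irq_table_t *t, int core_priority)
    {
        int src = get_interrupt(t);
        if (src >= 0 && t->priority[src] > core_priority) {
            t->pending &= ~(1ull << src);        /* claimInterrupt */
            return src;                          /* enter the ISR dispatcher */
        }
        return -1;
    }

    int main(void)
    {
        irq_table_t t = { .enabled = ~0ull };
        t.priority[5] = 7;
        little_machine_raise(&t, 5);
        printf("claimed source %d\n", core_take_interrupt(&t, 3));
        return 0;
    }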


    So while a 3 level system gives you most of what you want in a modern >>system, it still has its own problems--that can be solved with a 4th.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to MitchAlsup on Sat Dec 2 17:35:36 2023
    MitchAlsup wrote:

    Chris M. Thomasson wrote:

    On 12/1/2023 6:15 PM, MitchAlsup wrote:


    Cores send IPIs by using the little machine.....

    Fwiw, how would your system handle this function from Microsoft:

    https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

    Or, would that be kernel?

    Core could send multiple IPIs in a loop or core could send a single IPI
    to a kernel function that performs the loop.

    Since performing 1 IPI requires 2 STs and does not require waiting on a response, it is probably easier if the core does the loop.
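
    A sketch of the first option (the requesting core does the loop
    itself). The send_ipi() interface and the per-core acknowledgement are
    hypothetical, and here send_ipi() simply runs the handler inline so the
    sketch stays self-contained; on real hardware it would be the two-store
    IPI described above:

    #include <stdatomic.h>
    #include <stdio.h>

    #define NCORES 8

    static _Atomic int acked[NCORES];

    /* Runs on the target core: execute a full barrier, then acknowledge. */
    static void barrier_handler(int target)
    {
        atomic_thread_fence(memory_order_seq_cst);
        atomic_store(&acked[target], 1);
    }

    static void send_ipi(int target)         /* stand-in for the 2-store IPI */
    {
        barrier_handler(target);
    }

    /* Requesting core pokes every other core, then waits for the acks. */
    static void flush_process_write_buffers(int self)
    {
        for (int c = 0; c < NCORES; c++)
            if (c != self) {
                atomic_store(&acked[c], 0);
                send_ipi(c);
            }
        for (int c = 0; c < NCORES; c++)
            while (c != self && !atomic_load(&acked[c]))
                ;                             /* spin until that core has fenced */
    }

    int main(void)
    {
        flush_process_write_buffers(0);
        puts("all other cores have executed a barrier");
        return 0;
    }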

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Dec 2 17:34:03 2023
    Chris M. Thomasson wrote:

    On 12/1/2023 6:15 PM, MitchAlsup wrote:


    Cores send IPIs by using the little machine.....

    Fwiw, how would your system handle this function from Microsoft:

    https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

    Or, would that be kernel?

    Core could send multiple IPIs in a loop or core could send a single IPI
    to a kernel function that performs the loop.

    Since performing 1 IPI requires 2 STs and does not require waiting on a response, it is probably easier if the core does the loop.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sat Dec 2 20:39:08 2023
    Paul A. Clayton wrote:

    On 11/24/23 9:43 AM, Robert Finch wrote:
    [snip]
    There is a lot of value in having a unique architecture.

    A uniquely difficult architecture like x86 increases the barrier
    to competition both from patents and organizational knowledge and
    tools. While MIPS managed to suppress clones with its patent on
    unaligned loads (please correct any historical inaccuracy), Intel
    was better positioned to discourage software-compatible
    competition — and not just financially.

    In Intel's case one must not just execute the x86 ISA but also be
    bug-for-bug compatible. AMD K5 was essentially sacrificed to find
    that bug-for-bug compatibility--that is they found the test vector
    set that defined x86.

    I suspect that the bad reputation of x86 among computer architects
    — especially with the biases from Computer Architecture: A
    Quantitative Approach which substantially informs computer
    architecture education — might also make finding talent more
    difficult. However, the prominence of the x86 vendors (working on
    something that actually gets produced and used by millions of
    people is gratifying) and the challenge of working on a difficult architecture would also attract talent (and perhaps more qualified
    talent).

    The x86
    has had a lot of things bolted on to it. It has adapted over time.
    Being able to see how things have changed is valuable.

    x86 provides more than one lesson on change/project management.
    The binary lock-in advantage of x86 makes architectural changes
    more challenging. While something like the 8080 to 8086 "assembly
    compatible" transition might have been practical and long-term
    beneficial from an engineering perspective, from a business
    perspective such would validate binary translation, reducing the
    competitive barriers.

    (Itanium showed that mediocre hardware translation between x86 and
    a rather incompatible architecture (and microarchitecture) would
    have been problematic even if native Itanium code had competitive

    So did Transmeta.

    performance. This seems reminiscent of the Pentium Pro's "issue"
    with 16-bit code; both seem to have been at least partially
    marketing failures. On the other hand, ARM designed a 64-bit
    architecture that is only moderately compatible with the 32-bit
    architecture — flags being one example of compatibility — and 32-
    bit support is now being mostly left behind for 64-bit
    implementations.)
    ----------------
    MIPS (even with its delayed branches, lack of variable length
    encoding, etc.) would probably be a better architecture in 2023
    than x86 was around 2010. The delayed branches might have been
    deprecated, VLE might have been added in an additional mode, and
    eventually complex-but-useful instructions would probably have
    been added. (MIPS would almost certainly have caught SIMD widening
    disease and had other temporarily useful extension additions, but
    the tradeoffs in 1985 were closer to those of 2023.)

    The early 00's were a good time to avoid being an architect--SIMD
    was very appealing and is now showing its age.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to A. Clayton on Sat Dec 2 20:53:00 2023
    In article <ukfvqu$2flaf$1@dont-email.me>, paaronclayton@gmail.com (Paul
    A. Clayton) wrote:

    This seems reminiscent of the Pentium Pro's "issue" with 16-bit
    code; both seem to have been at least partially marketing failures.

    For the scientific and technical markets, the Pentium Pro was just fine.
    I'm not sure you can call customers' desire to run 16-bit software on
    Pentium Pro a marketing failure. It was always going to happen, and if marketing people thought they could prevent it, they were fooling
    themselves.

    Mind you, these were the same marketing teams who a few years later
    wanted the Pentium 4 "NetBurst" microarchitecture, specifically because
    it would be introduced at high clock speeds. They'd been in a clockspeed
    battle with AMD for about two years, and sticking to any one thing that
    long means marketing people treat it as absolute truth.

    On the other hand, ARM designed a 64-bit architecture that is
    only moderately compatible with the 32-bit architecture - flags
    being one example of compatibility - and 32-bit support is now
    being mostly left behind for 64-bit implementations.

    Aarch64 has essentially no concessions to aarch32 compatibility as far as
    I can see. Emulating 32-bit on 64-bit would be painful because of
    predicated instructions: your dynamic binary translator has a hard time
    being sure that flags won't be used and thus need not be evaluated. The
    easy transition is, I think, due to the later date, and the small amount
    of aarch32 software written in assembler that's still in use.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sat Dec 2 21:20:59 2023
    On Fri, 01 Dec 2023 22:10:39 +0000, Quadibloc wrote:
    On Fri, 01 Dec 2023 18:37:17 +0000, MitchAlsup wrote:

    Have any GBnOoO machines been successful ?

    Ah, you don't mean out-of-order 68000 machines. Of which there was only
    one, the 68050. You mean "great big not out of order" machines. Of which there were none, the design being, no doubt, so outrageous as to not
    even deserve the chance to fail, since it would have no chance to
    succeed.

    That's a very valid point, but any ISA for a "great big" machine can
    have a subset which no longer requires a "great big" machine.

    Also, as you are well aware, Intel has included both "performance" and "efficiency" cores in its latest generations of CPUs, similar to the
    big.LITTLE architecture used for some ARM processors.

    And then AMD came along, with its own twist on this feature: their
    "little" processors aren't so little, having the same circuitry as the
    big ones, but laid out more compactly so they have to have a lower
    clock speed. That way, they're not so slow as to be a total waste in
    normal full-power operation, and thus add to the total core count.

    Well, another way to address the efficiency/little cores being a
    waste of space would be to reduce the waste by making them smaller.
    If their purpose is to save power consumption when nobody's using the
    computer, to just keep the OS alive while it waits for the keyboard
    or the mouse to ask it to do something... then they should be made
    really little.

    Like Intel's _original_ Atom processors, which were in-order. As they
    were standalone processors for light and cheap laptops, Intel made
    the right decision to switch to out-of-order for later versions, so
    they wouldn't be so slow as to be useless.

    But in-order efficiency cores that are there when the demands are very
    low? With features that let one optimize code for them, though?

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sun Dec 3 10:05:27 2023
    On Fri, 01 Dec 2023 21:15:59 +0000, Quadibloc wrote:
    On Wed, 29 Nov 2023 17:15:00 +0000, Quadibloc wrote:

    I have now modified the 17-bit shift instructions in the diagram, so
    that they can also apply to all 32 integer registers, and I have
    corrected the opcodes on the page

    http://www.quadibloc.com/arch/cw0101.htm

    And now I have completed the process of getting back to where I was
    before,
    by adding in the page

    http://www.quadibloc.com/arch/cw0102.htm

    which describes the instructions longer than 32 bits.

    Two further changes have been made.

    On the first page of the description of the ISA, I have noted that
    when VLIW features are used, indicating that instructions may be
    executed in parallel must not change the result of a calculation,
    since some implementations may ignore that directive.

    On the page about 17-bit instructions, I have changed the format
    of 128-bit floating-point numbers; instead of being a 128-bit version
    of temporary real, with more significand bits, I've added one exponent
    bit, subtracting one significand bit.

    The reason for this is to allow, with a 130-bit internal form, the
    *standard* 128-bit form for IEEE 754 floating-point numbers, which
    does have a hidden first bit, to be supported.
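
    For reference, the standard IEEE 754 binary128 interchange format packs
    1 sign bit, a 15-bit exponent (bias 16383) and 112 stored fraction
    bits, with the leading significand bit hidden; making that bit explicit
    gives a 113-bit significand, so an internal form with it explicit needs
    1 + 15 + 113 = 129 bits, and one more exponent bit makes the 130
    mentioned above. A small C sketch of the field extraction:

    #include <stdint.h>
    #include <stdio.h>

    /* Field layout of IEEE 754 binary128: 1 sign bit, 15 exponent bits
     * (bias 16383), 112 stored fraction bits, hidden leading bit. */
    struct bin128_fields {
        unsigned  sign;        /* 1 bit                            */
        unsigned  exponent;    /* 15 bits                          */
        uint64_t  frac_hi;     /* top 48 of the 112 fraction bits  */
        uint64_t  frac_lo;     /* low 64 of the 112 fraction bits  */
    };

    static struct bin128_fields unpack(uint64_t hi, uint64_t lo)
    {
        struct bin128_fields f;
        f.sign     = (unsigned)(hi >> 63);
        f.exponent = (unsigned)((hi >> 48) & 0x7FFF);
        f.frac_hi  = hi & 0xFFFFFFFFFFFFull;      /* 48 bits */
        f.frac_lo  = lo;                          /* 64 bits */
        return f;
    }

    int main(void)
    {
        /* 1.0 in binary128: exponent field = 16383, fraction = 0. */
        struct bin128_fields one = unpack(0x3FFF000000000000ull, 0);
        printf("sign=%u exp=%u (bias 16383)\n", one.sign, one.exponent);
        return 0;
    }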

    In addition to having the sixteen even-numbered registers available for
    such numbers, since 130 bits is so frustratingly shorter than 256 bits,
    I also make the registers with numbers of the form 4n+1 available, using
    the same scheme as I will use for 128-bit Decimal Floating Point in the
    IBM format. Tweaked slightly to allow internal forms of up to 168 bits
    instead of up to 160 bits.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Sun Dec 3 14:36:37 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Have any GBnOoO machines been successful ?

    Great Big in-order machines (why write non-OoO?):

    Multiflow has 7, 14, or 28 instructions per cycle, but of course its
    target market is supercomputing, i.e., throughput computing, and at
    the time the competition was pipelined SIMD (Cray etc.). Was it
    successful? Probably not that much.

    The 21164(a) is 4-wide, and was successful in its prime, but there was
    no OoO competition at the time. When the Pentium Pro appeared at
    200MHz, it took the SPECint95_base crown from the 300MHz 21164 <https://en.wikipedia.org/wiki/Alpha_21164#Performance>. Given that
    it held SPECint and SPECfp performance crowns for some time, one can
    consider it to be successful. Also, I think it was commercially
    somewhat successful. The 21164 including the 21164a also had a much
    longer lifespan than its predecessor, mainly due to the 21264 being
    late.

    The Larrabee (which eventually resulted in Knights Ferry) is a
    two-wide in-order design, but with very wide (512-bit) SIMD units.
    One probably cannot call it a success.

    Since the victory of OoO, people mostly limited themselves to two-wide
    in-order machines, probably because any more width is mostly wasted
    given the limited amount of instruction-level parallelism within a
    basic block. If people wanted more, they usually went to OoO (e.g., Bonnell->Silvermont, Knight's Ferry->Knight's Corner).

    One exception is ARM, which stayed with in-order in A53, A55, A510,
    A520, and switched from 2-wide to 3-wide in the A55->A510 transition,
    but interestingly went from 3 to 2 ALUs in the A510->A520 transition
    (but is still generally 3-wide).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Sun Dec 3 15:55:41 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Fri, 01 Dec 2023 22:10:39 +0000, Quadibloc wrote:
    That's a very valid point, but any ISA for a "great big" machine can
    have a subset which no longer requires a "great big" machine.

    Also, as you are well aware, Intel has included both "performance" and "efficiency" cores in its latest generations of CPUs, similar to the big.LITTLE architecture used for some ARM processors.

    Which interestingly leads to recent Intel desktop and laptop CPUs not supporting AVX-512, even on CPUs that have only the performance cores
    enabled, even though the P-cores have AVX-512 implemented.

    Likewise, big.LITTLE has led to ARM cores all only supporting
    128-bit-wide SVE, because wider SVE would be too costly on the LITTLE
    cores. It will be interesting to see what Apple does.

    Well, another way to address the efficiency/little cores being a
    waste of space would be to reduce the waste by making them smaller.
    If their purpose is to save power consumption when nobody's using the computer, to just keep the OS alive while it waits for the keyboard
    or the mouse to ask it to do something... then they should be made
    really little.

    That's not their primary purpose, or there would only be one such
    core. Intel has put 16 E-cores on Raptor Lake in order to be able to
    boast 24 cores (more than AMD's desktop offering) and 32 threads (same
    as AMD) total. And on tasks that can benefit from that many cores,
    such as some benchmarks, they are actually quite beneficial.

    ARM claims that their LITTLE in-order cores serve that purpose, but
    then, why put 4 or more of them on a smartphone SoC? They are
    certainly not more energy-efficient than the OoO brethren except at
    their lowest performance point (and then not by much).

    Apple uses OoO efficiency cores that are about as fast as a Cortex-A76
    (in case of M1). Apparently they have no problem with using an OoO
    core to "keep the OS alive while it waits"; given that modern cores
    use very little power while they wait, that's not surprising.

    But in-order efficiency cores that are there when the demands are very
    low?

    Intel runs the management engine on such a core, and AMD runs its
    equivalent on several ARM cores, but these work outside of the realm
    covered by the OS.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Sun Dec 3 19:45:39 2023
    On Sun, 03 Dec 2023 14:36:37 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Great Big in-order machines (why write non-OoO?):

    Of course, no doubt he was thinking of the Itanium, which was one
    of the most resounding failures in recent years.

    If one goes far enough back, of course, there's the IBM System/360
    Model 85. Unlike the Model 91, it was in-order, yet it offered more performance! This was because it had one thing the Model 91 didn't,
    a cache.

    The Model 85 was actually a failure for IBM in sales terms, but as
    that was because of an economic slump at the time it came out, IBM
    was not deterred from re-using the design, with a few additions and
    tweaks, in the IBM System/370 Model 165 and 168 a few years later.
    And those systems were quite successful.

    I've already noted that an in-order version of a great big architecture
    might make for nice lightweight efficiency cores in a big.LITTLE type
    design. But making those cores in-order has another nice benefit.

    No Spectre. No Meltdown. So, when the computer is actually active,
    these cores, instead of being a total waste of space, could be put
    to use as a ready-made sandbox for executing code sourced from the
    Internet.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Dec 3 20:18:30 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 14:36:37 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Great Big in-order machines (why write non-OoO?):

    Of course, no doubt he was thinking of the Itanium, which was one
    of the most resounding failures in recent years.

    Itanic, Multiflow, i860, and now probably Mill.

    If one goes far enough back, of course, there's the IBM System/360
    Model 85. Unlike the Model 91, it was in-order, yet it offered more performance! This was because it had one thing the Model 91 didn't,
    a cache.

    The Model 85 was actually a failure for IBM in sales terms, but as
    that was because of an economic slump at the time it came out, IBM
    was not deterred from re-using the design, with a few additions and
    tweaks, in the IBM System/370 Model 165 and 168 a few years later.
    And those systems were quite successful.

    Model 85 and 91 were combined into 195 but this still failed compared
    to CDC 7600.

    I've already noted that an in-order version of a great big architecture
    might make for nice lightweight efficiency cores in a BIG.little type
    design. But making those cores in-order has another nice benefit.

    When you deconstruct a GBOoO machine into a LBIO machine you invariably
    lose issue width, which takes the pressure off {TLBs, Caches, Bus, ...};
    the pipeline shrinks in stages, taking even more pressure off those.

    No Spectre. No Meltdown. So, when the computer is actually active,
    these cores, instead of being a total waste of space, could be put
    to use as a ready-made sandbox for executing code sourced from the
    Internet.

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO
    by following one simple rule:: No microarchitectural changes until
    the causing instruction retires. AND you can do this without losing performance.

    The existing camp of designs chooses not to.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Sun Dec 3 13:02:35 2023
    On 12/3/2023 12:18 PM, MitchAlsup wrote:

    snip

    Is my assessment (interspersed below) of the effects of this correct?


    When you deconstruct a GBOoO machine into a LBIO machine you invariably
    lose issue width,

    Which reduces performance


    which takes the pressure off {TLBs, Caches, Bus, ...}

    Which allows savings in ports, etc., thus further reducing gate count,
    thus chip size, thus cost.

    the pipeline shrinks in stages,

    Which reduces the cost of mis-predicted branches, thus counterbalancing
    "some" of the performance loss from eliminating OoO. Also, further
    reduces gate count.


    taking even more pressure off those.

    Overall, while the direction of the area/cost reduction, and performance
    loss are clear, the magnitude of these is more difficult to predict
    before actually doing it.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Dec 3 22:19:40 2023
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    Model 85 and 91 were combined into 195 but this still failed compared to
    CDC 7600.

    I definitely remembered the Model 195.

    Even if the CDC 7600 outsold it, though, in one way the Model 195 was
    an enormous success. Its microarchitecture ended up being, in general
    terms, copied by the Pentium Pro and the Pentium II.

    So, today, all computers are made this way - OoO pipeline plus cache.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Sun Dec 3 22:34:56 2023
    Stephen Fuld wrote:

    On 12/3/2023 12:18 PM, MitchAlsup wrote:

    snip

    Is my assessment (interspersed below) of the effects of this correct?


    When you deconstruct a GBOoO machine into a LBIO machine you invariably
    lose issue width,

    Which reduces performance


    which takes the pressure off {TLBs, Caches, Bus, ...}

    Which allows savings in ports, etc., thus further reducing gate count,
    thus chip size, thus cost.

    the pipeline shrinks in stages,

    Which reduces the cost of mis-predicted branches, thus counterbalancing "some" of the performance loss from eliminating OoO. Also, further
    reduces gate count.


    taking even more pressure off those.

    Overall, while the direction of the area/cost reduction, and performance
    loss are clear, the magnitude of these is more difficult to predict
    before actually doing it.

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel simulation indicated ½ the performance at 1/12 the area and likely 1/10 the power.
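
    Spelled out, those ratios say roughly 6x the performance per unit area
    and 5x the performance per watt for the 1-wide core; a one-liner to
    check (the 1/2, 1/12 and 1/10 figures are the ones quoted above):

    #include <stdio.h>

    /* Ratios quoted above for the 1-wide x86-64 study vs. the big core. */
    int main(void)
    {
        double perf = 0.5, area = 1.0 / 12.0, power = 1.0 / 10.0;
        printf("perf/area: %.1fx   perf/watt: %.1fx\n", perf / area, perf / power);
        return 0;
    }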

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Dec 3 22:39:26 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    Model 85 and 91 were combined into 195 but this still failed compared to
    CDC 7600.

    I definitely remembered the Model 195.

    Even if the CDC 7600 outsold it, though, in one way the Model 195 was
    an enormous success. Its microarchitecture ended up being, in general
    terms, copied by the Pentium Pro and the Pentium II.

    So, today, all computers are made this way - OoO pipeline plus cache.

    Depends on how accurately you think copying the 91's reservation stations counts.
    Most machines today implement value-free reservation stations because they
    are 1/8 the area and somewhat faster. Tomasulo used value-capturing reservation stations.

    Also note: the computer Luke was working on a few years ago used Scoreboard technology rather than reservation station technology.....

    Several of the very deep window machines use a dispatch-stack-like pre-scheduler before routing instructions to the FV reservation stations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Dec 3 23:25:38 2023
    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel
    simulation indicated ½ the performance at 1/12 the area and likely 1/10
    the power.

    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs, which we use
    because we don't have anything better, I'm not surprised
    you've expressed the wish that more research be done on
    using multiple CPUs in parallel.

    Myself, I don't believe the parallel programming problem
    is solvable; there will always be too many problems that
    have critical serial parts that are too big. But that
    doesn't mean that I think we're doomed to require big
    hot CPUs that hog electricity.

    Because the problem of writing small, bloat-free programs
    _is_ solvable. Back in the days when all we had was Windows
    3.1 running on 386 and 486 processors, that was enough to
    do nearly everything we do with computers today.

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    Now, it isn't in the interest of CPU makers and others in
    the computer industry for users not to be strongly motivated
    to run out and buy newer and faster processors every year
    or two. The death of Dennard Scaling, and the tapering off
    of Moore's Law, however, are taking the wind out of the sails
    of that. Eventually, the improvements will be so minor that
    the CPU makers won't have enough *money* to fund fabs that
    probe the ultimate limits of feature size any longer.

    For some users, CPUs made of some exotic material beyond
    silicon that was 10x as fast... but, because of yield
    issues, could only be used to make small in-order CPUs,
    so the CPUs are only 5x as fast... would be worth almost
    any price. Because the parallel programming problem hasn't
    been solved, whether or not it can be.

    And I don't begrudge them such a development, as it would
    be a step towards making better performance available to
    everyone, as demand drives research into bringing costs
    down.

    What the rest of us really need is lighter-weight software
    that isn't driven by the interests of computer makers instead
    of computer users.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Quadibloc on Sun Dec 3 16:01:43 2023
    On 12/3/2023 3:25 PM, Quadibloc wrote:
    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel
    simulation indicated ½ the performance at 1/12 the area and likely 1/10
    the power.

    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs, which we use
    because we don't have anything better, I'm not surprised
    you've expressed the wish that more research be done on
    using multiple CPUs in parallel.

    Myself, I don't believe the parallel programming problem
    is solvable; there will always be too many problems that
    have critical serial parts that are too big. But that
    doesn't mean that I think we're doomed to require big
    hot CPUs that hog electricity.

    Because the problem of writing small, bloat-free programs
    _is_ solvable. Back in the days when all we had was Windows
    3.1 running on 386 and 486 processors, that was enough to
    do nearly everything we do with computers today.

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    While I absolutely agree that there is too much resources spent on
    "graphical pizzazz", and while you could run many/most of the same
    programs, that doesn't mean there is no user benefit from faster CPUs.
    For example, you probably could run some simulations, fluid dynamics,
    finite element analysis, etc. but you were severely limited in the size
    of the program you could run in an acceptable amount of elapsed time.
    And applications like servers of various flavors certainly benefit from
    faster CPUs, as you need fewer of them. Not to mention driving graphics
    at more realistic resolutions, etc.

    So, no, it wasn't simply the greed of CPU makers that drove us to higher performance systems.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Dec 4 00:08:19 2023
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Dec 4 18:58:51 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Dec 4 19:54:10 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel
    simulation indicated ½ the performance at 1/12 the area and likely 1/10
    the power.

    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs, which we use
    because we don't have anything better, I'm not surprised
    you've expressed the wish that more research be done on
    using multiple CPUs in parallel.

    Myself, I don't believe the parallel programming problem
    is solvable; there will always be too many problems that
    have critical serial parts that are too big. But that
    doesn't mean that I think we're doomed to require big
    hot CPUs that hog electricity.

    Because the problem of writing small, bloat-free programs
    _is_ solvable. Back in the days when all we had was Windows
    3.1 running on 386 and 486 processors, that was enough to
    do nearly everything we do with computers today.

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    If not, why are they still adding unused bloat to them ??

    {{Come to think of it, my 2003 WORD is more useful than my wife's
    2022 WORD because mine wastes less space on the screen with stuff
    I never use.}}

    Now, it isn't in the interest of CPU makers and others in
    the computer industry for users not to be strongly motivated
    to run out and buy newer and faster processors every year
    or two. The death of Dennard Scaling, and the tapering off
    of Moore's Law, however, are taking the wind out of the sails
    of that. Eventually, the improvements will be so minor that
    the CPU makers won't have enough *money* to fund fabs that
    probe the ultimate limits of feature size any longer.

    My desktops tend to last 7-9 years before blowing out a power
    supply transistor. My laptops when the battery dies.

    For some users, CPUs made of some exotic material beyond
    silicon that was 10x as fast... but, because of yield
    issues, could only be used to make small in-order CPUs,
    Gallium Arsenide.
    so the CPUs are only 5x as fast... would be worth almost
    any price. Because the parallel programming problem hasn't
    been solved, whether or not it can be.

    And I don't begrudge them such a development, as it would
    be a step towards making better performance available to
    everyone, as demand drives research into bringing costs
    down.

    What the rest of us really need is lighter-weight software
    that isn't driven by the interests of computer makers instead
    of computer users.

    Bloatware is driven by the software companies needing to sell
    new SW features to stay in business. CPU companies don't care,
    customers with CD or DVD ROM disks don't care either....

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Dec 4 20:03:47 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    18 years ago, when I quit building CPUs professionally, GBOoO
    performance was 2× what an 1-wide IO could deliver. In those
    18 years the CPU makers have gone from 2× to 3× performance
    while the execution window has grown from 48 to 300 instructions.
    Clearly an unsustainable µArchitectural direction.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Mon Dec 4 19:58:18 2023
    Stephen Fuld wrote:

    On 12/3/2023 3:25 PM, Quadibloc wrote:


    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    While I absolutely agree that there is too much resources spent on
    "graphical pizzazz", and while you could run many/most of the same
    programs, that doesn't mean there is no user benefit from faster CPUs.
    For example, you probably could run some simulations, fluid dynamics,
    finite element analysis, etc. but you were severely limited in the size
    of the program you could run in an acceptable amount of elapsed time.

    I might note that all of those applications have no real limitation
    in parallelism.

    And applications like servers of various flavors certainly benefit from faster CPUs, as you need fewer of them. Not to mention driving graphics
    at more realistic resolutions, etc.

    So, no, it wasn't simply the greed of CPU makers that drove us to higher performance systems.

    I would say that CPU makers were driven to build faster, bigger, ...
    machines because SW makers were continuing to consume all of the
    available cycles (whether the end user cared or not.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Mon Dec 4 20:13:55 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Quadibloc wrote:

    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    Dunno. I still use troff.


    If not, why are they still adding unused bloat to them ??

    You can't sell a new version if there's nothing different.



    My desktops tend to last 7-9 years before blowing out a power
    supply transistor. My laptops when the battery dies.

    My cubicle desktop (a Dell tower) is now 11 years old and
    going strong. It's only been powered down a half dozen times
    during that period. I did replace the boot disk with an
    SSD in 2013.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Mon Dec 4 20:34:16 2023
    BGB wrote:

    On 12/3/2023 5:25 PM, Quadibloc wrote:
    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:


    In my own efforts, I can note that a 50MHz CPU, with programs having
    memory foot-prints measured in MB (or less) is "not entirely useless".

    I was working on an Automotive Engine simulator in eXcel on a 33 MHz
    486. That CPU died and I got a 200 MHz Pentium Pro. On the 486, I
    could change a variable (rod length for example) and eXcel would be
    done by the time I walked to the fridge, got a beer and walked back.
    On the PP, it was done in less than 1 second.

    But, looking backwards, I am left to realize that, it seems, I am
    nowhere near close to the levels of performance or efficiency of a lot
    of these early systems.

    Like, seemingly, often it is not so much that the CPU is too weak or
    slow, but that my code is still slow. Often, taking for granted
    coding practices that were formed in the "relative abundance" of CPU
    power in the early 2000s.


    In nearly every other area of engineering, the design constraints were relatively constant; but in software, nearly everyone had the mistaken
    belief that the exponential increases in computing speed and power would continue indefinitely.

    Now it has been steadily falling off, but there has been a sort of
    collective denial about it.

    As I mentioned above, this has more to do with SW companies needing to
    stay in business than in satisfying customer requirements.

    Now, it isn't in the interest of CPU makers and others in
    the computer industry for users not to be strongly motivated
    to run out and buy newer and faster processors every year
    or two. The death of Dennard Scaling, and the tapering off
    of Moore's Law, however, are taking the wind out of the sails
    of that. Eventually, the improvements will be so minor that
    the CPU makers won't have enough *money* to fund fabs that
    probe the ultimate limits of feature size any longer.


    As someone who skipped every other generation, I bought my desktops
    more because the last one died than from the need for faster and faster
    CPUs. Also, since I always had a second disk in the box, transferring
    my files was as simple as moving the drive from box to box.

    I kinda suspect that when Moore's Law is good and dead, there may
    actually be a bit of a back-slide in these areas, as the "best" fabs
    will likely be more expensive to run and maintain than the "good but not
    the best" fabs, and this will create a back-pressure towards whatever is
    "the most bang for the buck" in terms of fab technology.

    The only thing chips smaller than 22nm bring is lower power (which we
    can use) and more cores (which apparently we cannot). We have been at 5GHz
    for nearly a decade.

    I also suspect that the transition from the past/current state, to this
    state of things, is a world where x86-64 is unlikely to fare well.

    It is losing %-TAM to cell phones and ARM.

    Say, in this scenario, x86-64 would be left with an ultimately
    unwinnable battle against the likes of ARM and RISC-V.

    ARM: yes; RISC-V: I would bet against, but it is too early to tell.
    With all the Chinese money in RISC-V I don't think USA.gov will
    allow what the pundits are predicting.

    The exact form things will take will likely depend on a tradeoff:
    Whether it is better to have a smaller number of cores getting the best possible single-thread performance;
    Or, a larger number of cores each giving comparably worse single-thread performance, but there can be more of them for cheaper and less power.


    Say, if you could have cores that only got 1/3 as much performance per
    clock, but could afford to have 8x as many cores in total for the same
    amount of silicon.


    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels model rather than traditional multi-threading with shared memory and synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    Namely, with such models, it may be possible to make better use of a
    many core system, with less pain and overhead than that associated with trying to spawn off large numbers of conventional threads and have them
    all sitting around trying to lock mutexes for shared resources.

    If you want multi-threaded parallel programs you need to design the
    host language under the assumption of having infinite cores (not just
    "many"). It is not the job of the programmer to distribute the work,
    and verify the fork/joins or synchronization, but the environment's.

    Though, not necessarily a great way to map this stuff onto "ye olde C",

    None of the von Neumann programming paradigms carry to the parallel realm.
    a) you can't single step the program
    b) you cannot assume that one inst is performed and then the next ...
    c) you cannot assume the "I call you and you return to me" control handoff
    d) you cannot assume there is 1 point of control

    so effectively one may end up, in this case, with the processes communicating in a form resembling COM objects or similar,

    You need a "net list" of how data flows through the application
    that manages itself.

    with the side effect that (given the structure of the internal dispatch loops), these "objects" can be self-synchronizing and thus don't need an explicit mutex (but, may potentially need a way for the task scheduler
    to queue up in-flight requests, which are then handled asynchronously; possibly with a mechanism in place to indicate whether the request will
    block the caller until it will be handled, or whether the caller will
    resume immediately, potentially even though the called object has not
    yet seen the request).

    You are starting to get the gist, but our feet are still stuck in the vN paradigm.

    Things like async/promises could scale a little more easily to "well, do
    this thing, potentially using as many cores as available". Though, asyncs
    don't make as much sense on a primarily or exclusively single-threaded
    system, and have an annoying level of overhead if emulated on top of
    conventional multithreading (it effectively needs a mutex-protected
    work queue, which can itself become a bottleneck).
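
    For reference, the fork/join shape of that model in standard C++ (using
    std::async/std::future) looks roughly like the sketch below; whether the
    runtime actually spreads the chunks across cores, and whether it funnels
    them through a shared internal queue, is up to the implementation, which
    is exactly the overhead concern raised above. The function name and
    chunk count are made up for illustration:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    // Sum a large vector by forking independent async tasks and joining
    // on their futures; no explicit threads, mutexes, or shared state.
    long long parallel_sum(const std::vector<int>& v, std::size_t chunks = 8) {
        std::vector<std::future<long long>> parts;
        const std::size_t n = v.size();
        const std::size_t step = (n + chunks - 1) / chunks;
        for (std::size_t lo = 0; lo < n; lo += step) {
            const std::size_t hi = std::min(n, lo + step);
            parts.push_back(std::async(std::launch::async, [&v, lo, hi] {
                return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
            }));
        }
        long long total = 0;
        for (auto& f : parts) total += f.get();    // join point
        return total;
    }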

    This is simply syntactic sugar--the programmer should not have to
    manage the parallelism !! The environment does this !!

    Ideally, one would need a mechanism to distribute and balance tasks
    across the available cores that does not depend on needing to lock a
    mutex. Say, for example, maybe using an inter-processor interrupt or
    similar to "push" tasks or messages to the other cores, with some shared visible state for "how busy each core is" but not needing to lock
    anything to look at this state.
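
    A rough sketch of the "visible but lock-free" load information part of
    that idea, assuming one atomic counter per core that the scheduler only
    reads when picking a target; the actual hand-off (an IPI, a doorbell
    write, a per-core queue) is not shown, and all names here are invented
    for illustration:

    #include <array>
    #include <atomic>
    #include <cstddef>

    constexpr std::size_t kCores = 8;              // illustrative core count

    // One pending-work counter per core, padded to avoid false sharing.
    struct alignas(64) CoreLoad {
        std::atomic<unsigned> pending{0};
    };
    std::array<CoreLoad, kCores> g_load;

    // Pick the least-busy core without taking any lock; relaxed reads are
    // fine because the counters are only a hint, not a guarantee.
    std::size_t pick_target_core() {
        std::size_t best = 0;
        unsigned best_load = g_load[0].pending.load(std::memory_order_relaxed);
        for (std::size_t i = 1; i < kCores; ++i) {
            unsigned l = g_load[i].pending.load(std::memory_order_relaxed);
            if (l < best_load) { best_load = l; best = i; }
        }
        return best;
    }

    // The sender would then bump the counter and push/IPI the task:
    //   g_load[target].pending.fetch_add(1, std::memory_order_relaxed);
    // and the target core decrements it once the task has been handled.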

    I used to make a joke:: Verilog compiles your parallel application into
    2 miles of straight line code, the first mile is the clock high code,
    the second mile is the clock low code. No loops, no branches, just
    LD/ST and compute.

    This was back in the 1 CPU/system era. But all those loops, linked
    data structures, and procedure call/return tree were completely flattened
    into straight line code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Mon Dec 4 21:33:01 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:


    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels model
    rather than traditional multi-threading with shared memory and
    synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Dec 4 17:21:27 2023
    I kinda suspect that when Moore's Law is good and dead, there may
    actually be a bit of a back-slide in these areas, as the "best" fabs
    will likely be more expensive to run and maintain than the "good but
    not the best" fabs, and this will create a back-pressure towards
    whatever is "the most bang for the buck" in terms of fab technology.

    AFAIK this future arrived a few years ago: the lowest cost
    per-transistor is not on the densest/smallest nodes any more, which is
    why many SoCs don't bother to use those densest/smallest nodes.

    I also suspect that the transition from the past/current state, to
    this state of things, is a world where x86-64 is unlikely to
    fare well.

    I suspect that the ISA makes sufficiently little difference at this
    point that it doesn't matter too much.

    Things like async/promises could scale a little easier to "well, do
    this thing, potentially using as many cores as available".

    Async/promises are handy for concurrency, but they don't bring much
    benefit for parallelism.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Tue Dec 5 00:57:40 2023
    On Mon, 04 Dec 2023 19:54:10 +0000, MitchAlsup wrote:

    Gallium Arsenide.

    I thought that while Gallium Arsenide was _once_ thought
    of as something faster than silicon, Intel had, by using
    it as a template for "stretched silicon", managed to
    improve silicon enough to make it just as good as Gallium
    Arsenide... or, at least, this seemed to be what they
    were claiming.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Tue Dec 5 01:08:01 2023
    Quadibloc wrote:

    On Mon, 04 Dec 2023 19:54:10 +0000, MitchAlsup wrote:

    Gallium Arsenide.

    Gallium Arsenide is what is used in Hubble's 60GHz radio links.

    I thought that while Gallium Arsenide was _once_ thought
    of as something faster than silicon, Intel had, by using
    it as a template for "stretched silicon", managed to
    improve silicon enough to make it just as good as Gallium
    Arsenide... or, at least, this seemed to be what they
    were claiming.

    Stretched, low-K dielectrics, high-K gates are all 10%-15% jumps
    at the chip level; as was FinFET and will be Gate-all-around.

    Gallium Arsenide is 5×; hideously expensive, dangerous to the
    workers in the FAB, and chemical disposal, low yield,.....

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Tue Dec 5 01:11:24 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:


    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels model
    rather than traditional multi-threading with shared memory and
    synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    Because it treats each bit as if it had (at least) 4 states.

    Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
    but would allow one to use all 1M CPUs in a system; instantly, and with
    out rewriting anything !

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Tue Dec 5 09:17:03 2023
    On Mon, 04 Dec 2023 20:48:55 -0600, BGB wrote:

    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform acceptably on cheaper cores.

    The pressure would be in the direction of best perf/$, which will in
    turn be best perf per die area, which is not really a battle that x86 is
    likely to win in the longer term sense.

    If ARM or RISC-V catch up and end up being able to deliver more cores
    that are faster and cheaper than what is realistically possible for x86
    to offer, then x86's days will be numbered.

    I think this reasoning makes a lot of sense.

    The trouble is that:

    a) x86 has an enormous pool of software, and
    b) it is possible to build x86 processors, with current processes,
    that anyone can afford, and which have adequate performance, and
    c) much of the cost of a computer system is in the box housing
    the CPU, not just the CPU itself.

    However, in my opinion, x86-64 threw away the biggest advantage of x86,
    because it repeated the mistake of the 80286. It wasn't designed to
    make it easy and trivial for 16-bit Windows programs to run on 64-bit
    editions of Windows, without resorting to any fancy techniques like virtualization.

    Instead, they should just run, without Microsoft having to make much
    effort (of course, they would still have to thunk the OS calls).

    Then Windows' huge advantage, which carries over to the x86 architecture
    as well, the huge pool of software written for it, would be there in
    full.

    So Windows today seems to be in the situation where everything that is
    not bloatware is lost. That makes it easier for a competing architecture
    to win, it just has to not make the same mistake. Then lightweight
    programs _plus_ less complicated instruction decoding will compound
    the performance advantage of an alternate ISA.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Tue Dec 5 09:13:00 2023
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Overall, while the direction of the area/cost reduction, and performance
    loss are clear, the magnitude of these is more difficult to predict
    before actually doing it.

    Fortunately, ARM and Intel have implemented such CPUs, so we can
    measure it. For the small Gforth benchmarks, I see (numbers are times
    in seconds):

    sieve bubble matrix fib fft
    0.348 0.384 0.300 0.460 0.356 Bonnell 1.6GHz (in-order 2-wide 2008)
    0.146 0.208 0.090 0.239 0.154 Silvermont 2.4GHz (OoO 2-wide E-core 2013)
    0.112 0.124 0.028 0.116 0.036 Sandy Bridge 3GHz (OoO 4-wide P-Core 2011)
    0.099 0.095 0.035 0.074 0.037 Tremont 2.8GHz (OoO E-Core 2020)
    0.037 0.043 0.014 0.035 0.015 Rocket Lake 5.1GHz (OoO wide P-Core 2021)

    0.250 0.296 0.159 0.256 0.151 Cortex A55 1.8GHz (in-order 2-wide 2017)
    0.180 0.208 0.072 0.232 0.084 Cortex A73 1.8GHz (OoO 2-wide 2016)
    0.116 0.160 0.042 0.087 0.051 Cortex A76 2.2GHz (OoO 4-wide 2018)
    0.111 0.116 0.046 0.098 0.071 IceStorm 2.06GHz (OoO Apple M1 E-core)
    0.088 0.054 0.028 0.047 0.034 Firestorm 3.2GHz (OoO Apple M1 P-core)

    Bonnell is really slow. The A55 managed to be quite a bit faster even
    though it has the same width and not much faster clock rate. In any
    case, the A55 is beaten by Firestorm by a factor of 5 on most
    benchmarks (and these are benchmarks that are not helped much by the
    larger caches of Firestorm); not a factor 2 any more.

    As for area, yes, I guess that the A55 is smaller than Firestorm by
    more than a factor of 5. What does it help? There have been startups
    that tried to put many small cores on a single chip (and Intel, in a
    way, with Knight's Ferry, too). They were not particularly
    successful; even Knights Ferry, where the target market was
    supercomputing (where applications are well parallelizable and
    software pipelining works well) was replaced by OoO Knights Corner and eventually with the mainline wide OoO cores.

    My impression is that the caches and interconnect between cores costs
    so much area that it does not pay off to build lots of small ones.
    Intel announced one with 288 Gracemonts (the successor of Tremont),
    but Gracemont is more advanced (and would probably use more area in
    the technology of the day) than the GBOoO (probably K8) that Mitch
    Alsup compared the little core with in 2006.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 5 11:26:23 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs

    "Wildly inefficient" in what way? As far as energy is concerned,
    comparing the A55 to the A75 <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
    at the highest respective efficiency, you get a factor of about 3.5 in performance for a cost factor of 1.1 in energy efficiency. Wildly
    inefficient?

    Compare that to just raising the voltage and clock of the in-order
    core: There you get the same factor 3.5 at a loss in efficiency by
    more than a factor of 2.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 5 11:07:09 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general
    terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers. They have no reorder buffer and no speculative execution.
    They have imprecise exceptions, whereas modern OoO processors have
    precise exceptions. And do they split instructions into uops that
    find their own way through the OoO execution engine? I don't think
    so, because that needs a reorder buffer.

    I have not read the HPS papers for a long time, but they certainly
    look closer to what is implemented in modern OoO machines. However,
    looking at my comments for [hwu&patt87isca], there is still quite a
    bit of difference between that and modern OoO.

    @InProceedings{patt+85a,
    author = "Yale N. Patt and {Wen-mei} Hwu and Michael Shebanow",
    title = "{HPS}, a New Microarchitecture: Rationale and Introduction",
    crossref = "micro85",
    pages = "103--108",
    annote = "CISC instructions are decoded into RISC instructions,
    which are executed in parallel using dynamic
    scheduling etc. This microengine is presented as a
    restricted data flow machine."
    }

    @InProceedings{patt+85b,
    author = "Yale N. Patt and Stephen W. Melvin and {Wen-mei} Hwu
    and Michael C. Shebanow",
    title = "Critical Issues Regarding {HPS}, a High Performance Microarchitecture",
    crossref = "micro85",
    pages = "109--116",
    annote = "Discusses in depth some of the issues in dynamic
    scheduling hardware."
    }

    @Proceedings{micro85,
    key = "MICRO-18",
    booktitle = "The $18^{th}$ Annual Workshop on Microprogramming
    (MICRO-18)",
    title = "The $18^{th}$ Annual Workshop on Microprogramming
    (MICRO-18)",
    year = "1985",
    }

    @InProceedings{hwu&patt87isca,
    author = "{Wen-mei} Hwu and Yale N. Patt",
    title = "Checkpoint Repair for Out-of-order Execution Machines",
    crossref = "isca87",
    pages = "18--26",
    note = "Newer version: \cite{hwu&patt87ieeetc}",
    annote = "Describes design issues in checkpoint mechanisms for
    precise interrupts and speculative execution. Their
    design uses backup register files and difference
    techniques for main memory. Instructions can be
    retired out-of-order, avoiding full window
    conditions."
    }

    @Article{hwu&patt87ieeetc,
    author = "{Wen-mei} Hwu and Yale N. Patt",
    title = "Checkpoint Repair for High-Performance Out-of-order
    Execution Machines",
    journal = ieeetc,
    year = "1987",
    volume = "36",
    number = "12",
    pages = "1496--1514",
    month = dec
    }

    @Proceedings{isca87,
    key = "ISCA-14",
    booktitle = "The $14^{th}$ Annual International Symposium on
    Computer Architecture (ISCA)",
    title = "The $14^{th}$ Annual International Symposium on
    Computer Architecture (ISCA)",
    year = "1987",
    address = "Pittsburgh, Pennsylvania",
    organization = "IEEE Computer Society TCCA and ACM SIGARCH",
    note = "{\em Computer Architecture News,} 15(2), June 1987",
    month = jun # " 2--5,",
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Tue Dec 5 14:59:04 2023
    BGB <cr88192@gmail.com> writes:
    On 12/4/2023 4:21 PM, Stefan Monnier wrote:
    I kinda suspect that when Moore's Law is good and dead, there may
    actually be a bit of a back-slide in these areas, as the "best" fabs
    will likely be more expensive to run and maintain than the "good but
    not the best" fabs, and this will create a back-pressure towards
    whatever is "the most bang for the buck" in terms of fab technology.

    AFAIK this future arrived a few years ago: the lowest cost
    per-transistor is not on the densest/smallest nodes any more, which is
    why many SoCs don't bother to use those densest/smallest nodes.


    OK.

    I also suspect that the transition from the past/current state, to
    this state of things, is a world where x86-64 is unlikely to
    fare well.

    I suspect that the ISA makes sufficiently little difference at this
    point that it doesn't matter too much.


    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform
    acceptably on cheaper cores.

    Are they cheaper? There are a lot of sunk costs already absorbed
    by the x86-64 family both at Intel and AMD.

    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 5 15:44:24 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    No. What mitigations do we have:

    * Retpolines (against Spectre v2): These ensure that an indirect
    branch mispredicts in a harmless way, so they completely suppress
    speculation. I have seen slowdowns in Gforth by up to a factor of
    9.5 from retpolines. All indirect branches in a process have to be
    converted to retpolines, and even if you do that, there is Inception
    (which works without any indirect branch in the victim).

    * Speculative load hardening (against Spectre v1): This adds the
    control dependencies as data dependencies to loads, essentially
    eliminating speculation of loads (and thus mostly eliminating
    speculation and its speed benefits). The slowdown for Ultimate SLH
    is a factor 2.5 on SPEC, and not much less for weaker SLH versions.
    The hope is that this can be done selectively, only on the loads
    where the attacker can influence the loaded address, reducing the
    slowdown. But if you make one mistake in the wrong direction, your
    program is vulnerable.

    * Ways to clear various microarchitectural state (e.g., branch
    predictors, caches), in the hope that this prevents the primed
    predictor to reach the victim, or the changed microarchitectural
    state to reach the attacker.

    None of these mitigations prevent speculative changes to
    microarchitectural state from continuing to be in microarchitectural
    state after the misprediction is resolved. By contrast, speculative architectural state is thrown away when the misprediction is resolved.
    That's why they are mitigations, not fixes.
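
    To make the Spectre v1 shape concrete, here is a hedged C++ sketch of
    the classic bounds-check gadget and of the index-masking idea that
    speculative load hardening generalizes. This is only the shape of the
    fix, not a guaranteed mitigation on any particular CPU, and the array
    names and sizes are invented for illustration:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kTableSize = 1024;       // power of two, illustrative
    extern uint8_t table[kTableSize];
    extern uint8_t probe[256 * 64];                // observable via the cache

    // Vulnerable shape: the bounds check can be speculated past, so both
    // dependent loads may execute with an out-of-bounds idx and leave a
    // cache footprint that encodes out-of-bounds data.
    uint8_t victim_unsafe(std::size_t idx) {
        if (idx < kTableSize)
            return probe[table[idx] * 64];
        return 0;
    }

    // Masked shape: since kTableSize is a power of two, the AND forces the
    // speculative load address into bounds as a pure data dependency, with
    // no branch for the predictor to get wrong. SLH generalizes this by
    // masking every load address with the controlling branch condition.
    uint8_t victim_masked(std::size_t idx) {
        if (idx < kTableSize) {
            idx &= (kTableSize - 1);               // in bounds even if misspeculated
            return probe[table[idx] * 64];
        }
        return 0;
    }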

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    Buy a Raspi3, or, for more performance, Odroid HC4. No Spectre there,
    low power, maybe few transistors.

    Anyway, what the mainstream players have been doing seems to be:
    Hardware vendors throw the problem over to software people;
    application people do nothing about it, while systems software people
    try to mitigate the problems in various ways, including those outlined
    above. Users are lulled by the claim that they are not affected by
    Spectre, because there are other, easier-to-exploit vulnerabilities on
    their computer, and Spectre is supposedly so much harder to exploit. So
    they buy fast OoO CPUs rather than Odroid HC4s. And consequently
    nobody has capitalized on the Spectre-invulnerability of in-order
    cores.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Wed Dec 6 02:29:31 2023
    Quadibloc wrote:

    On Mon, 04 Dec 2023 20:48:55 -0600, BGB wrote:

    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform
    acceptably on cheaper cores.

    The pressure would be in the direction of best perf/$, which will be
    in turn best perf per die area, which is not really a battle that x86 is
    likely to win in the longer term sense.

    If ARM or RISC-V catch up and end up being able to deliver more cores
    that are faster and cheaper than what is realistically possible for x86
    to offer, then x86's days will be numbered.

    I think this reasoning makes a lot of sense.

    The trouble is that:

    a) x86 has an enormous pool of software, and
    b) it is possible to build x86 processors, with current processes,
    that anyone can afford, and which have adequate performance, and
    c) much of the cost of a computer system is in the box housing
    the CPU, not just the CPU itself.

    However, in my opinion, x86-64 threw away the biggest advantage of x86, because it repeated the mistake of the 80286. It wasn't designed to
    make it easy and trivial for 16-bit Windows programs to run on 64-bit editions of Windows, without resorting to any fancy techniques like virtualization.

    A lot of this had to do with reclaiming prefixes so that we could make
    -64 work in long mode.

    Instead, they should just run, without Microsoft having to make much
    effort (of course, they would still have to thunk the OS calls).

    This would have been a big loss--argument passing in registers, continued
    use of segmentation when everyone was using a flat memory model,.....
    No, the real problem was 286 creating the segmentation model in the first
    place. {I left a company at this transition because I did not want to go segmented style writing asm......they later went OoB.}

    Then Windows' huge advantage, which carries over to the x86 architecture
    as well, the huge pool of software written for it, would be there in
    full.

    So Windows today seems to be in the situation that all that which is
    not bloatware is lost. That makes it easier for a competing architecture
    to win, it just has to not make the same mistake. Then lightweight
    programs _plus_ less complicated instruction decoding will compound
    the performance advantage of an alternate ISA.

    Now imagine an architecture that context switches in 10 cycles instead
    of taking 1,000 cycles to reach somebody capable of slowly walking
    through the context switching process...................

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Wed Dec 6 07:31:54 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:
    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    Because it treats each bit as if it had (at least) 4 states.

    Actually, as I learned in HOPL-IV, Verilog won the speed race that
    counts, the one against VHDL, because it was designed around
    these 4 states and implements them efficiently, whereas VHDL allows
    more states.

    Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
    but would allow one to use all 1M CPUs in a system; instantly, and with
    out rewriting anything !

    Given that simulation efficiency is the reason that Verilog won, your 1-bit-Verilog should be a winner. But what do you do about the
    high-impedance state of MOS?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Wed Dec 6 07:54:07 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    BGB <cr88192@gmail.com> writes:
    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform
    acceptably on cheaper cores.

    Are they cheaper?

    Good question. I can buy a server with 128GB of ECC RAM and two
    enterprise SSDs with a Ryzen for EUR 2000. With that amount of money,
    I get no ARM or RISC-V machine with similar capabilities.

    Looking for places where I can actually buy something with ARM: The
    Rock 5B with 16GB RAM cost EUR 240 plus EUR 25 or so for the PSU (with
    some anxiety on whether it would work), without a case. I can buy a
    barebone with an Intel N100 starting at EUR 192 including case and
    PSU, but I have to add 16GB RAM for about EUR 30; The N100 is faster
    for single-threaded stuff, and probably similarly fast for
    multi-threaded stuff. Here's a speed comparison between the A76 on
    the Rock 5B, and the Tremont (predecessor of the Gracemont in the
    N100); numbers are times in seconds, lower is better:

    sieve bubble matrix fib fft
    0.099 0.095 0.035 0.074 0.037 Tremont 2.8GHz (OoO E-Core 2020)
    0.116 0.160 0.042 0.087 0.051 Cortex A76 2.2GHz (OoO 4-wide 2018)
    0.452 0.526 0.314 0.676 0.603 JH7100 1GHz

    The Raspi5 will be cheaper, but not offered with 16GB.

    Concerning RISC-V, you can buy a Visionfive V2 with a JH7110 (for
    USD100 with 8GB), but even with 1.5GHz, it will be dog slow, as you
    can see above.

    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    We already have data-flow microarchitectures since the mid-1990s, with
    the success of OoO execution. And the "von Neumann" ISAs have proven
    to be a good and long-term stable interface between software and these data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
    and their microarchitectures turned out to be uncompetitive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andreas Eder@21:1/5 to BGB on Wed Dec 6 10:04:04 2023
    On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:

    QEMU does better emulation, but lacks any real way of sharing files with
    the host OS.

    Look at what is described here:
    https://en.wikibooks.org/wiki/QEMU/FreeDOS

    You can simply mount the image (when qemu isn't running) and copy files
    to and fro.

    'Andreas

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Wed Dec 6 14:52:35 2023
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The
    main instructions that seem to have "the most bang for the buck"
    are ones that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.

    AND, no ISA with more than about 200 instructions should be considered RISC,

    The path lengths also varied considerably based on the compiler
    version — a more recent version usually helping RISC-V more as
    would be expected for a more recent ISA — though the results do
    seem to point to general consistency of path length across
    versions (one benchmark had negligible change for both ISAs, one
    improved AArch64 only, two helped RISC-V only, and one helped both
    ISAs but RISC-V more than AArch64).

    I am somewhat surprised that indexed memory accesses did not
    benefit AArch64 more (for such "scientific" benchmarks). AArch64's
    need for a distinct comparison instruction for branches presumably
    hurt, especially since loops were not unrolled. (AArch64 does, I
    think, include a branch on equal/not-equal zero, so reverse
    counted loops would have removed that disadvantage in some cases.)

    My data indicates the indexed advantage is in the 2%-3% range.

    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    so one would expect the
    most common operations to be present as instructions in both. The
    differences would be mainly in special instructions (AArch64 has
    many), memory addressing (AArch64 has more complex addressing
    modes), branches (RISC-V has comparisons on integer values in the
    branch instruction, AArch64 can sometimes set condition codes
    without an additional instruction), and immediate sizes (AArch64
    has larger base immediates — 16-bit vs. 12-bit and ways of
    generating some larger immediates).

    The special instructions seem unlikely to affect path length much
    on such benchmarks and I suspect most of the constants are either
    small integers or floating point values. This leaves branches and
    memory accesses to affect path length.

    A compiler or a web browser would have more interesting
    instruction use, I suspect.

    The benchmarks used were:
    • STREAM [11]
    A benchmark for measuring sustained memory bandwidth widely used
    in industry, this consists of 4 simple kernels applied to elements
    of arrays of size 10,000,000.
    • CloverLeaf Serial [10]
    A high energy physics simulation solving the compressible Euler
    equations on a 2D Cartesian grid. This is broken down into a
    series of kernels each of which loops over the entire grid. This
    is run with default parameter.
    • MiniBUDE [12, 15]
    A mini app approximating the behaviour of a molecular docking
    simulation used for drug discovery. Run with the bm1 input at 64
    poses for one iteration (-n 64 -i 1 --deck /bm1).
    • Lattice Boltzmann (LBM)
    A d2q9-bgk Lattice Boltzmann algorithm, developed within the HPC
    Research Group at the University of Bristol, optimised for serial
    execution. Run with a grid size of 128x128 for 100 iterations.
    • Minisweep [13]
    A radiation transportation mini app reproducing the Denovo Sn
    radiation transport behaviour used for nuclear reactor neutronics
    modeling. Run with options --ncell_x 8 --ncell_y 16 --ncell_z 32
    --ne 1 --na 32

    The paper was not really very good.

    Captain Obvious strikes again.

    While some would argue that
    excluding cache misses and branch mispredictions from
    consideration even for maximum ILP is silly — I do not have a
    problem with such in a limit study — the lack of loop unrolling
    (or value inference/prediction for incremented values) makes the
    results less accurately reflect a true maximum. The benchmarks are
    also such that parallelism is much higher than usual.

    Comparing performance of ISAs in such a limit study (same
    frequency) seems to mostly be comparing the dataflow traits of the
    programs rather than the tradeoffs presented by the ISAs, though
    there were notable differences when instruction latencies were
    allowed to be more realistic.

    Fair ISA comparisons are hard. ISA interacts with multiple aspects
    of microarchitecture. One could present an optimized
    implementation space (with the dimensions of energy, time-to-
    completion, and area/yield/cost — one might have to model an
    optimum financial binning!), but that would seem to involve an
    enormous amount of work even with rough approximations.

    More testing may be needed.

    Choosing benchmarks (and what to measure) tends to be iterative.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Wed Dec 6 15:03:26 2023
    BGB <cr88192@gmail.com> writes:
    On 12/5/2023 8:29 PM, MitchAlsup wrote:
    Quadibloc wrote:

    But, the decoder still worked as-is for 32-bit x86, and the CPU isn't
    going to be running 16-bit and 64-bit code at the same time, ...

    Granted, IIRC an issue was that when Long-Mode-Enable is set, the mode
    bit-patterns for 16-bit mode were reused for 64-bit mode (and VM86 mode
    went poof as well).

    But, otherwise they might have needed to "get creative" and find a way
    to encode more CPU modes.

    Either way, would have been happier if MS had included a built-in
    emulator for 16-bit stuff.

    At that point, nobody was using the 16-bit stuff except for a few
    hobbyists. Good riddance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Dec 6 17:44:50 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:
    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    Because it treats each bit as if it had (at least) 4 states.

    Actually, as I learned in HOPL-IV, Verilog won the speed race that
    counts, the one against VHDL, because it has been designed around
    these 4 states, and implementing them efficiently, whereas VHDL allows
    more states.

    VHDL allows for "current fighting" between 2 driving nodes.

    Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
    but would allow one to use all 1M CPUs in a system; instantly, and with
    out rewriting anything !

    Given that simulation efficiency is the reason that Verilog won, your
    1-bit-Verilog should be a winner. But what do you do about the
    high-impedance state of MOS?

    This requires the 4-state model (to mimic anything similar to real circuitry.) In any event, technologies smaller than 30nm no longer allow this form of logic.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Wed Dec 6 19:07:32 2023
    BGB <cr88192@gmail.com> writes:
    On 12/6/2023 3:04 AM, Andreas Eder wrote:
    On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:

    QEMU does better emulation, but lacks any real way of sharing files with
    the host OS.

    Look at what is described here:
    https://en.wikibooks.org/wiki/QEMU/FreeDOS

    You can simply mount the image (when qemu isn't running) and copy files
    to and fro.


    Windows can't mount filesystem images...

    WSL1 can't do it either, and WSL2 doesn't work on my PC.

    Maybe it's time to switch to linux? Or at least a dual-boot
    setup?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Wed Dec 6 21:55:55 2023
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 12/6/2023 11:07 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/6/2023 3:04 AM, Andreas Eder wrote:
    On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:

    QEMU does better emulation, but lacks any real way of sharing files with
    the host OS.

    Look at what is described here:
    https://en.wikibooks.org/wiki/QEMU/FreeDOS

    You can simply mount the image (when qemu isn't running) and copy files
    to and fro.


    Windows can't mount filesystem images...

    WSL1 can't do it either, and WSL2 doesn't work on my PC.

    Maybe it's time to switch to linux? Or at least a dual-boot
    setup?


    :^) Fwiw, I remember using a lot of those handy harddrive caddies way
    back 22'ish years ago. I remember one time I had a lot of them, Solaris,
    Linux, WinNT4, WinME, MSDOS, etc...



    Today you can boot off an SSD connected to USB. With USB-C and an
    NVME ssd, you can get excellent performance to boot.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Thu Dec 7 13:33:53 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    But what do you do about the
    high-impendance state of MOS?

    This requires the 4-state model (to mimic anything similar to real circuitry).
    In any event, technologies smaller than 30nm no longer allow this form of
    logic.

    But is it prevented by static checking? If not, you still need to
    represent it in simulation, and report it as a bug there.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:03:50 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    But what do you do about the
    high-impendance state of MOS?

    This requires the 4-state model (to mimic anything similar to real circuitry.)
    In any event, technologies smaller than 30nm no longer allow this form of
    logic.

    But is it prevented by static checking? If not, you still need to
    represent it in simulation, and report it as a bug there.

    You still need the X state {don't know if the value is 1 or 0}--set all flip-flops to X at power on and watch HW achieve initialized state.
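
    One common way simulators keep 4-state semantics cheap is to carry a
    "value" plane and an "unknown" plane per signal. The C++ sketch below
    shows the idea for a single-bit AND (the encoding and names are
    illustrative, not how any particular simulator stores its nets, and Z
    is folded into X for brevity):

    #include <cstdio>

    // Two-plane encoding of a 4-state scalar:
    //   {unk=0, val=0} -> 0,  {unk=0, val=1} -> 1,  {unk=1, val=*} -> X
    struct Logic4 {
        bool val;
        bool unk;
    };

    // 4-state AND: a definite 0 on either input dominates even an X;
    // otherwise any unknown input makes the result unknown.
    Logic4 and4(Logic4 a, Logic4 b) {
        const bool a0 = !a.unk && !a.val;          // a is definitely 0
        const bool b0 = !b.unk && !b.val;          // b is definitely 0
        if (a0 || b0)         return {false, false};   // result is 0
        if (!a.unk && !b.unk) return {true,  false};   // 1 AND 1
        return {false, true};                           // result is X
    }

    int main() {
        Logic4 x    = {false, true};               // X, e.g. an uninitialized flop
        Logic4 one  = {true,  false};
        Logic4 zero = {false, false};
        std::printf("X & 0 -> unk=%d val=%d\n", and4(x, zero).unk, and4(x, zero).val);
        std::printf("X & 1 -> unk=%d val=%d\n", and4(x, one).unk,  and4(x, one).val);
    }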

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to MitchAlsup on Thu Dec 7 21:14:53 2023
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The main
    instructions that seem to have "the most bang for the buck" are ones
    that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    I still struggle to find a good definition of "1 instruction". For me
    the definition is loosely "one distinct operation", and so there can be
    many variants of one instruction (e.g. variants with register operands
    or register + immediate operands all count as a single instruction),
    that all carry out the same operation, but with different kinds of
    operands or operand sizes.

    In my ISA I refer to these as "major instructions", and each
    instructions typically have several variants (currently up to 18
    variants per major instruction, where different permutations of scalar
    and vector register operands count as different variants of a single instruction, for instance).

    If I count this way, I currently have 106 instructions, which by your definition safely puts MRISC32 in the "RISC" camp. However, if I count
    every variant as a separate instruction, I blow the budget.

    Also worth mentioning is that my current instruction encoding scheme
    allows for 1535 major instructions, so there is still plenty of room for extensions (even though I already have pretty complete integer,
    floating-point and vector support).


    The path lengths also varied considerably based on the compiler
    version — a more recent version usually helping RISC-V more as
    would be expected for a more recent ISA — though the results do
    seem to point to general consistency of path length across
    versions (one benchmark had negligible change for both ISAs, one
    improved AArch64 only, two helped RISC-V only, and one helped both
    ISAs but RISC-V more than AArch64).

    I am somewhat surprised that indexed memory accesses did not
    benefit AArch64 more (for such "scientific" benchmarks). AArch64's
    need for a distinct comparison instruction for branches presumably
    hurt, especially since loops were not unrolled. (AArch64 does, I
    think, include a branch on equal/not-equal zero, so reverse
    counted loops would have removed that disadvantage in some cases.)

    My data indicates the indexed advantage is in the 2%-3% range.

    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Dec 7 22:34:08 2023
    BGB wrote:

    On 12/7/2023 2:14 PM, Marcus wrote:

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.


    Yeah.


    Load/Store, and doesn't use a "variable number of bytes" encoding scheme (like x86/Z80/6502 variants).

    Does a variable number of words fit this criterion?

    Or, the 'R' could refer more to keeping instruction logic simple, rather
    than minimizing the number of instructions that can exist in the
    instruction listing.

    In the end it is how do you fit K instructions through your pipeline in
    fewer cycles than someone can fit 1.4×K instructions through their pipeline.

    Well, and probably that it is viable to implement a CPU core for the
    entire ISA without needing a microcode ROM or similar.

    There is no microcode in My 66000 1-wide or 6-wide implementations.
    But there is no reason one could not build a My 66000 using microcode
    should that be the best choice for some implementation.

    It is probably not viable to build a {bug for bug} compatible x86
    without microcode.

    Admittedly, I feel unease with instructions which violate the Load/Store model, which goes for both my experimental LDOP extension and the RISC-V
    'A' extension (where essentially LDOP and 'A' represent the same basic
    CPU functionality).

    Though, it is "sort of passable" in that it is possible to implement
    these by shoving a minimal ALU into the L1 cache, rather than needing to restructure the whole pipeline (as would be needed for a more general x86-like model).

    It is SO EASY to track this dependency based on register forwarding
    that creating a LdOp was done for some other reason.

    Then again, not many people are going and being like "The A extension
    makes RISC-V no longer RISC".

    BECAUSE RISC-V is already not RISC (less than 200 instructions)...

    But, then again, there are people who go on about how "[Rm+Ro*Sc]"
    addressing is "Not RISC", nevermind that nearly every other RISC
    (besides RISC-V) had included it (whether or not they also included a
    way to explicitly encode the scale, or if the scale was baked into the instruction, *).

    It was accepted as RISC in Mc88100 ISA. {MIPS did not have, ...}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Marcus on Thu Dec 7 22:23:50 2023
    Marcus wrote:

    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The main
    instructions that seem to have "the most bang for the buck" are ones
    that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    I choose 200 as the upper bound since 100 is obviously too small
    {even though I get by with 61} and any vectorized or SIMDed ISA
    is way more than 200.

    I still struggle to find a good definition of "1 instruction".

    1 Instruction is 1 Spelling the assembly language programmer has to
    remember.

    For me
    the definition is loosely "one distinct operation", and so there can be
    many variants of one instruction (e.g. variants with register operands
    or register + immediate operands all count as a single instruction),
    that all carry out the same operation, but with different kinds of
    operands or operand sizes.

    I hold this same view.

    In my ISA I refer to these as "major instructions", and each
    instructions typically have several variants (currently up to 18
    variants per major instruction, where different permutations of scalar
    and vector register operands count as different variants of a single instruction, for instance).

    VVM makes this distinction unnecessary.

    If I count this way, I currently have 106 instructions, which by your definition safely puts MRISC32 in the "RISC" camp. However, if I count
    every variant as a separate instruction, I blow the budget.

    My 66000 has 61 instructions under this framework. This includes {flow
    control, Integer, Logical, Shift, Floating point, Transcendentals,
    conversions, privileged, vectorization, and SIMD}

    Also worth mentioning is that my current instruction encoding scheme
    allows for 1535 major instructions, so there is still plenty of room for extensions (even though I already have pretty complete integer, floating-point and vector support).

    My 66000 encoding scheme supports 2048 1-operand instructions at the consumption of 1 Major OpCode. Only the 3-operand subGroup is stressed
    for Minor OpCodes.


    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    My point was that it should not be redefined into meaninglessness.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Dec 8 02:36:04 2023
    BGB wrote:

    On 12/7/2023 4:34 PM, MitchAlsup wrote:
    BGB wrote:

    On 12/7/2023 2:14 PM, Marcus wrote:

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.


    Yeah.


    Load/Store, and doesn't use a "variable number of bytes" encoding
    scheme (like x86/Z80/6502 variants).

    Does variable number of words fit this criterion.


    Variable number of words is probably OK, otherwise Thumb2 and RVC would
    no longer be RISC...

    My point on VLEness is that all the position and length information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC than x86.

    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains RISCable if
    it does not violate other RISC tenets.

    Or, the 'R' could refer more to keeping instruction logic simple,
    rather than minimizing the number of instructions that can exist in
    the instruction listing.

    In the end it is how do you fit K instructions through your pipeline in
    fewer cycles than some on can fit 1.4×k instructions through their
    pipeline.


    I could probably save a number of instructions if BJX2 was not
    Load/Store, but worth it?...


    Say, without LDOP:
    MOV 16, R6
    MOV.L (R4, 0), R5
    ADD R5, R6, R5
    MOV.L R5, (R4, 0)

    Vs, with LDOP:
    ADDS.L 16, (R4, 0) //*

    This is actually an OP-ST.


    Or, maybe go further, and add, say:
    INC.L (R4)
    DEC.L (R4)
    ...

    This is actually a Ld-Op-ST not a LD-Op.

    -------------------------

    Well, and probably that it is viable to implement a CPU core for the
    entire ISA without needing a microcode ROM or similar.

    There is no microcode in My 66000 1-wide or 6-wide implementations.
    But there is no reason one could not build a My 66000 using microcode
    should that be the best choice for some implementation.

    It is probably not viable to build a {bug for bug} compatible x86
    without microcode.


    OK.


    Admittedly, I feel unease with instructions which violate the
    Load/Store model, which goes for both my experimental LDOP extension
    and the RISC-V 'A' extension (where essentially LDOP and 'A' represent
    the same basic CPU functionality).

    Though, it is "sort of passable" in that it is possible to implement
    these by shoving a minimal ALU into the L1 cache, rather than needing
    to restructure the whole pipeline (as would be needed for a more
    general x86-like model).

    It is SO EASY to track this dependency based on register forwarding
    that creating a LdOp was done for some other reason.


    ?...

    How do you think 1-wide in-order machines determine that stage 3 of the
    pipeline contains the required register value and that reading the
    register file will have been in vain ??

    It is called FORWARDING, no pipeline gets along without it. You can even
    split the LD part from the OP part from the ST part, or like Athlon, you
    can split the Ld-Op-ST into a triple firing reservation station, or like
    modern *-lake convert them into 3 µOps.
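
    A toy C++ model of that forwarding check, assuming a simple in-order
    pipeline where the EX and MEM stages may each hold a result that has
    not reached the register file yet; the structure and function names
    are invented for illustration:

    #include <cstdint>
    #include <optional>

    // One in-flight instruction further down the pipe that writes a register.
    struct InFlight {
        bool     writes_reg;    // produces a register result?
        unsigned dest;          // destination register number
        uint64_t value;         // the result, once computed
        bool     value_ready;   // an ALU result is ready in EX; a load's is not
    };

    // Forwarding mux for one source operand: prefer the youngest producer
    // (EX), then MEM, otherwise read the register file.  A producer whose
    // value is not ready yet (the load-use case) forces a stall, signalled
    // here by returning std::nullopt.
    std::optional<uint64_t> read_operand(unsigned src_reg,
                                         const uint64_t regfile[32],
                                         const InFlight& ex_stage,
                                         const InFlight& mem_stage) {
        if (ex_stage.writes_reg && ex_stage.dest == src_reg) {
            if (!ex_stage.value_ready) return std::nullopt;   // stall one cycle
            return ex_stage.value;                            // forward from EX
        }
        if (mem_stage.writes_reg && mem_stage.dest == src_reg) {
            if (!mem_stage.value_ready) return std::nullopt;
            return mem_stage.value;                           // forward from MEM
        }
        return regfile[src_reg];                              // no hazard
    }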

    Then again, not many people are going and being like "The A extension
    makes RISC-V no longer RISC".

    BECAUSE RISC-V is already not RISC (less than 200 instructions)...


    Fair enough.

    Ironically, if I want to support 'A', this means needing to have the
    'LDOP' extension enabled, even if I am not really a fan of the cost or implications of this mechanism...

    But, 'A' is needed for 'RV64G', which is, annoyingly, what would need to
    be supported to be able to have any hope of compatibility with the Linux
    on RISC-V ecosystem.


    The common superset of BJX2 and RV64G (at least for the userland side of things) is a bit more complexity than I would prefer though.

    Well, along with the annoyance of the CPU core having functionality that
    may exist in one ISA but not the other (and, don't want to port over everything from RISC-V, as this would pollute my own ISA with things
    that don't really fit my own vision).

    Implementation-by-implementation ISA differences in a non-upwards-compatible
    fashion are not good for consumers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Thu Dec 7 21:42:03 2023
    On 12/4/2023 12:34 PM, MitchAlsup wrote:
    BGB wrote:

    snip

    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels
    model rather than traditional multi-threading with shared memory and
    synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    I am not sure what you are proposing here. While Verilog is fine for
    the domain it was designed for (a domain specific language?), it isn't
    suitable for most other things, e.g. you couldn't easily write say a
    compiler in Verilog. There are other domains where various languages
    have helped ease development of parallel programs, but they are also
    domain specific, e.g. some simulation languages, and heck, even COBOL
    had an (optional) parallel processing capability. There have also been
    various attempts to create general purpose languages that do that, e.g. dataflow languages, OCCAM, I think Ada, but AFAIK, none has been hugely successful.



    BTW, embarrassingly parallel applications usually aren't much
    of a problem, as, pretty much by definition, they have little to no
    interaction between the threads/processes (whatever you call them).

    https://en.wikipedia.org/wiki/Embarrassingly_parallel

    so you can easily fire off as many copies as you need. But change that
    to distributed computing problems, and I agree.

    So, do you have a proposal for a general purpose language that makes development of distributed computing problems easier?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to MitchAlsup on Fri Dec 8 09:44:17 2023
    On 2023-12-07, MitchAlsup wrote:
    Marcus wrote:

    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The
    main instructions that seem to have "the most bang for the buck"
    are ones that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My
    66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be
    considered RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    I choose 200 as the upper bound since 100 is obviously too small
    {even though I get by with 61} and any vectorized or SIMDed ISA
    is way more than 200.

    I still struggle to find a good definition of "1 instruction".

    1 Instruction is 1 Spelling the assembly language programmer has to
    remember.

    That is my view too. Some examples:


    BZ (branch if zero), 1 variant:

    bz r3, #foo@pc


    CLZ (count leading zeros), 6 variants:

    clz r2, r1 // scalar
    clz.b r2, r1 // scalar, packed bytes
    clz.h r2, r1 // scalar, packed half-words
    clz v2, v1 // vector
    clz.b v2, v1 // vector, packed bytes
    clz.h v2, v1 // vector, packed half-words

    AND (bitwise and), 18 variants:

    and r3, r1, r2 // scalar
    and.pn r3, r1, r2 // scalar, r1 & ~r2
    and.np r3, r1, r2 // scalar, ~r1 & r2
    and.nn r3, r1, r2 // scalar, ~r1 & ~r2
    and v3, v1, r2 // vector/scalar
    and.pn v3, v1, r2 // vector/scalar, v1 & ~r2
    and.np v3, v1, r2 // vector/scalar, ~v1 & r2
    and.nn v3, v1, r2 // vector/scalar, ~v1 & ~r2
    and v3, v1, v2 // vector
    and.pn v3, v1, v2 // vector, v1 & ~v2
    and.np v3, v1, v2 // vector, ~v1 & v2
    and.nn v3, v1, v2 // vector, ~v1 & ~v2
    and/f v3, v1, v2 // folding vector
    and.pn/f v3, v1, v2 // folding vector, v1 & ~v2
    and.np/f v3, v1, v2 // folding vector, ~v1 & v2
    and.nn/f v3, v1, v2 // folding vector, ~v1 & ~v2
    and r3, r1, #im // scalar immediate
    and v3, v1, #im // vector/scalar immediate

    Some of the variants above are superfluous (at least three AND variants
    are useless and the value of a couple more can be debated), but I can
    live with that. The symmetry and ease of encoding/decoding makes up for
    the potential loss of encoding space (of which there is plenty left).

    For me
    the definition is loosely "one distinct operation", and so there can be
    many variants of one instruction (e.g. variants with register operands
    or register + immediate operands all count as a single instruction),
    that all carry out the same operation, but with different kinds of
    operands or operand sizes.

    I hold this same view.

    In my ISA I refer to these as "major instructions", and each
    instructions typically have several variants (currently up to 18
    variants per major instruction, where different permutations of scalar
    and vector register operands count as different variants of a single
    instruction, for instance).

    VVM makes this distinction unnecessary.

    If I count this way, I currently have 106 instructions, which by your
    definition safely puts MRISC32 in the "RISC" camp. However, if I count
    every variant as a separate instruction, I blow the budget.

    My 66000 has 61 instructions under this framework. This includes {flow control, Integer, Logical, Shift, Floating point, Transcendentals, conversions, privileged, vectorization, and SIMD}

    Also worth mentioning is that my current instruction encoding scheme
    allows for 1535 major instructions, so there is still plenty of room for
    extensions (even though I already have pretty complete integer,
    floating-point and vector support).

    My 66000 encoding scheme supports 2048 1-operand instructions at the consumption of 1 Major OpCode. Only the 3-operand subGroup is stressed
    for Minor OpCodes.


    I have plenty of space left for 1-register-operand (99%) and 2-register-operands (84%) instructions, however since I encode
    immediates as part of the instruction word (unlike My 66000), the
    immediate versions are crowded. In fact the 21-bit immediate
    instructions are all used up (all seven of them). OTOH I'm pretty
    content with the ones that I have, as they cover quite some ground in
    terms of usefulness (e.g. they provide PC-relative load/store/call/jump
    with a range of +/-4MiB in a single 32-bit instruction).


    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    My point was that it should not be redefined into meaninglessness.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Marcus on Fri Dec 8 14:40:35 2023
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of
    different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    In practice, though I think RISC vs CISC is more often used to
    distinguish between positions in a range of tradeoffs common in ISA
    design, such as :

    * fixed-size, fixed-format instruction codes vs variable encodings
    * many orthogonal registers vs fewer specialised registers
    * load/store vs advanced addressing modes
    * "one thing at a time" vs combing common tasks in one instruction

    But there's no clear boundaries. The original 68k architecture was
    always classified as "CISC". Then the later ColdFire versions were
    called "Variable instruction length RISC", though there was a 90%
    overlap in the ISA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Fri Dec 8 15:19:47 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000. So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Fri Dec 8 15:38:52 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    It's fundamental nonsense, because:

    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions.

    Yes. John Mashey made this point in his repeated posts on this topic.

    The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode.

    Somewhat: "Single-cycle" is a microarchitectural property, not an ISA
    property, but yes, the idea of the first RISCs was that the ISA should
    be implementable with such a microarchitecture.

    Also, single-cycle means the issue rate on a pipelined processor.
    There were many RISC implementations that needed two cycles of latency
    for loads. And likewise, FP instructions needed multiple cycles of
    latency. And finally, the MIPS R2000 integer multiplier and divider
    was not even pipelined (but could run in parallel with the rest of the
    integer pipeline).

    There have been attempts at splitting, e.g. FP instructions into their
    parts (align, add, normalize or somesuch) as a RISCier way to do
    things, but it never was implemented in a mainstream processor. What
    has been implemented in mainstream processors:

    * no integer multiplier/divider (SPARC, HPPA, no divide on Alpha and
    IA-64), instead go for multiply step, do it in the FPU, or implement
    division through subtraction (Alpha) or fma (IA-64)).

    * no 8-bit or 16-bit memory access: eliminates a part of the aligner
    from the load data path, eliminates ECC problems for write-back
    caches (but no Alpha implementation without BWX extension had such
    problems).

    What is common in RISCs is to split large constants into sequences of instructions (e.g., for loading the constant from the global table).
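
    As an illustration of that last point (my own sketch, not tied to any
    specific ISA), this is the usual way a 32-bit constant gets split into a
    "load upper bits" part plus a signed 12-bit addend, as a lui/addi-style
    pair would materialize it; the 20/12 split is just an example:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t value = 0x12345678u;

        /* Sign-extend the low 12 bits, then let the upper part absorb the rest. */
        int32_t  lo = (int32_t)((value & 0xFFFu) ^ 0x800u) - 0x800;
        uint32_t hi = value - (uint32_t)lo;   /* low 12 bits of hi are zero */

        assert(lo >= -2048 && lo <= 2047);
        assert((hi & 0xFFFu) == 0);
        assert(hi + (uint32_t)lo == value);
        printf("hi=0x%08X lo=%d\n", (unsigned)hi, lo);
        return 0;
    }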

    I guess I forgot a few.

    You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Power(PC) is also an example of how moot it is to count instructions.
    It has, e.g., load instructions with and without update, which
    correspond to one load instruction with different addressing modes in
    ARM A64.

    In practice, though I think RISC vs CISC is more often used to
    distinguish between positions in a range of tradeoffs common in ISA
    design, such as :

    * fixed-size, fixed-format instruction codes vs variable encodings

    Many RISCs use variable-size instruction encodings, e.g., ROMP, ARM
    A32/T32 and RISC-V with the C extension.

    * many orthogonal registers vs fewer specialised registers

    VAX (the exemplary CISC) has 16 registers, like ARM A32 (first
    generation RISC).

    * load/store vs advanced addressing modes

    That's not a dichotomy. Many load/store architectures have more
    addressing modes than, e.g., AMD64 (not a load/store architecture);
    e.g. ARM A64. Power(PC), HPPA, and 88000 also have at least as many
    as AMD64.

    The dichotomy is between load/store and non-load/store architectures.
    And that's how I usually distinguish between RISC and CISC.

    However, it seems that a bigger issue is: one vs. multiple memory
    references per instruction. The VAX has multiple, which complicates
    many things, whereas (for the most part) AMD64 and load/store
    architectures have only one. There is MOVS and REP MOVS for AMD64,
    and there is ARM A32 and Power load/store multiple instructions, which
    require special treatment.

    One interesting aspect here is that modern general-purpose
    architectures all support unaligned accesses, which may require
    accessing two different cache lines and even two different pages.
    Once you support that, the load-pair/store-pair instructions of ARM
    A64 does not make loads and stores more complicated (but it
    complicates register porting).
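
    A small sketch of the check this implies in a load/store unit (my own
    illustration; the line and page sizes are merely typical values): an
    access crosses a boundary exactly when its offset within the aligned
    block plus its size spills past the block:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64u     /* assumed cache-line size */
    #define PAGE_BYTES 4096u   /* assumed page size */

    static bool crosses(uint64_t addr, unsigned size, unsigned boundary)
    {
        return (addr & (boundary - 1)) + size > boundary;
    }

    int main(void)
    {
        printf("%d\n", crosses(0x103E, 8, LINE_BYTES));  /* 1: touches two cache lines */
        printf("%d\n", crosses(0x1FFC, 8, PAGE_BYTES));  /* 1: touches two pages */
        printf("%d\n", crosses(0x1008, 8, LINE_BYTES));  /* 0: stays within one line */
        return 0;
    }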

    * "one thing at a time" vs combing common tasks in one instruction

    RISCs have done so for FP instructions, addressing modes, and by
    putting multiply and divide instructions in.

    But there's no clear boundaries.

    There are clear boundaries between load/store and other (the classical
    RISC boundary), and between lots of other properties of instruction
    sets. Of course marketing people and advocates have tried to claim
    RISCness when it was cool to be a RISC. An example can be found here:

    The original 68k architecture was
    always classified as "CISC". Then the later ColdFire versions were
    called "Variable instruction length RISC", though there was a 90%
    overlap in the ISA.

    Is Coldfire a load/store architecture? If not, it's not a RISC.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Fri Dec 8 17:40:06 2023
    David Brown wrote:

    On 07/12/2023 21:14, Marcus wrote:


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode.

    Why should::
    ADD R7,R8,#0x123456789abcdef
    take any longer to execute than::
    ADD R7,R8,R9
    ???

    You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    In practice, though I think RISC vs CISC is more often used to
    distinguish between positions in a range of tradeoffs common in ISA
    design, such as :

    * fixed-size, fixed-format instruction codes vs variable encodings
    * many orthogonal registers vs fewer specialised registers
    * load/store vs advanced addressing modes

    Like::
    lui a0, %hi(.LCPI10_2)
    ld a0, %lo(.LCPI10_2)(a0)
    instead of::
    LD R7,[IP,,.LCPT10_2]

    * "one thing at a time" vs combing common tasks in one instruction

    Like:
    fmv.x.d a0, ft6
    call log@plt
    fmv.d.x ft0, a0
    instead of::
    LOG R7,R9

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Dec 8 17:42:31 2023
    Scott Lurndal wrote:

    David Brown <david.brown@hesbynett.no> writes:


    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the whole, be single-cycle and implemented directly in the hardware, rather than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    PDP-8 fails to be RISC because it does not have a large number of GPRs.
    Does not really have LDs (has only a LD-Op).

    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Fri Dec 8 19:56:00 2023
    On 08/12/2023 16:19, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000. So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of
    different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    There was a microcontroller that we once considered for a project, which
    had only a single instruction - "move". We ended up with a different
    chip, so I never got to play with it in practice.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Fri Dec 8 20:06:22 2023
    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you wrote.)


    Is Coldfire a load/store architecture? If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store architecture" and a "non-load/store architecture". And I agree that it
    is usually a more important distinction than the number of instructions,
    or the complexity of the instructions, or any other distinctions.

    But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?
    Things have changed a lot since the term "RISC" was first coined, and
    maybe architectural and ISA features are so mixed that the terms "RISC"
    and "CISC" have lost any real meaning. If that's the case, then we
    should simply talk about LSA and NLSA architectures, and stop using
    "RISC" and "CISC". I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Fri Dec 8 19:39:43 2023
    David Brown wrote:

    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you wrote.)


    Is Coldfire a load/store architecture? If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store architecture" and a "non-load/store architecture". And I agree that it
    is usually a more important distinction than the number of instructions,
    or the complexity of the instructions, or any other distinctions.

    Would CDC 6600 be considered to have a LD/ST architecture ??

    But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?
    Things have changed a lot since the term "RISC" was first coined, and

    It HAS been 43 years since being coined.

    maybe architectural and ISA features are so mixed that the terms "RISC"
    and "CISC" have lost any real meaning. If that's the case, then we
    should simply talk about LSA and NLSA architectures, and stop using
    "RISC" and "CISC". I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Fri Dec 8 19:50:34 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 08/12/2023 16:19, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000. So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of
    different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple
    (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few >instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    Given the age of the PDP-8, I'd argue that the instructions
    are anything but complex. Leaving aside the optional EAE extension
    which provided multiplication and division.

    A load (hooked to the adder), a store, and a few logic operations.

    The IOT instruction is effectively an MMIO operation, as in
    the instruction was put on the bus and the I/O controller
    responded appropriately as if it were a load or store operation.

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Likewise, the complexity that RISC was attempting to address
    were instructions like the Vax POLY, MOVC3/MOCV5 and the
    queuing instructions (insert & remove).

    The entire RISC vs CISC argument seems somewhat contrived
    in these modern times.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to MitchAlsup on Fri Dec 8 21:17:37 2023
    On 08/12/2023 20:39, MitchAlsup wrote:
    David Brown wrote:

    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you
    wrote.)


    Is Coldfire a load/store architecture?  If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store
    architecture" and a "non-load/store architecture".  And I agree that
    it is usually a more important distinction than the number of
    instructions, or the complexity of the instructions, or any other
    distinctions.

    Would CDC 6600 be considered to have a LD/ST architecture ??

    I don't know - that was /long/ before my time!


    But does that mean LSA vs. NLSA should be used to /define/ RISC vs
    CISC?   Things have changed a lot since the term "RISC" was first
    coined, and

    It HAS been 43 years since being coined.

    maybe architectural and ISA features are so mixed that the terms
    "RISC" and "CISC" have lost any real meaning.  If that's the case,
    then we should simply talk about LSA and NLSA architectures, and stop
    using "RISC" and "CISC".  I don't think trying to redefine "RISC" to
    mean something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Dec 8 23:11:28 2023
    Scott Lurndal wrote:


    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions,
    but it certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Likewise, the complexity that RISC was attempting to address
    were instructions like the Vax POLY, MOVC3/MOCV5 and the
    queuing instructions (insert & remove).

    CALL, RET, and EDIT were nightmares to pipeline, too.
    But above that:: VAX address modes prevented pipelining.

    The entire RISC vs CISC argument seems somewhat contrived
    in these modern times.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Fri Dec 8 23:14:08 2023
    David Brown wrote:

    On 08/12/2023 20:39, MitchAlsup wrote:
    David Brown wrote:

    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you
    wrote.)


    Is Coldfire a load/store architecture?  If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store
    architecture" and a "non-load/store architecture".  And I agree that
    it is usually a more important distinction than the number of
    instructions, or the complexity of the instructions, or any other
    distinctions.

    Would CDC 6600 be considered to have a LD/ST architecture ??

    I don't know - that was /long/ before my time!

    If you wrote into A1..A5 then X1..X5 was loaded from memory
    If you wrote into A6..A7 then X6..X7 was stored into memory

    Peripheral processors (I/O controllers) performed the job of the OS,
    leaving the CPUs strictly for number crunching.


    But does that mean LSA vs. NLSA should be used to /define/ RISC vs
    CISC?   Things have changed a lot since the term "RISC" was first
    coined, and

    It HAS been 43 years since being coined.

    maybe architectural and ISA features are so mixed that the terms
    "RISC" and "CISC" have lost any real meaning.  If that's the case,
    then we should simply talk about LSA and NLSA architectures, and stop
    using "RISC" and "CISC".  I don't think trying to redefine "RISC" to
    mean something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sat Dec 9 10:09:45 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    Surely then, the PDP-8 can be counted as a RISC processor.

    I don't count it as a RISC, because it's too different from the
    architectures that are commonly seen as RISCs:

    1. It is not a load-store architecture
    5. It does not have 16 or more general-purpose registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Sat Dec 9 10:19:12 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Would CDC 6600 be considered to have a LD/ST architecture ??

    If I understand the description right, the load or store happen as
    side effects of an operation that writes to A1..A7. And it only loads
    to X1..X5 and stores from X6..X7. If somebody says that an
    architecture is a load-store architecture, I certainly do not expect
    such restrictions; I actually expect a register machine (i.e., with
    GPRs), but the CDC-6600 has three sets of special-purpose registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to A. Clayton on Sat Dec 9 21:07:00 2023
    In article <ul2bu6$2a7gb$4@dont-email.me>, paaronclayton@gmail.com (Paul
    A. Clayton) wrote:

    For Itanium, binary translation provided better performance on
    the same hardware, so it was more evident that the compatibility
    had a mediocre performance target.

    From memory of conversations with Intel people in 1997-2000, they thought
    the hardware-provided compatibility would be faster than it turned out.
    That suggests they expected higher clockspeeds, and that the translator
    would make effective use of Itanium bundled instructions.

    I accidentally benchmarked the 667MHz Merced running IA-32 code, by
    selecting the wrong build tree, and it was about a third of the
    performance of optimised native code. That suggests little use of bundled instructions.

    The other reason the IA-32 emulation seemed so slow was that IA-32
    performance standards rose considerably while Merced was being designed
    and built. That was due to the clockspeed war between Pentium II/III and
    AMD's Athlon. That disrupted Intel's plans for a slower clockspeed ramp,
    and Intel's response gave the world NetBurst.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sun Dec 10 10:56:36 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    But it is the job of the programmer to keep everything that can be parallel in
    mind... Would you write a compiler, or a word processor, in Verilog?
    How much harder would that be, compared to a serial language?

    My personal favorites for parallel programming are PGAS languages
    like (yes, you guessed it) Fortran, where the central data
    structure is the coarray.

    On image X, you can manipulate data on your own image, and you can
    access data on another image (let's call it Y) in these coarrays via
    special syntax, as a[Y].

    You have to make sure that you synchronize before accessing data
    that has been modified on another image.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sun Dec 10 10:39:31 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).

    Formula _numbering_ - now that, Microsoft managed to make worse
    (which simply comes naturally in LaTeX).

    And, come to think of it, since Office 365 (I think) they now
    allow direct use of svg files as graphics, allowing two
    non-braindead ways of including pdf graphics in Word - either
    via Inkscape (read as pdf, write as svg) or through command-line
    tools (usually via Cygwin).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Sun Dec 10 10:51:03 2023
    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Tim Rentsch on Sun Dec 10 19:16:02 2023
    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.


    So what we can take from this is that RISC as a term has become meaningless.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Dec 10 20:09:44 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.


    So what we can take from this is that RISC as a term has become meaningless.

    Or that it never had meaning, in the sense you're looking for.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Tim Rentsch on Mon Dec 11 18:28:09 2023
    On Sun, 10 Dec 2023 10:51:03 -0800, Tim Rentsch wrote:

    Of course the PDP-8 is a RISC. These properties may have been common
    among some RISC processors, but they don't define what RISC is. RISC is
    a design philosophy, not any particular set of architectural features.

    I can't agree.

    Your final sentence may be true enough, but I think that the architectural feature of being a load-store architecture is very much indicative of
    whether the RISC design philosophy was being followed. Of course, it isn't absolutely _decisive_, as Concertina II demonstrates.

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Mon Dec 11 22:47:56 2023
    On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    But it is the job of the programmer to keep everything that can be parallel in mind... Would you write a compiler, or a word processor, in Verilog?
    How much harder would that be, compared to a serial language?

    There are (at least) 3 problems:

    1: most programming languages are predominantly serial and their
    support for parallelism (relatively) is poor.

    2: many programmers are much better at figuring out what CAN be
    done in parallel than they are at figuring out what SHOULD be
    done in parallel. The result often is too many threads each
    making very little progress.

    3: the skill level of the average programmer now is only slightly
    above "novice". More software is being written now than ever
    before, but the vast majority of it is poor quality.


    Better languages can help, but "better" in my view does not include C.


    My personal favorites for parallel programming are PGAS languages
    like (yes, you guessed it) Fortran, where the central data
    structure is the coarray.

    Mileage varies considerably and I don't intend to start a language
    war: a lot has to do with the history of parallel applications a
    person has developed. I can respect your point of view even though I
    don't agree with it.

    My favorite model is CSP (ala Hoare) with no shared memory. Which is
    not to say I don't use threads, but I try to design programs such that
    threads (mostly) are not sharing writable data structures.
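
    As a rough sketch of that style (my own example, not George's code), two
    threads in C connected by a pipe rather than by a shared writable
    structure; the pipe plays the role of a CSP channel here:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int chan[2];   /* pipe: chan[0] = read end, chan[1] = write end */

    static void *consumer(void *arg)
    {
        (void)arg;
        int64_t v, sum = 0;
        /* Receive messages until the producer closes its end of the "channel". */
        while (read(chan[0], &v, sizeof v) == (ssize_t)sizeof v)
            sum += v;
        printf("sum = %lld\n", (long long)sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        if (pipe(chan) != 0)
            return 1;
        pthread_create(&t, NULL, consumer, NULL);
        for (int64_t i = 1; i <= 100; i++)    /* send 100 messages */
            write(chan[1], &i, sizeof i);
        close(chan[1]);                       /* closing the write end ends the consumer */
        pthread_join(t, NULL);
        close(chan[0]);
        return 0;
    }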


    You have to make sure that you synchronize before accessing data
    that has been modified on another image.

    And that's where most languages fall down: they provide either
    primitives which are too low level and too hard to use correctly, or
    they provide high level mechanisms that don't scale well and are too
    limiting in actual use.

    Which goes back to #3 above. Repeated studies have shown that most
    programmers can't write correct parallel code to operate on shared
    data structures. The results are congruent with, but even worse than,
    the studies on memory management which showed most programmers had
    trouble with manual (de)allocation of shared structures.

    I've been programming for 40 years now, and I have yet to see a
    language that I would want to hand to a novice intending to write
    parallel code. I've seen what I think have been some good approaches,
    but the languages involved: Lisps/Schemes, functional, and constraint
    logic languages ... are just too different for many people to grasp.

    Again, MMV.
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 12 23:07:04 2023
    Paul A. Clayton wrote:

    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]
    My point on VLEness is that all the position and length
    information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC
    than x86.

    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains
    RISCable if it does not violate other RISC tenets.

    I would have guessed that encoding a length to next length
    indicator would also be somewhat simple for decode when the
    additional chunk contains only opcode extension that does not
    affect routing (which includes some hints) or immediate data. In
    terms of parsing the instruction stream into instructions, such is
    not different than having a nop with the length specified in *its*
    first container. [see ENDNOTE]

    My VLE encoding (4-bits) deals with constants (±5-bits, 32-bits, 64-bits)
    and operand sign control {rs1,rs2..rs1,-rs2..-rs1,rs2..-rs1,-rs2}
    The trick is finding where to place these bits
    such that the same bits are used in {1-operand, 2-operand, 3-operand,
    and memory references.} This means you can decode them prior to
    determining the instruction subGroup. And you cannot move a 5-bit
    register specifier.....

    My 66000's instruction modifiers seem to add some decoding
    complexity in that bits of the container are distributed to the
    following instructions (which may themselves be variable length);
    clearly, this is considered acceptable complexity. I think a
    DOUBLE prefix was also proposed (architected? it was not in the 28
    Jan 2020 version that I have) that encoded additional operands
    into the prefix, forming a kind of explicit instruction fusion.

    Yes, I toyed with a DBLE instruction. Its job was to give 3 more
    operands and 1 more result register in support of 128-bit register calculations and memory references. It can be resurrected if desired.
    But I don't think there is currently enough demand for 128-bit except
    in market niches so small that I am not interested.

    (I have a suspicion that a large-chunk instruction encoding with
    borrow-lend across chunks could facilitate code density while
    providing some of the advantages of fixed-length encoding. I have
    not thought about this deeply, but I sense there may be problems
    with allowing arbitrary bits to be borrowed. Limiting such to
    immediates might reduce the exposure to danger. However, I
    suspect that more emphasis should be on targeting an OoO
    implementation than on code density.)

    === ENDNOTE ===
    An encoding with multiple length specifiers might theoretically
    reduce the overhead of encoding the length for the more common
    short cases — perhaps by an entire bit!!!!☺ — but in addition to increasing the size overhead for longer instructions it would
    split large fields by inserting the length extension specifier.
    The extra size overhead also means that 32-bit and 64-bit
    immediates could further bloat the instruction by requiring an
    additional parcel for only a few bits. One bit of length
    information effectively becomes a marker bit per parcel, which is
    a technique that was used for some x86 pre-decoded caches and for
    one ISA that encoded immediates.

    If there were 16 instruction lengths, perhaps a split length
    specifier *might* make sense, but My 66000's five instruction
    lengths obviously does not take that much space. I believe the
    lengths are not fully orthogonal, so it does not take 2.32 bits.
    (I think only a store can be longer than 3 parcels, though perhaps
    some 3-input compute operations might be theoretically able to use
    two constant inputs.)

    My 66000 uses 4-bits, you found 2.32-bits of utility, another 1.2
    bits of utility are used to allow 5-bit constants replacing the
    5-bit register specifier; and the rest are sign control over the
    operands. The 4 bits are completely used {16 patterns; only 2
    are lightly used}.

    Yet if the extra bits are not on a critical path (such as register specifiers) such a clunkier mechanism might not be so horrible.
    Maybe only 0.2 x86s (the unit of measurement for ISA horror☺).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Dec 13 04:58:13 2023
    On Tue, 05 Dec 2023 01:08:01 +0000, MitchAlsup wrote:

    Gallium Arsenide is 5×; hideously expensive, dangerous to the workers in
    the FAB, and chemical disposal, low yield,.....

    If the yield is _so_ low that they can't make anything bigger than an
    8086 out of it, then of course that means they're losing more than the
    5x gain.

    But if they could make a Pentium Pro or Pentium II out of Gallium Arsenide,
    and get, say, a 10% yield, I'm sure that government (specifically military) users would be happy to pay the price premium for it. Even if it's
    exorbitant, like $20,000 per processor.

    I'm not saying that it would be for *everybody*.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Dec 13 05:09:52 2023
    On Tue, 12 Dec 2023 23:07:04 +0000, MitchAlsup wrote:

    My VLE encoding (4-bits) deals with constants (±5-bits, 32-bits,
    64-bits)
    and operand sign control {rs1,rs2..rs1,-rs2..-rs1,rs2..-rs1,-rs2}
    The trick is finding where to place these bits such
    that the same bits are used in {1-operand, 2-operand, 3-operand,
    and memory references.} This means you can decode them prior to
    determining the instruction subGroup. And you cannot move a 5-bit
    register specifier.....

    Of course, Concertina II solves this problem too, even if it does
    so in a way which you believe to be the wrong way. But of course it
    can also be solved in a relatively simple way with an encoding of
    the first few bits of a variable-length instruction - the trick to
    keeping it simple would be to have _two_ sets of prefix bits, because
    there are only a few lengths for instructions, and a few lengths
    for constants used as immediates - it's only the _combinations_ that
    get out of control.

    And you're already using that trick, IIRC, so you don't need to
    take any lessons from the monstrosity that is Concertina II.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Dec 13 12:31:50 2023
    On Wed, 13 Dec 2023 05:09:52 +0000, Quadibloc wrote:

    the trick to keeping it simple
    would be to have _two_ sets of prefix bits, because there are only a few lengths for instructions, and a few lengths for constants used as
    immediates - it's only the _combinations_ that get out of control.

    Actually, there is one other thing. So that the instructions for which immediates are used can have one consistent format, unlike the prefixes
    for instruction length, which can have different numbers of bits, the
    prefixes for constant length need to all be the same number of bits in
    length.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Dec 13 13:14:40 2023
    On Wed, 13 Dec 2023 12:31:50 +0000, Quadibloc wrote:

    On Wed, 13 Dec 2023 05:09:52 +0000, Quadibloc wrote:

    the trick to keeping it simple would be to have _two_ sets of prefix
    bits, because there are only a few lengths for instructions, and a few
    lengths for constants used as immediates - it's only the _combinations_
    that get out of control.

    Actually, there is one other thing. So that the instructions for which immediates are used can have one consistent format, unlike the prefixes
    for instruction length, which can have different numbers of bits, the prefixes for constant length need to all be the same number of bits in length.

    I went back and checked. When I proposed what a variable-length coding
    for the Concertina instruction set would look like, at first glance it
    seemed as though I didn't follow that rule:

    Something like:

    0 - 16 bits
    1 - 32 bits, except
    111011001 32 bits + 16 bits
    111011010 32 bits + 32 bits
    111011011 32 bits + 64 bits
    1110111000 32 bits + 48 bits
    1110111001 32 bits + 32 bits
    1110111010 32 bits + 64 bits
    1110111011 32 bits + 128 bits
    11110 - 48 bits
    11111 - 64 bits

    But notice that the lengths 32 and 64 bits appear twice. So this
    actually was intended to keep the instruction format constant;
    the prefixes that were one bit longer were for floating-point
    instructions. While both integer and floating-point instructions
    use register banks of 32 registers, the difference is that there
    are two load-store instructions for floats - LOAD and STORE -
    for integers there are also (where the integer type is shorter than
    the register) LOAD UNSIGNED (zero out all unused higher bits) and
    INSERT (leave all higher bits unaffected) in addition to LOAD
    (sign extend into all unused bits of the register more significant
    than the leading bit of the argument type).
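
    A minimal sketch of those three integer-load behaviours (my own
    illustration, using a 16-bit memory operand and a 32-bit register purely
    as an example; the function names are just labels):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t load_signed  (uint16_t mem)               { return (uint32_t)(int32_t)(int16_t)mem; }
    static uint32_t load_unsigned(uint16_t mem)               { return (uint32_t)mem; }
    static uint32_t insert       (uint32_t reg, uint16_t mem) { return (reg & 0xFFFF0000u) | mem; }

    int main(void)
    {
        uint32_t reg = 0xAAAAAAAAu;
        uint16_t mem = 0x8001u;   /* negative if viewed as a signed 16-bit value */

        printf("LOAD          -> %08X\n", (unsigned)load_signed(mem));    /* FFFF8001: sign-extended   */
        printf("LOAD UNSIGNED -> %08X\n", (unsigned)load_unsigned(mem));  /* 00008001: zero-extended   */
        printf("INSERT        -> %08X\n", (unsigned)insert(reg, mem));    /* AAAA8001: upper bits kept */
        return 0;
    }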

    So I did know of that principle, and was following it, with a
    slight customization for the specifics of the Concertina II
    instruction set.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Wed Dec 13 15:42:43 2023
    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formulas, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked to nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX. Then I could make reasonable-looking documents for customers that insisted on having docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added to
    word processors for decades. Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /
    LibreOffice, IIRC). And then when styles, numbering, templates and pdf
    export got good enough that you could make pdfs with real table of
    contents, clickable links, etc., so that word processed documents could
    look almost professional.

    Apart from that, the only benefits I see of newer LibreOffice over older
    ones is better handling of the insane chaos that MS Office uses for its
    file formats. LibreOffice is /much/ better at this than MS Office is, especially if the file has been modified by a number of different MS
    Office versions.

    Formula _numbering_ - now that, Microsoft managed to make worse
    (which simply comes naturally in LaTeX).

    And, come to think of it, since Office 365 (I think) they now
    allow direct use of svg files as graphics, allowing two
    non-braindead ways of including pdf graphics in Word - either
    via Inkscape (read as pdf, write as svg) or through command-line
    tools (usually via Cygwin).

    LibreOffice has had that for ages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Paul A. Clayton on Wed Dec 13 09:47:39 2023
    Paul A. Clayton wrote:
    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]
    My point on VLEness is that all the position and length information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC than x86.

    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains RISCable
    if it does not violate other RISC tenets.

    I would have guessed that encoding a length to next length
    indicator would also be somewhat simple for decode when the
    additional chunk contains only opcode extension that does not
    affect routing (which includes some hints) or immediate data. In
    terms of parsing the instruction stream into instructions, such is
    not different than having a nop with the length specified in *its*
    first container. [see ENDNOTE]

    If one designs the ISA on the assumption that there will be separate
    stages for Fetch and Decode, and I think that's a good idea,
    then there are two parses taking place, the external inter-instruction
    parse performed by Fetch, and internal instruction field parse by Decode.

    The Fetch length parse needs to be simple *except* that Fetch needs to
    be able to pick off all conditional and unconditional branch, call, ret,
    and consult the branch predictors for which-path information.

    Additionally for BRcc/CALL Fetch needs access to the branch offset,
    which is an internal parse, to add to its future RIP
    in case the predictor says to follow the alternate path.
    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    Everything else is an internal parse by Decode where it is mostly a matter
    of chopping things up. For instructions with immediates, RISC-V designers
    seemed to be very concerned about sign/zero extension delay and the
    location of the sign bit but I'm not sure why - to me it looks like a
    single mux delay at the end. And if all immediates are parsed by Fetch,
    because it needs the BR/CALL offset, then these might arrive in the
    Decode input buffer already parsed and sign extended.
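
    A minimal sketch of that "single mux at the end" view (the function and
    type names here are illustrative, not from any of the ISAs discussed):
    every candidate immediate width is sign extended in parallel from the
    raw instruction bits, and the decoded size only has to select one result.

    #include <stdint.h>

    enum ImmSize { IMM16, IMM32, IMM64 };

    /* The three extensions are effectively parallel wires in hardware;
       the decoded size drives one final multiplexer. */
    static int64_t select_immediate(uint64_t raw, enum ImmSize size)
    {
        int64_t ext16 = (int16_t)(uint16_t)raw;
        int64_t ext32 = (int32_t)(uint32_t)raw;
        int64_t ext64 = (int64_t)raw;

        switch (size) {              /* the single mux at the end */
        case IMM16: return ext16;
        case IMM32: return ext32;
        default:    return ext64;
        }
    }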

    === ENDNOTE ===
    An encoding with multiple length specifiers might theoretically
    reduce the overhead of encoding the length for the more common
    short cases — perhaps by an entire bit!!!!☺ — but in addition to increasing the size overhead for longer instructions it would
    split large fields by inserting the length extension specifier.
    The extra size overhead also means that 32-bit and 64-bit
    immediates could further bloat the instruction by requiring an
    additional parcel for only a few bits. One bit of length
    information effectively becomes a marker bit per parcel, which is
    a technique that was used for some x86 pre-decoded caches and for
    one ISA that encoded immediates.

    If there were 16 instruction lengths, perhaps a split length
    specifier *might* make sense, but My 66000's five instruction
    lengths obviously do not take that much space. I believe the
    lengths are not fully orthogonal, so it does not take 2.32 bits.
    (I think only a store can be longer than 3 parcels, though perhaps
    some 3-input compute operations might be theoretically able to use
    two constant inputs.)

    Yet if the extra bits are not on a critical path (such as register specifiers), a clunkier mechanism like this might not be so horrible.
    Maybe only 0.2 x86s (the unit of measurement for ISA horror☺).

    The other considerations are frequency of occurrence of instructions
    and the relative cost of the length bits in the parse tokens.
    A 2-bit length field can be simple but in a 16-bit token it also
    only allows 25% of the opcode space for the shortest instructions,
    which is where opcodes are most precious.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Wed Dec 13 19:06:49 2023
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    When I want to write an unmisunderstandable formula I use CorelDraw
    and then export as *.jpg. {Everything, except NGs like this, can take
    *.jpgs.} And a Draw program can create symbols that are not in
    character Maps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Wed Dec 13 22:40:57 2023
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX. Then I could make reasonable-looking documents for customers that insisted on having docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added to
    word processors for decades. Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    <snip>

    Apart from that, the only benefits I see of newer LibreOffice over older
    ones is better handling of the insane chaos that MS Office uses for its
    file formats. LibreOffice is /much/ better at this than MS Office is, especially if the file has been modified by a number of different MS
    Office versions.

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Wed Dec 13 22:36:02 2023
    Quadibloc wrote:

    On Wed, 13 Dec 2023 12:31:50 +0000, Quadibloc wrote:

    I went back and checked. When I proposed what a variable-length coding
    for the Concertina instruction set would look like, at first glance it
    seemed as though I didn't follow that rule:

    Something like:

    0 - 16 bits
    1 - 32 bits, except
    111011001 32 bits + 16 bits
    111011010 32 bits + 32 bits
    111011011 32 bits + 64 bits
    1110111000 32 bits + 48 bits
    1110111001 32 bits + 32 bits
    1110111010 32 bits + 64 bits
    1110111011 32 bits + 128 bits
    11110 - 48 bits
    11111 - 64 bits
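
    For illustration only, a minimal sketch of decoding the total length from
    the leading bits of the first 16-bit parcel in the table above (my reading
    of the table, assuming the prefix occupies the most significant bits; the
    function name is made up):

    #include <stdint.h>

    static unsigned instr_length_bits(uint16_t parcel)
    {
        unsigned top5  = parcel >> 11;
        unsigned top9  = parcel >> 7;
        unsigned top10 = parcel >> 6;

        if ((parcel >> 15) == 0) return 16;
        if (top9  == 0x1D9)      return 32 + 16;   /* 111011001  */
        if (top9  == 0x1DA)      return 32 + 32;   /* 111011010  */
        if (top9  == 0x1DB)      return 32 + 64;   /* 111011011  */
        if (top10 == 0x3B8)      return 32 + 48;   /* 1110111000 */
        if (top10 == 0x3B9)      return 32 + 32;   /* 1110111001 */
        if (top10 == 0x3BA)      return 32 + 64;   /* 1110111010 */
        if (top10 == 0x3BB)      return 32 + 128;  /* 1110111011 */
        if (top5  == 0x1E)       return 48;        /* 11110      */
        if (top5  == 0x1F)       return 64;        /* 11111      */
        return 32;                                 /* all other 1... patterns */
    }

    The point is that every length is determined by the first parcel alone,
    so the Fetch-stage length parse stays a prefix test rather than a serial walk.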

    My 66000 can do::

    FADD R7,#1,R9 // the #1 is interpreted as +1.0D0
    FADD R7,#-1,R9 // the #-1 is interpreted as -1.0D0
    FMAC R7,R8,R9,#1 // a+b+1
    FDIV R7,#1,R9 // reciprocate
    CVTID R1,#28 // R7 = 28D0

    And all of these are 32-bit instructions. In addition we have::

    CVTFD R1,#377 // R7 = 377D0
    FADD R7,R8,#799 // R7 = R9+799D0
    FMAC R7,R8,#799,R9 // R7 = r8*799+R9

    as 64-bit instructions.

    ~00 Instruction is in the Major OpCode group
    00 Instruction can have long constants
    000 XOM Instruction is in the negative eXtended OpCode group
    001 XOP Instruction is in the positive eXtended OpCode group
    ----------------
    bits<15,11,14,12>
    0000 +Rs1 +Rs2
    0001 +Rs1 -Rs2
    0010 -Rs1 +Rs2
    0011 -Rs1 -Rs2
    0100 +Rs1 +imm5
    0101 +imm5 +Rs2
    0110 +Rs1 -imm5
    0111 -imm5 +Rs2
    1000 +Rs1 #imm32
    1001 #imm32 +Rs2
    1010 -Rs1 #imm32
    1011 #imm32 -Rs2
    1100 +Rs1 #imm64
    1101 #imm64 +Rs2
    1110 -Rs1 #imm64
    1111 #imm64 -Rs2

    The 5-bit immediates, when used in 32-bit FP calculations, are expanded
    into float32.
    The 5-bit immediates, when used in 64-bit FP calculations, are expanded
    into double64.
    The 32-bit immediates, when used in 64-bit FP calculations, are expanded
    into double64--this requires the compiler not put denorms in these
    constants.
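
    As a rough illustration of why the denorm restriction matters, a sketch
    (my own, with made-up names) of the straightforward field expansion of a
    float32 bit pattern into a double64 bit pattern; it is exact for zeros,
    normals, infinities and NaNs, but a denormal float32 input would need an
    extra normalize-and-rebias step that this simple rewiring does not do:

    #include <stdint.h>

    static uint64_t expand_f32_to_f64(uint32_t f)
    {
        uint64_t sign = (uint64_t)(f >> 31) << 63;
        uint32_t exp8 = (f >> 23) & 0xFF;
        uint64_t frac = (uint64_t)(f & 0x7FFFFF) << 29;  /* 23-bit -> 52-bit fraction */
        uint64_t exp11;

        if (exp8 == 0)
            exp11 = 0;                            /* +/-0; denorms not handled here */
        else if (exp8 == 0xFF)
            exp11 = 0x7FF;                        /* Inf / NaN */
        else
            exp11 = (uint64_t)exp8 - 127 + 1023;  /* rebias 8-bit -> 11-bit exponent */

        return sign | (exp11 << 52) | frac;
    }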

    I found it very useful to separate those instructions that can have long constants from those that cannot.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Dec 13 22:58:36 2023
    EricP wrote:

    Paul A. Clayton wrote:
    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]
    My point on VLEness is that all the position and length information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC than x86.
    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains RISCable
    if it does not violate other RISC tenets.

    I would have guessed that encoding a length to next length
    indicator would also be somewhat simple for decode when the
    additional chunk contains only opcode extension that does not
    affect routing (which includes some hints) or immediate data. In
    terms of parsing the instruction stream into instructions, such is
    not different than having a nop with the length specified in *its*
    first container. [see ENDNOTE]

    If one designs the ISA on the assumption that there will be separate
    stages for Fetch and Decode, and I think that's a good idea,
    then there are two parses taking place, the external inter-instruction
    parse performed by Fetch, and internal instruction field parse by Decode.

    My 66000 ISA was designed under the notion that there is::

    FETCH -- PARSE -- DECODE --

    The parse stage includes address comparison (hit 5-gates) and Set selection
    (4 gates) along with instruction length decode (4-gates). This leaves the
    SRAMs of Fetch 1-whole clock from flopped-address to flopped-data.

    The Fetch length parse needs to be simple *except* that Fetch needs to
    be able to pick off all conditional and unconditional branch, call, ret,
    and consult the branch predictors for which-path information.

    I am going to disagree, here, in that one can run fetch entirely from predictors without knowing if the previous fetch satisfied this or that.
    There is time to sort this out later as long as the predictor is good.

    Additionally for BRcc/CALL Fetch needs access to the branch offset,

    None of {R2000, SPARC V8, Mc88100, CRIPS} did that in fetch, we
    all did that in decode--hence the delay slot.

    which is an internal parse, to add to its future RIP
    in case the predictor says to follow the alternate path.

    The predictor that says: "follow the alternate path" can supply
    an index (6-8 bits) and access alternate path instructions
    {and sort out the minutia later}.

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    Everything else is an internal parse by Decode where it is mostly a matter
    of chopping things up.

    Once your ISA bites off on VLE, you basically need a PARSE stage
    {or DECODE 1 & 2 stages}. The PARSE stage (as mentioned above)
    can absorb the hit and set select gate delays, taking pressure off
    FETCH and DECODE. What PARSE delivers to DECODE is the instruction-
    specifiers of all instructions to be DECODEd that cycle {and unary
    pointers to constants--which become more inputs to the forwarding
    logic}.

    For instructions with immediates, RISC-V designers seemed to be very concerned about sign/zero extension delay and the
    location of the sign bit but I'm not sure why - to me it looks like a
    single mux delay at the end. And if all immediates are parsed by Fetch, because it needs the BR/CALL offset, then these might arrive in the
    Decode input buffer already parsed and sign extended.

    Based on the frequencies RISC-V implementations have achieved to
    date--this is a poor assumption.

    In My 66000 case, there are at least 10 gates of delay to perform sign
    extension prior to consuming the constant at forwarding.

    === ENDNOTE ===
    An encoding with multiple length specifiers might theoretically
    reduce the overhead of encoding the length for the more common
    short cases — perhaps by an entire bit!!!!☺ — but in addition to
    increasing the size overhead for longer instructions it would
    split large fields by inserting the length extension specifier.
    The extra size overhead also means that 32-bit and 64-bit
    immediates could further bloat the instruction by requiring an
    additional parcel for only a few bits.

    And THAT is why you don't do it that way!!

    One bit of length
    information effectively becomes a marker bit per parcel, which is
    a technique that was used for some x86 pre-decoded caches and for
    one ISA that encoded immediates.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Dec 13 23:10:38 2023
    BGB wrote:

    On 12/13/2023 8:47 AM, EricP wrote:
    Paul A. Clayton wrote:
    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]

    Luckily, if one can classify each instruction word into one of:
    16-bit op;
    32-bit scalar op;
    32-bit bundle;
    32-bit jumbo prefix.

    By eliminating 16-bit Ops, headers and prefixes (except in rare circumstances) one gets rid of the mess.

    Then looking at 1 or 2 instruction words for a 1-3 word instruction
    isn't too much of an ask.


    Additionally for BRcc/CALL Fetch needs access to the branch offset,

    Technically it is a displacement.....

    which is an internal parse, to add to its future RIP
    in case the predictor says to follow the alternate path.

    By the time you are doing 4-wide and wider, you quit thinking
    like this. You predict it, and sort it out later. In Mc88120
    we did not verify correct branch target address until the
    branch instruction executed. This did not show up on the top
    10 things slowing the CPU down.

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    So, have 1 size for conditional and 1 size for unconditional--

    AND THEN, you build a sign extending adder that adds 64+16 and
    produces the correct sign extended 64-bit result. Guys it is not
    1975 anymore !! Why do you think this is a serial process ??

    Also Note:: The address produced by the branch target adder only
    needs to produce enough bits to index the cache {you can sort out
    all the harder stuff later} in the DECODE stage of the pipeline.
    A check for page crossing (because you don't necessarily need to
    access the TLB) finishes the problem.

    More or less how it is done in my case, except it works by computing PC
    + one of several different branch-sizes (8s, 11s, and 20s), and if the

    With the cache sizes you have shown in the past (and word accesses)
    you probably don't need to calculate more than 11-bits in the decode
    cycle. Once you have enough bits to index the SRAM macro which comprises
    the cache, every other bit needed to finish the calculation can be deferred
    to later.

    corresponding branch hits (matches the pattern and is selected as
    "taken") it then uses this output as the destination (via MUX'ing).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to MitchAlsup on Thu Dec 14 09:10:04 2023
    On 13/12/2023 20:06, MitchAlsup wrote:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    When I want to write an unmisunderstandable formula I use CorelDraw
    and then export as *.jpg. {Everything, except NGs like this, can take *.jpgs.} And a Draw program can create symbols that are not in
    character Maps.

    Drawing programs can be useful if you want some unusual symbols (though
    they would have to be /very/ unusual if they are not in some package on
    CTAN). And of course you can be /much/ freer with the layout of the maths.

    If I needed something with such freehand layout, I'd write it on paper
    and scan it in, as it is much faster to do. That's also fine for notes,
    or documentation only read by the development team. But it is not "professional" quality, if that is important for the job in hand.

    If you are making such files, I'd suggest png as a better format than
    jpg - it is far better suited to sharp contrast images. jpg is for
    photographs and similar images, and will blur the lines and figures on
    drawn maths. (Or use a vector image format, like svg.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to MitchAlsup on Thu Dec 14 09:57:55 2023
    On 13/12/2023 23:40, MitchAlsup wrote:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make
    reasonable-looking documents for customers that insisted on having
    docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added
    to word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Word 2007, according to Wikipedia, google, and the never-wrong internet community. Prior to that, people used "pdf printers" which gave basic
    pdf output (image only - no links, contents, cross-references, etc.).
    Or they did a lot of manual work using expensive Adobe Acrobat Writer
    tools so that they could add the "active" bits.

    My experience with MS Office is mostly in helping others - I haven't had
    that overrated, overpriced monstrosity on a computer since Word for
    Windows 2.0 on Windows 3.1. But it seems that these days it does a lot
    better job at exporting pdfs than it used to. I did a quick test with
    online Office 365 with an old LibreOffice document, and Office 365 did
    get the table of contents right when exporting, and the cross-references
    had the right section numbers and were clickable. But it failed to get
    the cross-referenced section names in the pdf, despite showing them fine
    in the docx file it was editing. It was not an extensive test, and the original document was written with LibreOffice, not MS Office. (It was exported from LibreOffice in docx, thus it was in the official ISO
    standard ooxml format, rather than the screwed up version of that which
    MS Office prefers.)


    <snip>

    Apart from that, the only benefits I see of newer LibreOffice over
    older ones is better handling of the insane chaos that MS Office uses
    for its file formats.  LibreOffice is /much/ better at this than MS
    Office is, especially if the file has been modified by a number of
    different MS Office versions.

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how many
    people don't know how to do that.

    It is surprising that anyone would want them to. Why not just install LibreOffice, and have a tool that is better at reading MS Word generated
    files than any version of MS Word ever was?

    Of course, people should not be sending .docx or any other source-format
    file unless they expect you to edit the document - finished documents
    should always be sent in pdf format.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Thu Dec 14 14:41:32 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX. Then I could make reasonable-looking
    documents for customers that insisted on having docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added to
    word processors for decades. Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Are you sure about that? IIRC it was a decade later before
    adobe wasn't required.

    <snip>

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    I ask for PDF's. I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Thu Dec 14 18:05:23 2023
    David Brown wrote:

    On 13/12/2023 23:40, MitchAlsup wrote:


    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make
    reasonable-looking documents for customers that insisted on having
    docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added
    to word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Word 2007, according to Wikipedia, google, and the never-wrong internet community.

    I worked at AMD 1999-2006 and we used save as to *.pdf all the time.
    {This would have been the professional version of WORD/Office.}
    The student version 2003 also has this, I still have the CD-ROM.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Thu Dec 14 14:19:14 2023
    MitchAlsup wrote:
    On 12/13/2023 8:47 AM, EricP wrote:

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    So, have 1 size for conditional and 1 size for unconditional--

    I have 2 branch formats, one for small 16b size offset with 16b opspec,
    and one for medium 32b and large 64b offsets with 32b opspec.

    Later I added a 3rd format for compare and branch when there are
    two variable size immediates, one for offset and one for compare value.
    The offset is the first immediate so it starts in a known buffer location.

    AND THEN, you build a sign extending adder that adds 64+16 and
    produces the correct sign extended 64-bit result. Guys it is not
    1975 anymore !! Why do you think this is a serial process ??

    I didn't say serial.
    I was thinking of starting all 3 offset-size adds (16b, 32b, 64b)
    immediately, before knowing the instruction type or size,
    then using the actual type and size to select the correct result.
    The 64b adds could be further subdivided as 4 * 16b adders then
    combine the size select with 16b carry select to assemble a 64b result.

    Which is why I said I thought this was just a mux delay at the end.

    Also Note:: The address produced by the branch target adder only
    needs to produce enough bits to index the cache {you can sort out
    all the harder stuff later} in the DECODE stage of the pipeline.
    A check for page crossing (because you don't necessarily need to access
    the TLB) finishes the problem.

    In the hypothetical design I have in mind the instruction bytes
    get parsed from fetch buffers, whose job is to hide the pipeline
    latency to I$L1, and also allow prefetch for possible alternate path.
    It also allows local looping and replay out of the fetch buffers.

    In that design the full 64b parse RIP is needed as a tag for
    selecting from the multiple fetch buffers.
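
    A tiny sketch of that arrangement (buffer count, sizes and names are my
    own assumptions, not the actual design): each fetch buffer is tagged with
    the full RIP of its first byte, and the parse stage selects whichever
    buffer - sequential or alternate path - its current parse RIP falls into.

    #include <stdint.h>
    #include <stddef.h>

    #define FETCH_BUFS   4
    #define FETCH_BYTES  32

    typedef struct {
        uint64_t base_rip;             /* tag: RIP of buf[0] */
        uint8_t  buf[FETCH_BYTES];     /* raw instruction bytes from the I$ */
        int      valid;
    } FetchBuffer;

    static const uint8_t *parse_bytes(const FetchBuffer bufs[], uint64_t parse_rip)
    {
        for (int i = 0; i < FETCH_BUFS; i++) {
            uint64_t off = parse_rip - bufs[i].base_rip;   /* wraps if below base */
            if (bufs[i].valid && off < FETCH_BYTES)
                return &bufs[i].buf[off];    /* bytes for the instruction at parse_rip */
        }
        return NULL;                         /* miss: wait for fetch to fill a buffer */
    }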

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Thu Dec 14 21:57:39 2023
    EricP wrote:

    MitchAlsup wrote:
    On 12/13/2023 8:47 AM, EricP wrote:

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    So, have 1 size for conditional and 1 size for unconditional--

    I have 2 branch formats, one for small 16b size offset with 16b opspec,
    and one for medium 32b and large 64b offsets with 32b opspec.

    Later I added a 3rd format for compare and branch when there are
    two variable size immediates, one for offset and one for compare value.
    The offset is the first immediate so it starts in a known buffer location.

    AND THEN, you build a sign extending adder that adds 64+16 and
    produces the correct sign extended 64-bit result. Guys it is not
    1975 anymore !! Why do you think this is a serial process ??

    I didn't say serial.
    I was thinking of starting all 3 offset-size adds (16b, 32b, 64b)
    immediately, before knowing the instruction type or size,
    then using the actual type and size to select the correct result.
    The 64b adds could be further subdivided as 4 * 16b adders then
    combine the size select with 16b carry select to assemble a 64b result.

    Which is why I said I thought this was just a mux delay at the end.

    I am trying to tell you to put that mux in the subsequent cycle.

    A 16-bit address can access a 64KB cache. A 64KB cache is bigger than
    we will be willing to build. So, to access the cache all we need is
    the lower order bits, and all 3 formats are the same here, so we
    add 16 bits and start accessing the cache. Then we flop the carry out
    and in the subsequent cycle we add the bits bigger than 16 while the
    cache is being accessed. And now, at the time when the tag is available,
    so are the rest of the address bits.
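
    A sketch of that split add (my own illustration of the scheme just
    described, assuming a 16-bit branch displacement; the names are made up):
    the low 16 bits come out first and are enough to start indexing the
    cache, the carry is flopped, and the upper bits are finished in the next
    cycle while the SRAM is being read.

    #include <stdint.h>

    typedef struct {
        uint16_t index;   /* low 16 bits of the target: enough to index the cache */
        unsigned carry;   /* carry out of bit 15, flopped for the next cycle */
    } LowAdd;

    static LowAdd add_low(uint64_t rip, int16_t disp)
    {
        unsigned sum = (unsigned)(rip & 0xFFFF) + (uint16_t)disp;
        LowAdd r = { (uint16_t)sum, (sum >> 16) & 1u };
        return r;
    }

    /* Next cycle: finish the upper bits and glue the full target together. */
    static uint64_t add_high(uint64_t rip, int16_t disp, LowAdd lo)
    {
        uint64_t sign = (disp < 0) ? ~(uint64_t)0 : 0;   /* sign extension of disp */
        uint64_t hi   = (rip >> 16) + (sign >> 16) + lo.carry;
        return (hi << 16) | lo.index;
    }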

    Also Note:: The address produced by the branch target adder only
    needs to produce enough bits to index the cache {you can sort out
    all the harder stuff later} in the DECODE stage of the pipeline.
    A check for page crossing (because you don't necessarily need to access
    the TLB) finishes the problem.

    In the hypothetical design I have in mind the instruction bytes
    get parsed from fetch buffers, whose job is to hide the pipeline
    latency to I$L1, and also allow prefetch for possible alternate path.
    It also allows local looping and replay out of the fetch buffers.

    I call this the instruction buffer and hold both sequential and
    alternate path instructions for decode. I can access a whole cache
    line per cycle, so I have little problem feeding the sequential
    path--but instead of fetching whole cache lines, I fetch four ¼
    cache lines and a next fetch predictor. Each I$ access uses a
    7-bit index and a 3-bit set {turning the 4-way cache into direct
    mapped cache} and another 7+bit index to the fetch predictor.

    void accessICache(Fetch fetch)
    {
        static Index index;                 // per-column indices from the previous next-fetch prediction

        for( i = 0; i < SETS; i++ )         // read a ¼ cache line from each column's SRAM
            InstBuf[fetch+i] = column[i].SRAM[index[i]];
        index = FetchPredictor[index[5]];   // the next-fetch predictor supplies the next indices
    }

    In that design the full 64b parse RIP is needed as a tag for
    selecting from the multiple fetch buffers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Sat Dec 16 13:49:26 2023
    David Brown <david.brown@hesbynett.no> schrieb:
    On 10/12/2023 11:39, Thomas Koenig wrote:

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    When you work at a company that prescribes (sort of) a certain
    format, that is one possibility. I did the cover sheet in Word,
    though, and pasted it together as PDF.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to George Neuner on Sat Dec 16 14:20:56 2023
    George Neuner <gneuner2@comcast.net> schrieb:
    On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    [...]

    My personal favorites for parallel programming are PGAS languages
    like (yes, you guessed it) Fortran, where the central data
    structure is the coarray.

    Mileage varies considerably and I don't intend to start a language
    war: a lot has to do with the history of parallel applications a
    person has developed. I can respect your point of view even though I
    don't agree with it.

    My favorite model is CSP (ala Hoare) with no shared memory. Which is
    not to say I don't use threads, but I try to design programs such that threads (mostly) are not sharing writable data structures.

    Which is a sound idea. Inadvertently shared variables are a major
    source of errors in OpenMP, for example.

    You have to make sure that you synchronize before accessing data
    that has been modified on another image.

    And that's where most languages fall down: they provide either
    primitives which are too low level and too hard to use correctly, or
    they provide high level mechanisms that don't scale well and are too
    limiting in actual use.

    I think Fortran has gotten many things right here, at least for the
    domain of scientific computing - the complexity is manageable.

    For those who are interested, I've written a short tutorial about
    Fortran coarrays, which can be found at

    https://github.com/tkoenig1/coarray-tutorial/blob/main/tutorial.md

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Sat Dec 16 16:32:49 2023
    On 16/12/2023 14:49, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 10/12/2023 11:39, Thomas Koenig wrote:

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    When you work at a company that prescribes (sort of) a certain
    format, that is one possibility. I did the cover sheet in Word,
    though, and pasted it together as PDF.

    Ah, your aim was to make the LaTeX documents look like the corporate
    standard, which happened to be made in Word. That makes a lot more
    sense. Typical out-of-the-box Word templates (and LibreOffice, and
    every other word processor I have seen) all look amateur in comparison
    to LaTeX layouts. But company standardisation trumps quality
    typesetting. (And maybe you are one of the lucky people whose company
    standard templates are well designed.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Chris M. Thomasson on Sat Dec 16 16:28:08 2023
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make reasonable-looking documents for customers that insisted on having docx format instead
    of pdf.

    I don't think there has been much exciting or important (to me)
    added to
    word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Are you sure about that?  IIRC it was a decade later before
    adobe wasn't required.

      <snip>

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    I ask for PDF's.   I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    Try to stay far away from windows office docs; they can be filled with interesting macros - well, back in the day! I do remember a lot of print
    to PDF programs. Mock up a printer device, print, produce a file.

    They are only a problem if you use MS Office. LibreOffice, and its predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript. It's not a problem if
    you use a decent pdf viewer, but if you use Adobe Acrobat on Windows,
    you can definitely be at risk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 19 03:14:05 2023
    Thomas Koenig wrote:

    George Neuner <gneuner2@comcast.net> schrieb:
    On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:


    And that's where most languages fall down: they provide either
    primitives which are too low level and too hard to use correctly, or
    they provide high level mechanisms that don't scale well and are too
    limiting in actual use.

    I think Fortran has gotten many things right here, at least for the
    domain of scientific computing - the complexity is manageable.

    For those who are interested, I've written a short tutorial about
    Fortran coarrays, which can be found at

    https://github.com/tkoenig1/coarray-tutorial/blob/main/tutorial.md

    Nicely done.

    Notice how they specified "the what" without specifying "the how".
    Notice that C and C++ atomics do the reverse.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 19 03:29:22 2023
    Paul A. Clayton wrote:

    On 12/6/23 2:54 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    [snip]
    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    We already have data-flow microarchitectures since the mid-1990s, with
    the success of OoO execution. And the "von Neumann" ISAs have proven
    to be a good and long-term stable interface between software and these
    data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
    and their microarchitectures turned out to be uncompetitive.

    I suspect that a superior interface could be designed which
    exploits diverse locality (i.e., data might naturally be closer to
    some computational resources than to others) and communication
    (and storage) costs and budgets (budgets being related to urgency
    and importance).

    Consider the inner loop of DGEMM on a properly resourced GBOoO::

    LD |AGEN |cache|align|reslt|
    LD |AGEN |cache|align|reslt|
    FMUL | ex1 | ex2 | ex3 | ex4 |reslt|
    FADD | ex1 | ex2 | ex3 |reslt|
    ST |AGEN |cache|-----------------------------------------------|align|write|
    LOOP | inc |

    LD |AGEN |cache|align|reslt|
    LD |AGEN |cache|align|reslt|
    FMUL | ex1 | ex2 | ex3 | ex4 |reslt|
    FADD | ex1 | ex2 | ex3 |reslt|
    ST |AGEN |cache|-----------------------------------------------|align|write|
    LOOP | inc |

    LD |AGEN |cache|align|reslt|
    LD |AGEN |cache|align|reslt|
    FMUL | ex1 | ex2 | ex3 | ex4 |reslt|
    FADD | ex1 | ex2 | ex3 |reslt|
    ST |AGEN |cache|-----------------------------------------------|align|write|
    LOOP | inc |

    This is about as much Out-of-order as one gets in a GBOoO machine.
    Andy Glew would say that this is "not that much" out of order.
    Notice that every function unit sees every operation in program
    order and that the only OoO-ness is the latency of calculations.


    I think the original dataflow architectures
    attempted to be very general with significant overhead for
    readiness determination and communication. They also (as I
    understand) lacked value prediction whereas OoO effectively uses
    value prediction for branches.

    The original data-flow architectures overdid their ability to
    discover parallelism. And whereas von Neumann ISAs struggled to
    find parallelism, data-flow architectures found "way too much".
    They found so much parallelism that they got diverted down a
    dark alley for a decade trying to "so manage" the parallelism
    so their queuing structures did not overflow and crash the
    machine. They stumbled upon parallelism in the 10s of millions
    of instructions that could be fired each cycle. Primarily, they
    failed in trying to rein in the parallelism down to the point
    where they could build actual machines.

    They died from finding too much not from finding too little.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Tue Dec 19 03:36:06 2023
    I have made some further minor modifications.

    I changed where, in the opcode space, the supplementary
    memory-reference instructions were located. This allowed
    me to have a few more bits available for them.

    Also, I added a mechanism for a set of instructions
    longer than 32 bits that can be used without
    recourse to a block header of any kind, so that they
    can be slipped into code in the format formerly
    consisting purely of 32-bit instructions. This is very
    inefficient, though, and so the previous format for
    long instructions is also kept.

    But this way, the basic instruction set used from
    unblocked code is open-ended, which I think is important.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Dec 19 07:22:10 2023
    On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:

    I changed where, in the opcode space, the supplementary memory-reference instructions were located. This allowed me to have a few more bits
    available for them.

    I've moved them again, making even more space available... because in
    my last change, I made the mistake of using the opcode space that I
    was already using for block headers. I couldn't reduce the amount of information in a block header by two bits, by using a combination of
    ten bits instead of eight to indicate a block header, so I had to do
    my rearranging in this place instead.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Tue Dec 19 14:30:25 2023
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 12/6/23 2:54 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    [snip]
    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    We already have data-flow microarchitectures since the mid-1990s, with
    the success of OoO execution. And the "von Neumann" ISAs have proven
    to be a good and long-term stable interface between software and these
    data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
    and their microarchitectures turned out to be uncompetitive.

    I suspect that a superior interface could be designed which
    exploits diverse locality (i.e., data might naturally be closer to
    some computational resources than to others) and communication
    (and storage) costs and budgets (budgets being related to urgency
    and importance). I think the original dataflow architectures
    attempted to be very general with significant overhead for
    readiness determination and communication. They also (as I
    understand) lacked value prediction whereas OoO effectively uses
    value prediction for branches.

    I would argue that the Cavium coprocessors are data flow at the
    level envisioned in the 1970s research. The data (network packets)
    are presented (queued) to the appropriate coprocessor as it
    flows through the configured set of coprocessors for the data
    stream.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Thomas Koenig on Tue Dec 19 17:34:34 2023
    On Fri, 10 Nov 2023 22:03:23 +0000, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.

    This would make your ISA very un-S/360-like.

    Well, I felt that _some_ compromises had to be made; otherwise,
    there was no way instructions with base-index addressing _and_
    16-bit displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of
    RISC and CISC in a single ISA.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Dec 19 17:47:25 2023
    On Tue, 19 Dec 2023 07:22:10 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:

    I changed where, in the opcode space, the supplementary
    memory-reference instructions were located. This allowed me to have a
    few more bits available for them.

    I've moved them again, making even more space available... because in my
    last change, I made the mistake of using the opcode space that I was
    already using for block headers. I couldn't reduce the amount of
    information in a block header by two bits, by using a combination of ten
    bits instead of eight to indicate a block header, so I had to do my rearranging in this place instead.

    And now, with what I've learned from this experience, I've made further changes. I've increased the length of the opcode field in the supplementary memory-reference instructions that were moved to be among the other memory-reference instructions, so as to have enough for the different
    sizes of the various types to be supported.

    But in addition, I have now engaged in what some may see as an act of
    pure evil.

    Once again there are supplementary memory-reference instructions among
    the operate instructions as well. *These*, however, provide for the conventional integer and floating-point types, CISC-style memory to
    register operate instructions! So even within the basic 32-bit instruction
    set, although _these_ instructions are highly restricted in register use
    and addressing modes, the pretense of being a load-store architecture
    has been dropped!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Tue Dec 19 17:39:19 2023
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II. (The Motorola 68050 did it
    the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Dec 19 17:56:03 2023
    On Tue, 19 Dec 2023 17:39:19 +0000, Quadibloc wrote:

    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II. (The Motorola 68050 did it the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would create better performance numbers.)

    Oops, the Motorola 68060.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 19 18:29:41 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II.

    This is the first time I have seen that claimed. What makes you think
    so?

    Everything I have read about the Pentium Pro indicates that it has
    complete OoO with speculation and precise exceptions (and neither
    speculation nor precise exceptions would work with FP-only OoO, as
    demonstrated by the Model 91 which has neither and is infamous for its imprecise exceptions).

    (The Motorola 68050 did it
    the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    According to <https://en.wikipedia.org/wiki/Motorola_68000_series#68050_and_68070>,
    there was no 68050. According to <https://en.wikipedia.org/wiki/68060>:

    |The 68060 shares most architectural features with the P5 Pentium. Both
    |have a very similar superscalar in-order dual instruction pipeline
    |configuration

    I.e., no OoO.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Wed Dec 20 01:03:33 2023
    On Tue, 19 Dec 2023 18:29:41 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by
    the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II.

    This is the first time I have seen that claimed. What makes you think
    so?

    Everything I have read about the Pentium Pro indicates that it has
    complete OoO with speculation and precise exceptions (and neither
    speculation nor precise exceptions would work with FP-only OoO, as demonstrated by the Model 91 which has neither and is infamous for its imprecise exceptions).


    Back in 2019-03 I tried to educate John about OoO in Pentium-Pro and
    Pentium-II but failed miserably. Now I understand that I have only myself
    to blame for the failure - I was not sufficiently polite.
    I wish you better luck.


    (The Motorola 68050 did it
    the other way around, only having OoO for integers, and not for FP, figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    According to <https://en.wikipedia.org/wiki/Motorola_68000_series#68050_and_68070>,
    there was no 68050. According to
    <https://en.wikipedia.org/wiki/68060>.

    |The 68060 shares most architectural features with the P5 Pentium. Both
    |have a very similar superscalar in-order dual instruction pipeline
    |configuration

    I.e., no OoO.

    - anton

    Maybe he was thinking about the MPC740/750? That was one of the more
    successful PowerPC cores. Even after withdrawal from the personal
    computing market it lived for many more years as the Freescale e600 core.
    Its integer side can be described as 'barely OoO', but OoO nevertheless.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Wed Dec 20 01:12:43 2023
    On Tue, 19 Dec 2023 17:19:24 -0600, BGB wrote:
    On 12/19/2023 11:34 AM, Quadibloc wrote:

    Well, I felt that _some_ compromises had to be made; otherwise, there
    was no way instructions with base-index addressing _and_ 16-bit
    displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of RISC
    and CISC in a single ISA.

    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely
    overkill;
    Global variables, where for the general case it is almost entirely insufficient.

    Yes, but does that mean that 16-bit displacements are a bad idea?

    The Motorola 68000, the 8086, the PowerPC, and lots of other architectures
    all had them.

    So: what are 16-bit displacements _for_? *Local* variables, of course. Allocate
    one base register to the start of the data area for a program, and another
    base register to the start of the program area for a program, and you're
    done.

    The architecture also provides _one_ base register that works with 15-bit displacements. This allows instructions to have a smaller format if that
    base register is used.

    And then there's another seven registers allocated as base registers that
    work with 12-bit displacements. If you want to save the base registers with 16-bit displacements, then you can use those.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Dec 20 00:17:42 2023
    BGB wrote:

    On 12/19/2023 11:34 AM, Quadibloc wrote:
    On Fri, 10 Nov 2023 22:03:23 +0000, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are
    general-purpose registers.

    This would make your ISA very un-S/360-like.

    Well, I felt that _some_ compromises had to be made; otherwise,
    there was no way instructions with base-index addressing _and_
    16-bit displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of
    RISC and CISC in a single ISA.


    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely overkill; Global variables, where for the general case it is almost entirely insufficient.

    EMBench is filled with stack frames illustrating RISC-V's 12-bit
    immediates are not big enough.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Dec 20 01:46:29 2023
    On Wed, 20 Dec 2023 01:12:43 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 17:19:24 -0600, BGB wrote:
    On 12/19/2023 11:34 AM, Quadibloc wrote:

    Well, I felt that _some_ compromises had to be made; otherwise, there
    was no way instructions with base-index addressing _and_ 16-bit
    displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of RISC
    and CISC in a single ISA.

    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely
    overkill;
    Global variables, where for the general case it is almost entirely
    insufficient.

    Yes, but does that mean that 16-bit displacements are a bad idea?

    The Motorola 68000, the 8086, the PowerPC, and lots of other
    architectures all had them.

    So: what are 16-bit displacements _for_? *Local* variables, of course.
    Allocate one base register to the start of the data area for a program,
    and another base register to the start of the program area for a
    program, and you're done.

    The architecture also provides _one_ base register that works with
    15-bit displacements. This allows instructions to have a smaller format
    if that base register is used.

    And then there's another seven registers allocated as base registers
    that work with 12-bit displacements. If you want to save the base
    registers with 16-bit displacements, then you can use those.

    And I forgot to mention: there are _another_ seven registers allocated
    as base registers that work with 20-bit displacements. The instructions
    using them, though, are all longer than 32 bits. I did not include this
    feature because I thought there was a need for it, but because it had
    been added in z/Architecture; so it's there for ease in translating
    programs over.

    As the architecture provides for instructions longer than 32 bits, I
    could indeed add instructions which contained a full 64-bit address,
    or instructions with 32-bit displacements. The first of those two
    possibilities certainly is a simple way to deal with, in a pinch,
    a single external variable without tying up one base register just
    for it.

    I haven't made a place for that feature yet, but one thing I do have
    is Array Mode. So if a program has a lot of large arrays, such that they
    don't all fit into the 64K that one base register can cover, instead
    of using several registers to cover those arrays, one base register
    points to a table of array addresses - and the displacement picks
    out the array address, and the index register contents are added to
    that address to find the pointer to the operand. It's basically a form
    of post-indexed indirect addressing.
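
    A minimal sketch in C of the effective-address calculation Array Mode
    implies; the 8-byte table entries and the function name are assumptions
    for illustration, not part of any published encoding:

        #include <stdint.h>

        /* Array Mode: the base register points at a table of array
           addresses, the displacement selects one table entry, and the
           index register is added to the fetched array address. */
        uint64_t array_mode_ea(const uint64_t *table,  /* base register: table of array addresses */
                               unsigned disp,          /* displacement: selects a table entry */
                               uint64_t index)         /* index register contents */
        {
            uint64_t array_addr = table[disp];   /* one level of indirection */
            return array_addr + index;           /* index applied after the fetch */
        }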

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Dec 20 03:30:34 2023
    BGB wrote:

    On 12/19/2023 6:17 PM, MitchAlsup wrote:
    BGB wrote:

    On 12/19/2023 11:34 AM, Quadibloc wrote:
    On Fri, 10 Nov 2023 22:03:23 +0000, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures:  (Almost) all registers are
    general-purpose registers.

    This would make your ISA very un-S/360-like.

    Well, I felt that _some_ compromises had to be made; otherwise,
    there was no way instructions with base-index addressing _and_
    16-bit displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of
    RISC and CISC in a single ISA.


    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely overkill;
    Global variables, where for the general case it is almost entirely
    insufficient.

    EMBench is filled with stack frames illustrating RISC-V's 12-bit
    immediates are not big enough.

    As mentioned before, if you scale the displacement here, it is like it
    is 3 bits bigger.

    RISC-V is a weak case here because:
    The displacements are unscaled;
    The displacements are signed.

    And yet, it has sucked all the oxygen out of the room..........

    For stack frames, this effectively loses 4 bits, so RISC-V's 12-bit displacement is roughly equivalent to 8 bits in my scheme...

    Well, combined with the issue that exceeding the +/- 2K limit in RISC-V
    sucks (there is no low-cost fallback strategy).

    Universal constants solve that RISC-V's problem.

    But, generally, not many stack frames seem to have much issue with the current 4K limit.

    Granted, I just ran into a watched benchmark that makes RISC-V look
    less optimal than those sucking the oxygen out of the room.

    If it were more of an issue, could potentially add a few ops to extend
    the limit to around 32K in XG2 mode. Say:
    MOV.Q (SP, Disp12u*8), Rn
    MOV.Q Rn, (SP, Disp12u*8)
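
    A minimal sketch of the effective address such an op would imply,
    assuming the 12-bit field is unsigned and scaled by the 8-byte operand
    size (4096 slots of 8 bytes is the ~32K reach mentioned above); the
    function name is illustrative only:

        #include <stdint.h>

        /* Unsigned 12-bit displacement, scaled by 8: reach above SP is
           4096 * 8 = 32768 bytes. */
        uint64_t movq_sp_ea(uint64_t sp, uint32_t disp12u)
        {
            return sp + ((uint64_t)(disp12u & 0xFFFu) << 3);
        }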

    You still need 64-bit displacements for when we have Atta Byte address spaces.......

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Wed Dec 20 04:38:47 2023
    On Tue, 19 Dec 2023 20:07:09 -0600, BGB wrote:
    On 12/19/2023 7:12 PM, Quadibloc wrote:

    Yes, but does that mean that 16-bit displacements are a bad idea?

    It is, if they end up hurting the encoding in some other way, like
    making the register fields smaller or eating too much of the opcode
    space.

    In some previous iterations of the Concertina II architecture, I
    did follow the SEL 32 architecture, in having instructions that
    only accessed aligned locations in memory. Then I used the bits
    at the end of the address to indicate the length of the operand
    in a way similar to what the SEL 32 did. This is sort of like
    shortening the address by scaling it.

    But I ended up not having to do that. I still had enough room
    for load-store memory-reference instructions. I did lose
    opcode space, because now I couldn't have 16-bit instructions.

    But my 16-bit instructions themselves had a restriction on
    register use that was bad; so I replaced them with 17-bit
    instructions (they can be used, but with an overhead of one
    32-bit header word in a block).

    Sometimes, I also considered replacing 16-bit displacements by
    15-bit displacements, but those designs also ended up not going
    anywhere.

    But then, I've been able to consider a wide variety of design
    alternatives in Concertina II precisely because having the block
    format means I potentially have as much opcode space available to
    me as I want. Spend four bits per block, and get 36-bit instructions
    instead of 32-bit instructions, for example.

    One major goal - not one that I've discussed much - is to make
    Concertina II look a lot like a conventional RISC architecture.

    Of course it does plenty of things that no conventional RISC
    architecture does, but except for the fact that I can only use
    seven (instead of 31) of the 32 integer registers as index
    registers (or, rather, as base registers, since when your
    displacement is too short to cover memory, you need a base
    register *first*)... its instruction set is basically a
    _superset_ of a conventional RISC architecture.

    You've got load-store memory-reference instructions, with
    16-bit displacements, like typical microprocessors (and unlike
    the System/360, which got along just fine with just 12 bits).

    You've got three-address register to register operate
    instructions - with a C bit to turn on or off affecting the
    condition codes.

    Just like some typical RISC designs!

    But then the block structure lets you do things never seen
    except in VLIW designs (instruction predication, explicitly
    indicating certain instructions may execute in parallel).

    And the instruction set also goes into CISC territory. The
    block structure makes it *obvious*, even to an idiot, that
    one can process the header, and then locate instructions to
    be decoded without going through the whole block serially.

    Mitch has pointed out that a simple length prefix
    scheme _can_ be decoded really quickly, but I want even
    implementors who aren't as smart as Mitch to be able to
    implement Concertina II properly so it runs fast. Length
    prefixes invite people to decode them serially.

    Concertina II seems almost "architecture-agnostic", as it
    doesn't know if it's RISC, CISC, or VLIW. But in thinking
    about it, my intention is perhaps this: to take a CISC
    instruction set, but use RISC and VLIW packaging on it,
    so that the performance advantages of RISC and VLIW can
    be given to a CISC instruction set.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Dec 20 09:27:09 2023
    On Wed, 20 Dec 2023 03:30:34 +0000, MitchAlsup wrote:
    BGB wrote:

    RISC-V is a weak case here because:
    The displacements are unscaled;
    The displacements are signed.

    And yet, it has sucked all the oxygen out of the room..........

    I know that I felt the 68000 using signed displacements was
    a major weakness of the architecture, and on the Macintosh it did
    lead to segments being half as large as they could have otherwise
    been.

    Granted, I just ran into a watched benchmark that makes RISC-V look
    less optimal than those sucking the oxygen out of the room.

    Ah.

    So x86/x86-64 and ARM are the ones _really_ sucking most of the oxygen
    out of the room... RISC-V is just sucking what little they've left
    behind out of the room.

    As Concertina II is unlikely to please anyone but myself, I wish
    your MY 66000 all the luck in the world in overcoming this problem.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Thu Dec 21 07:12:41 2023
    Now I've simplified the format of the Composed Instructions, which
    allow instructions longer than 32 bits to appear in code without
    block headers.

    This freed up just enough opcode space that I could just barely
    add back into the instruction set a header format for reserving part
    of a block for pseudo-immediates with essentially zero overhead.

    I felt this feature was needed to make immediate values feel like
    a real part of the instruction set; if they always required a full
    32-bit header as overhead, there would be reluctance to use them.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Chris M. Thomasson on Thu Dec 21 08:58:10 2023
    On 21/12/2023 04:00, Chris M. Thomasson wrote:
    On 12/16/2023 7:28 AM, David Brown wrote:
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:

    I ask for PDF's.   I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    Try to stay far away from windows office docs, they can be filled
    with interesting macros, well back in the day! I do remember a lot of
    print to PDF programs. Mock up a printer device, print, produce a file.

    They are only a problem if you use MS Office.  LibreOffice, and its
    predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript.

    Indeed!


    It's not a problem if you use a decent pdf viewer, but if you use
    Adobe Acrobat on Windows, you can definitely be at risk.


    Well, just make sure the PDF reader has javascript turned off all
    around. Trust in it.

    "Trust in it" ?

    Some readers /are/ trustworthy. Adobe's are not - Acrobat reader has
    endless lists of security holes. I haven't had it installed on a PC for
    many years, so things may have changed, but in comparison to any other
    reader it was huge, slow, and required continuous upgrading to deal with vulnerabilities, requiring a reboot of Windows each time. Horrible
    software.

    On Linux, common readers like evince don't support javascript - you can
    trust them!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Quadibloc on Thu Dec 21 13:21:45 2023
    Quadibloc wrote:
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the
    Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II. (The Motorola 68060 did it

    Huh???

    I'm sure Andy Glew would disagree re the PPro!

    Terje
    the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    John Savard



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Thu Dec 21 14:51:45 2023
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 12/16/2023 7:28 AM, David Brown wrote:
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some >>>>>>> trouble to make them appear visually like the Word template du jour >>>>>>> (but the formulas gave it away, they looked to nice for Word).


    What a strange thing to do - that sounds completely backwards to me!
    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make
    reasonable-looking
    documents for customers that insisted on having docx format instead
    of pdf.

    I don't think there has been much exciting or important (to me)
    added to
    word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Are you sure about that?  IIRC it was a decade later before
    adobe wasn't required.

      <snip>

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    I ask for PDF's.   I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    Try to stay far away from windows office docs, they can be filled with
    interesting macros, well back in the day! I do remember a lot of print
    to PDF programs. Mock up a printer device, print, produce a file.

    They are only a problem if you use MS Office.  LibreOffice, and its
    predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript.

    Indeed!

    Although my PDF reader ignores links and Javascript (xpdf),
    and I've yet to encounter a PDF that xpdf cannot read.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Thu Dec 21 14:52:32 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 21/12/2023 04:00, Chris M. Thomasson wrote:
    On 12/16/2023 7:28 AM, David Brown wrote:
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:

    I ask for PDF's.   I have no ability to read windows office formats >>>>> of any type without using star/open/libre office, and I detest WYSIWYG >>>>> word processors of all stripes.

    Try to stay far away from windows office docs, they can be filled
    with interesting macros, well back in the day! I do remember a lot of
    print to PDF programs. Mock up a printer device, print, produce a file.
    They are only a problem if you use MS Office.  LibreOffice, and its
    predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript.

    Indeed!


    It's not a problem if you use a decent pdf viewer, but if you use
    Adobe Acrobat on Windows, you can definitely be at risk.


    Well, just make sure the PDF reader has javascript turned off all
    around. Trust in it.

    "Trust in it" ?

    Some readers /are/ trustworthy. Adobe's are not - Acrobat reader has
    endless lists of security holes. I haven't had it installed on a PC for
    many years, so things may have changed, but in comparison to any other
    reader it was huge, slow, and required continuous upgrading to deal with
    vulnerabilities, requiring a reboot of Windows each time. Horrible
    software.

    On Linux, common readers like evince don't support javascript - you can
    trust them!

    Although the evince UI is crap. I prefer xpdf.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Terje Mathisen on Thu Dec 21 16:25:59 2023
    On Thu, 21 Dec 2023 13:21:45 +0100, Terje Mathisen wrote:

    Huh???

    I'm sure Andy Glew would disagree re the PPro!

    I distinctly remember reading somewhere about the Pentium Pro, II, and
    the 68060, but Wikipedia doesn't back me up, so it's entirely possible
    that the one place where I read this - which I can't identify, not
    remembering what it was - was in error. Since this was the same as the
    360/91, naturally it was memorable to me, so I remembered that, and forgot anything contradicting it I might have read elsewhere.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Quadibloc on Mon Jan 1 12:07:00 2024
    Quadibloc <quadibloc@servername.invalid> writes:

    On Sun, 10 Dec 2023 10:51:03 -0800, Tim Rentsch wrote:

    Of course the PDP-8 is a RISC. These properties may have been common
    among some RISC processors, but they don't define what RISC is. RISC is
    a design philosophy, not any particular set of architectural features.

    I can't agree.

    Your final sentence may be true enough, but I think that the architectural feature of being a load-store architecture is very much indicative of
    whether the RISC design philosophy was being followed. Of course, it isn't absolutely _decisive_, as Concertina II demonstrates.

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is
    that the end result is consistent with that philosophy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Mon Jan 1 12:11:45 2024
    mitchalsup@aol.com (MitchAlsup) writes:

    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    So what we can take from this is that RISC as a term has become meaningless.

    The term isn't meaningless. You yourself in another posting
    quoted the definitional property, and all I'm saying is that
    the PDP-8 is consistent with that original description.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Tim Rentsch on Tue Jan 2 04:03:30 2024
    On Mon, 01 Jan 2024 12:07:00 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is that the
    end result is consistent with that philosophy.

    It is true that the PDP-8 had a small and simple instruction set.

    Is it a load-store machine? Does it attempt to minimize
    communications with memory by having a large register file?

    Unfortunately, the designers of PDP-8 were working too soon
    to know that these things, and not just a small and simple
    instruction set, would be defining characteristics of RISC.

    But, hey, all the PDP-8's instructions were one 12-bit word
    long, so they got one thing right!

    Not only isn't the PDP-8 RISC, neither is the IBM 704 nor
    the SDS/Xerox Sigma series of computers (or the SDS 930,
    for that matter).

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood
    to be.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 20:41:07 2024
    Paul A. Clayton wrote:

    [This is long and much less organized than I wished, but I feel
    rushed to get this written while I have a day off.]

    On 11/24/23 9:49 PM, BGB wrote:
    On 11/24/2023 12:24 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]
    I suspect you could write a multi-volume treatise on x86 about
    hardware-software interface design and management (including the
    social and economic considerations of project/product
    management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    It would be more beneficial to the world just to build an
    architecture
    without any of those flaws--just to show them how it's done.

    I thought My 66000 was very close to being completed (with
    refinements coming slowly and generally being compatible at the
    software level). Yes, there is lots of work getting the proof-of-
    concept more publicly recognized and lots of work exploring
    details of various implementations.

    That effort fell apart in August. I am waiting for market conditions,
    and a potential customer before restarting that effort.

    (Sadly, even if open source high-quality HDL implementations for a
    variety of interesting design points were published, My 66000
    seems unlikely to get much more adoption than Open RISC. Prophets
    have to speak, but people seem at least as likely to kill a
    prophet as to accept the prophet's message.)

    I wish there were world enough and time for everyone (especially
    experts) to publish their experience and wisdom and everyone to
    interact with that wisdom, but I can intellectually (if not
    emotionally) recognize that recording history is often not as
    critical as making history.

    That is the way of the world.

    People can probably debate what is ideal.

    Certainly. Yet there are different degrees of expertise. I believe
    I am more qualified to critique an ISA than even most computer
    programmers. Mitch Alsup (who has designed hardware for at least
    four ISAs — SPARC, x86, Motorola 88k, and an unspecified GPU
    architecture as well as done compiler and other related work) is
    more qualified to critique an ISA than most professional computer
    architects.

    There can also be different goals. Critiquing an ISA independent
    of its goals is unjust (except as a warning about goal
    constraints), but changing the goals to blunt criticism is no better.

    An ISA designed for teaching and research (the initial purpose of
    RISC-V) is unlikely to be excellent for general-purpose designs.
    Features which are elegant tend to be difficult to appreciate for
    new students; elegance involves complexity and synergy, which is
    more information even though it compresses nicely when the entire
    context is known.

    The Mona Lisa was not Leonardo's first painting.
    Nor did he finish it in a semester.....

    There seem to be people around who see RISC-V as the model of
    perfection.

    I would _like_ to think that all such people are noobs (or people
    who use "perfection" rather loosely). I doubt even Mitch Alsup
    considers My 66000 the model of perfection in ISA design, "merely"
    a model of unusual excellence superior to all other published ISAs
    for general purpose computing.

    The problem is that they do not see themselves as noobs.
    {I know, I did not feel like a noob when designing 88K}

    I tend to agree with Mitch though I am still skeptical about VVM
    and slightly skeptical about ESM. I trust a hardware designer to
    know that VVM is implementable with equivalent Power-Performance-
    Area even when I cannot see how, but I am not certain it addresses
    all the use cases of SIMD, specifically in-register blocking and isolated/limited SIMD use.

    There are certain things SIMD can do that VVM cannot, there are
    things VVM can do that SIMD has lots of trouble with. The biggest
    difference is SIMD can perform super linearly without being in a
    loop; on the other hand, VVM can change the width in calculations
    {byte operands, halfword results, or word operands, bytes out}--
    SIMD has problems here.
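
    A C loop of the width-changing kind described above (byte operands in,
    wider results out), which maps awkwardly onto fixed-width SIMD lanes
    but is just an ordinary loop to a loop-based scheme; the function is
    only an illustration:

        #include <stddef.h>
        #include <stdint.h>

        /* Byte operands in, halfword results out: each element widens, so
           a fixed-width SIMD register yields half as many results per
           operation as it has input lanes, while a loop-based scheme can
           simply run the scalar loop at whatever width the hardware picks. */
        void widen_add(const uint8_t *a, const uint8_t *b,
                       uint16_t *out, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                out[i] = (uint16_t)a[i] + (uint16_t)b[i];
        }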

    For ESM, I am not confident that idiom
    recognition will be cheap enough to avoid the need for special
    atomics (again I do basically trust a hardware designer's
    expertise) and I disagree mildly about the capacity guarantees. (I
    also disagree about the importance of reserving opcodes common for
    data as perpetually undefined to add a barrier to executing data.
    Since those opcodes could be reclaimed later if somehow opcode
    space became scarce, this is a rather trivial objection.)

    I don't see Idiom recognition in ESM. What I see is the C/C++ atomics
    have direct compiler sequences expressing ESM semantics. Programmer
    uses the language "intrinsics", compiler spits out ESM code, hardware
    sequences ESM code within the definition of ATOMICity.

    There might be a few areas where I think AArch64 may benefit from
    being less abstracted from the implementation. Load register pair
    seems a nice feature; My 66000 could provide such (and more) with
    idiom recognition (two or more loads or stores using the same base
    address register and slightly different offsets could avoid
    multiple accesses in many cases).
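
    A small C illustration of the pattern being described: two loads off
    the same base register at adjacent offsets, which idiom recognition
    could satisfy with a single wider access (the struct is just an
    example):

        #include <stdint.h>

        struct pair { int64_t lo, hi; };   /* two adjacent 8-byte fields */

        /* Compiles to two loads from the same base register at offsets 0
           and 8; a core recognizing the idiom could merge them into one
           16-byte access. */
        int64_t sum_pair(const struct pair *p)
        {
            return p->lo + p->hi;
        }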

    My 66000 has LDM (load multiple) and STM, but these are seldom used.
    What is used often is ENTER and EXIT as these provide prologue and
    epilogue sequences for non-leaf subroutines. Not just storing/loading
    registers to/from the stack, but dealing with SP and <optional> FP manipulations.

    I do not have a good sense of when idiom recognition should be
    preferred over "explicit" encoding. Both introduce complexity in
    hardware and compilers. For idiom recognition, an optimizing
    compiler adds another consideration for scheduling code and
    sometimes choosing whether to do more conceptual work that is
    faster (and sometimes less actual work by the hardware) with
    uncertainty about the performance impact for different
    implementations; high-performance hardware becomes more complex in
    having to recognize the idiom and convert it to the
    microarchitecture's functional support. Idiom recognition also
    has a code density cost. However, simple but complete
    implementations are simpler (and not subsetted) than for explicit instructions, some idioms appear without explicit compiler
    intention often enough to justify special handling, and handling
    such in microarchitecture reduces the interface complexity.

    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instructions pass through the pipeline
    as if they were a single instruction, from a single DECODE cycle
    through a single WRITE cycle).

    And one case of register write elision:
    Calk Rd,--,-- ; Calk Rd,--,--
    Which frees up the write port of the register file for latent STs.
    Write elision is determined in the WAIT stage of the pipeline just
    before WRITE stage.

    For explicit instructions, a compiler need not use them (in which
    case they are useless frills wasting opcode, decoder, and backend
    resources) or even know they exist,

    so why have them ??

    but an optimizing compiler
    would have to try to recognize idioms to convert to special
    instructions. Complete hardware (compatibility issues) must pay
    the costs to implement the special instructions. Minor variations
    of special instructions that are discovered to be common or useful
    require hardware idiom-recognition to convert _and_ compilers are
    unlikely to have made any effort to facilitate idiom recognition
    and it is more difficult to justify the hardware effort for less
    common cases.

    As with constants, it is easier for the compiler just to spit out
    reasonable code and have HW treat 2-instructions as 1. Since the
    architecture is already multiple words per cycle even in a 1-wide
    machine, idiom recognition is just a few more patterns.

    Some of the AArch64 conditional instructions seem clever in
    exploiting the number of variable operands. My 66000's PRED
    provides **MUCH** more flexibility, though at the cost of hardware
    complexity and code density.

    The thing about PRED is that it transfers control without disrupting
    FETCH !! It is worth putting in for this reason alone, even if only
    20% of conditional branches use it.
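
    For context, the kind of short conditional that predication can cover
    straight-line, sketched in C; how any particular compiler maps it onto
    PRED is not claimed here:

        #include <stdint.h>

        /* A short if/else like this can be fetched straight through: the
           predicate selects which arm's instructions take effect, so FETCH
           never has to redirect for the condition. */
        int64_t select_scaled(int64_t x)
        {
            int64_t r;
            if (x < 0)
                r = -x;        /* then-arm */
            else
                r = 2 * x;     /* else-arm */
            return r;
        }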

    I agree with Mitch Alsup that having to paste constants together
    in software (or load them as if variable data) is suboptimal
    generally. (There may be some cases where the importance of static
    text size [or working set] justifies the extra effort of a level
    of indirection, but such would generally seem to be a performance
    loser.)

    Suboptimal is a vast understatement !!

    I disagree, where some things seem to be corner cutting in areas
    where doing so is a foot gun, and other areas being needlessly
    expensive (and some things in the reaches of "extensions land"
    being just kinda absurd).
    In some ways, it is (as I see it) better to define some things and
    leave them as optional, rather than define little, and leave
    everyone else to make an incoherent mess of things.

    One of the benefits of such is being able to approach elegance;
    nonce extensions have difficulty appropriating synergy.

    I do not really understand the hostility to subsetting.

    I do not mind subsetting at the implementation level.
    I do mind subsetting at the architectural level.

    RISC-V chose to do the contrapositive--define as little as one can
    get away with and let everyone invent their own additions--you end
    up with a mishmash of additions that fit no overall pattern and have
    no <well> elegance.

    Then again, likely there is disagreements as to what sorts of
    features seem meaningful, wasteful, or needless extravagance.

    This is as it should be. Special purpose or experimental features
    should be viewed as "wasteful" when the target of those features
    is not shared. The contention also concerns the limited space for standardized extensions within a single encoding space.

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If
    you dropped the compressed instructions, I can fit the entire
    My 66000 ISA into the vacated space.....

    Standardized extensions can avoid redundant effort and some
    incompatibility, but without modes to break-up the encoding space
    the more extensions means less free encoding space.

    Extensions need to be added in such a way that the extension remains
    compatible with many unstated things WRT ISA encoding. My 66000 ISA
    encoding has a property that a 40-gate logic block (4-gates of delay)
    can parse instruction boundaries and create unary pointers into IB
    for the next instruction and the constants. You can't just willy
    nilly add instructions without knowing about this decoding logic
    block.

    This also introduces the argument about extensions, coprocessors,
    and accelerators. Accelerators are obviously least tied to the ISA
    interface, but changing an accelerator can be effectively as
    incompatible as an ISA change. (Of course, microarchitecture
    changes can break software performance.)

    RISC-V's early encoding choices were probably quite suitable for a
    teaching and research RISC ISA. Research would emphasize easy
    extensibility for isolated efforts (VLE and lots of unassigned
    opcode space facilitates such). Compatibility is something of an anti-consideration; researchers should be free to add any
    functionality they wish without consideration to an "ecosystem".

    Agreed

    The commercial interest in open source implementations and even
    just license-free ISA use changed the goals. This interest
    expanded such that people were considering the possibility of
    competing with ARM not just in the microcontroller area but more
    generally.

    We shall see.

    Expanded interest also exposed weakness in organization.
    Commercial interests wanted closed-door meetings, open systems
    people wanted public information. The "prestige" of a _standard_
    extension motivates standardizing more localized extensions, the
    limited extension space motivates rushing to stake claims, the
    increased value of the opcode space encourages conflict.

    And people wonder why I am doing all My 66000 architecture and
    µArchitecture by myself. {{Like Leonardo asking a local painter
    to finish the Mona Lisa.}}

    Some
    think idiom recognition is so cheap that the bar for new
    instructions should be high, some think the flexibility of RISC-V
    encoding should make the bar low. Some think only "simple"
    instructions should be provided, some think complex instructions
    can easily be justified. The founders seem to have been,
    understandably, unprepared to handle the volume of conflict
    resolution involved.

    In my work, code path length (roughly number of instructions) and
    the frequency of operation govern performance. My 66000 has a path
    length similar to VAX and a pipelineability similar to MIPS. RISC-V
    has a path length == MIPS and pipelineability == MIPS. Thus, My 66000
    needs only 70% the number of instructions RISC-V needs.

    Granted, it does seem like x86 probably needs to be retired at
    some point...

    Nah. Intel has already proposed expanding the register count to 32
    and possibly simplifying some of the architecture (mostly system-
    level aspects, I think).

    Toss the various descriptor tables.

    Adding yet another encoding that retained the architectural
    features is another possibility, but I doubt Intel/AMD would move
    to such an encoding. The value of x86 is primarily legacy
    software. Providing a cleaner encoding hints that legacy software
    support might be dropped ("why add a radically different encoding
    if there is not the intent to drop support for the legacy
    encoding?"). That fear would reduce the value of legacy binary
    support, increasing the relative attractiveness of ARM or other
    alternatives.

    I do not see any hope for ISA excellence.

    Somedays I agree with you.

    But, realistically, what is a retired computer architect to do ??
    Take up gardening ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Wed Jan 3 02:10:27 2024
    On Fri, 24 Nov 2023 20:49:57 -0600, BGB wrote:

    Granted, it does seem like x86 probably needs to be retired at some
    point...

    While in a certain sense, this is an undoubtedly true statement,
    my initial reaction to it was of the ROTFL nature.

    There's so much software out there that is distributed only in
    binary form that runs only on x86 that retiring the x86 by fiat
    while it's still so actively in use just won't happen, no matter
    how bad it may be.

    This is why I miss the 680x0 architecture so much. If that were
    still out there as an active competitor to x86, then because this
    alternative _is_ something better... even if not _radically_
    better, since it's still CISC, not something really different like
    RISC... then I could envisage, over time, the market for it gradually
    growing while that for the x86 gradually shrinks.

    At the present time, RISC-V and ARM are the contenders. Microsoft has
    a version of Windows that runs on ARM. Apple now uses ARM processors
    in its current Macintosh computers, and is claiming that their
    performance is superior to x86 processors.

    Right now, though, there's no real motive for people to go from x86
    to ARM.

    In time, something surely will happen to change matters, and new
    computer architectures will rise up to prominence. Right now, though,
    signs of movement away from x86 to something else are few.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Wed Jan 3 02:47:11 2024
    Quadibloc wrote:

    On Fri, 24 Nov 2023 20:49:57 -0600, BGB wrote:

    Granted, it does seem like x86 probably needs to be retired at some
    point...

    While in a certain sense, this is an undoubtedly true statement,
    my initial reaction to it was of the ROTFL nature.

    There's so much software out there that is distributed only in
    binary form that runs only on x86 that retiring the x86 by fiat
    while it's still so actively in use just won't happen, no matter
    how bad it may be.

    This is why I miss the 680x0 architecture so much. If that were
    still out there as an active competitor to x86, then because this
    alternative _is_ something better... even if not _radically_
    better, since it's still CISC, not something really different like
    RISC... then I could envisage, over time, the market for it gradually
    growing while that for the x86 gradually shrinks.

    At the present time, RISC-V and ARM are the contenders. Microsoft has
    a version of Windows that runs on ARM. Apple now uses ARM processors
    in its current Macintosh computers, and is claiming that their
    performance is superior to x86 processors.

    Technically, Apple uses its own processors under ISA license from ARM.

    Right now, though, there's no real motive for people to go from x86
    to ARM.

    You know, as much as I hate Intel and x86, I hate Apple even more.

    In time, something surely will happen to change matters, and new
    computer architectures will rise up to prominence. Right now, though,
    signs of movement away from x86 to something else are few.

    The movement is towards mobile {cell phones and tablets} and away
    from desktops. Thus, there are more ARM cores sold per year than x86s.
    But (crap) you cannot do large engineering on tablets or cells.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Wed Jan 3 07:29:52 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    There's so much software out there that is distributed only in
    binary form that runs only on x86 that retiring the x86 by fiat
    while it's still so actively in use just won't happen, no matter
    how bad it may be.

    Well, Apple succeeded, for the computers they produce.

    Still, keep that thought in mind.

    This is why I miss the 680x0 architecture so much. If that were
    still out there as an active competitor to x86, then because this
    alternative _is_ something better...

    The 68000 is worse than IA-32, because it does not have
    general-purpose registers, while IA-32 does. And the 68000 then grew
    baroque extensions in the 68020, at a time when the rest of the world
    already knew that such things are more hindrance than help. And the
    hindrance showed, when the 68040 and 68060 took longer than Intel's counterparts, and much longer than the competing RISCs: The two-wide
    50MHz 68060 appeared in the same year as the 4-wide 266MHz 21164.

    even if not _radically_
    better, since it's still CISC, not something really different like
    RISC... then I could envisage, over time, the market for it gradually
    growing while that for the x86 gradually shrinks.

    It did not happen in the 1980s when 68000 was strong, why should it
    happen later? People even did not switch away from 8086/IA-32 when
    RISCs outperformed them by a lot, because, as you write above, it's
    about the software distributed in binary form. The users don't care
    whether the binary contains IA-32/AMD64 instructions, 68000, PowerPC,
    ARM A64, RV64GC, IA-64 or whatever else.

    A software ecosystem with a single controlling instance like the MacOS ecosystem can switch architectures, one without won't. The only
    opportunities for retiring AMD64 are if PCs are replaced by something
    else (mobile phones and tablets settled on ARM A32/T32 and A64), or
    when the address width becomes insufficient; Intel tried to use the
    latter event to replace IA-32 with IA-64, but failed (probably mostly
    because the necessary IA-32 emulation was not fast enough). We will
    see whether the address space of AMD64 ever becomes too small for the
    mass market.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Paul A. Clayton on Wed Jan 3 14:38:45 2024
    On Mon, 01 Jan 2024 15:12:40 -0500, Paul A. Clayton wrote:
    On 11/24/23 9:49 PM, BGB wrote:
    On 11/24/2023 12:24 PM, MitchAlsup wrote:

    There seem to be people around who see RISC-V as the model of
    perfection.

    I would _like_ to think that all such people are noobs (or people who
    use "perfection" rather loosely).

    Given that David Patterson, one of the designers of MIPS, was on the
    RISC-V design team, though, I can quite understand if many people
    expect the RISC-V design to be a paragon of excellence - even before
    they had looked at it.

    I doubt even Mitch Alsup considers My
    66000 the model of perfection in ISA design, "merely"
    a model of unusual excellence superior to all other published ISAs for general purpose computing.

    This makes it sound as though he lacks modesty, but actually, that no
    doubt _is_ a factual categorization of what MY 66000 is.

    I do not see any hope for ISA excellence.

    Why? MY 66000 exists, and it is excellent.

    If you mean no hope in it taking over the market... yes, I think that
    x86 and ARM will dominate for a considerable time to come. Unlike x86,
    though, I would assume that ARM is at least passable; as a commercial
    RISC, it isn't as lacking in code density as RISC-V.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Wed Jan 3 14:26:52 2024
    On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have general-purpose registers, while IA-32 does. And the 68000 then grew baroque extensions
    in the 68020, at a time when the rest of the world already knew that
    such things are more hindrance than help.

    It is true that in addition to eight general registers, the 68000 also
    had eight _address_ registers. But in the addressing modes that used
    address registers, there was a bit to use a general register instead,
    so I don't think one can say that the 68000 didn't have general registers.

    Since base registers don't have their values changed frequently, having
    an additional register bank for base registers increases the supply of registers for all other purposes, so I don't think that's such a bad idea.
    I did the same thing in my original Concertina design.

    As for the 68020: with the 68000, the only address mode that let you form addresses by adding the contents of two registers to a displacement had a displacement of eight bits. The 68020 let you use a 16-bit displacement
    in that mode. Since base-index addressing is so fundamental to accessing arrays, I think that the 68020 added at least _one_ thing that was
    essential rather than superfluous.

    However, instructions in that mode took up three 16-bit words, so I won't
    argue against the claim that the 68000 and 68020 also had a lot of
    addressing modes that _weren't_ needed. In order to have 16-bit
    displacements instead of 12-bit ones, with 3-bit register fields instead
    of 4-bit ones, so following the 68000 instead of System/360, I made the
    format of memory reference instructions in the original Concertina this:

    opcode (7 bits)
    destination register (3 bits)
    index register (3 bits)
    base register (3 bits)

    The destination register could be any of the eight general registers.

    The index register could be general register 1 to 7; 0 in the field means
    no indexing.

    The base register could be base register 1 to 7; if 0 is in that field,
    then the "index register" field becomes the "source register" field,
    and the instruction is a 16-bit long register-to-register instruction.
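
    A minimal sketch in C of pulling those fields out of the 16-bit
    halfword; the ordering of the fields from the most significant bit
    down, and the assumption that a 16-bit displacement halfword follows
    for memory references, are for illustration only:

        #include <stdint.h>

        /* opcode(7) | dest(3) | index(3) | base(3), assumed packed from
           the most significant bit downward.  base == 0 means the
           instruction is the 16-bit register-to-register form; otherwise
           a 16-bit displacement would follow. */
        struct ct_mem {
            unsigned opcode, dest, index, base;
        };

        struct ct_mem decode_halfword(uint16_t insn)
        {
            struct ct_mem d;
            d.opcode = (insn >> 9) & 0x7F;
            d.dest   = (insn >> 6) & 0x07;
            d.index  = (insn >> 3) & 0x07;
            d.base   =  insn       & 0x07;
            return d;
        }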

    My goal was to combine the best of the System/360 and the 68000 in a
    single architecture - but then I switched to including every feature
    but the kitchen sink, so as to give me an opportunity to explain how
    they all worked.

    Since my base registers could not be used as index registers, they
    weren't the same as the address registers of the 68000.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Jan 3 14:41:51 2024
    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit then entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of Concertina
    II, and that turned out to be too much, and I had to drop them.

    But then, RISC-V was designed with little or no regard for code density,
    while code density has been one of my foremost considerations in the design
    of Concertina II, so this is hardly a fair comparison.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Wed Jan 3 15:17:16 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have general-purpose
    registers, while IA-32 does. And the 68000 then grew baroque extensions
    in the 68020, at a time when the rest of the world already knew that
    such things are more hindrance than help.

    It is true that in addition to eight general registers, the 68000 also
    had eight _address_ registers. But in the addressing modes that used
    address registers, there was a bit to use a general register instead,
    so I don't think one can say that the 68000 didn't have general registers.

    The 68000 has 8 address registers and 8 data registers. Motorola say
    so themselves. It has no general-purpose registers. You may wish
    that the data registers could be used as GPRs by there being an
    addressing mode "(Dn)", but neither the 68000 nor the 68020 have such
    an addressing mode. I know, because I tried to code things in 68000
    assembly where I first used some instruction that produces the result
    in a data register, and wanted to use the result as address; this is
    only possible by first moving the result to an address register.

    You may not find my memory trustworthy, so look yourself at <https://en.wikibooks.org/wiki/68000_Assembly/Addressing_Modes> and
    search for the non-existent (Dn) addressing mode. This page includes
    the 68020 addressing modes; they added all kinds of baroque stuff, but
    not (Dn).

    As for the 68020: with the 68000, the only address mode that let you form
    addresses by adding the contents of two registers to a displacement had a
    displacement of eight bits. The 68020 let you use a 16-bit displacement
    in that mode. Since base-index addressing is so fundamental to accessing
    arrays, I think that the 68020 added at least _one_ thing that was
    essential rather than superfluous.

    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119

    So 124 occurrences of displacements that don't fit into unsigned 8
    bits, and 119 that fit into unsigned 8 bits, but not into signed 8
    bits, a total of less than 0.1% of the static instructions. And yes,
    counting with

    [~:145991] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[-]0x[0-9a-f]*[(]%[^)]*,'|sed 's/.*-0x/-0x/'|sed 's/(.*$//'|wc -l
    1202

    there are 1202 occurrences of negative displacements, so making
    displacement a signed number is more valuable than fitting the values
    in the range 128..255 into the displacement.

    But sure, making the displacement longer is not a major problem of the
    68020; the question is still whether they added more complication
    than benefit. And given that the benefit is tiny, the answer is
    probably yes.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Quadibloc on Wed Jan 3 11:18:10 2024
    Quadibloc wrote:

    Given that David Patterson, one of the designers of MIPS, was on the
    RISC-V design team, though, I can quite understand if many people
    expect the RISC-V design to be a paragon of excellence - even before
    they had looked at it.

    Hennessy was Stanford MIPS in 1981, Patterson was RISC-1 at Berkeley in 1981.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Wed Jan 3 16:27:04 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Wed Jan 3 16:51:23 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    Ok, so I measured the main firefox binary (firefox puts a lot of stuff
    in shared libraries, so the main binary contains only a part of the
    code):

    [~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
    129114
    [~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    134

    The next part was a copy-paste error. Here's the correct number:

    [~:146002] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    12

    At least for Firefox your explanation with the larger structures does
    not seem to hold. Looking at the larger displacements, many don't
    seem to be due to field offsets:

    [~:146004] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |uniq -c
    1 0x1000
    7 0x10000
    2 0x1010
    8 0x180
    2 0x1c0
    2 0x2000
    6 0x20000
    3 0x280
    2 0x2b0
    1 0x30b
    4 0x320
    20 0x359d3e2a
    8 0x380
    1 0x4d0
    20 0x5a827999
    6 0x600
    20 0x6ed9eba1
    20 0x70e44324
    1 0x8000000

    Anyway, in the Firefox binary slightly more than 0.1% of the
    instructions have offsets outside the signed 8-bit range. Still does
    not seem essential to me.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Wed Jan 3 16:42:02 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    Ok, so I measured the main firefox binary (firefox puts a lot of stuff
    in shared libraries, so the main binary contains only a part of the
    code):

    [~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
    129114
    [~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    134
    [~:146001] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119

    So in this binary 0.2% of the instructions have displacements that do
    not fit into a signed 8 bits. Essential?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Jan 3 17:07:43 2024
    Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have
    general-purpose registers, while IA-32 does. And the 68000 then grew
    baroque extensions in the 68020, at a time when the rest of the world
    already knew that such things are more hindrance than help. And the hindrance showed, when the 68040 and 68060 took longer than Intel's counterparts, and much longer than the competing RISCs: The two-wide
    50MHz 68060 appeared in the same year as the 4-wide 266MHz 21164.

    Architecture is as much about what to leave out as what to put in.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Wed Jan 3 17:06:25 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    Ok, so I measured the main firefox binary (firefox puts a lot of stuff
    in shared libraries, so the main binary contains only a part of the
    code):

    [~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
    129114
    [~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    134

    The next part was a copy-paste error. Here's the correct number:

    [~:146002] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    12

    At least for Firefox your explanation with the larger structures does
    not seem to hold. Looking at the larger displacements, many don't
    seem to be due to field offsets:

    [~:146004] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |uniq -c
    1 0x1000
    7 0x10000
    2 0x1010
    8 0x180
    2 0x1c0
    2 0x2000
    6 0x20000
    3 0x280
    2 0x2b0
    1 0x30b
    4 0x320
    20 0x359d3e2a
    8 0x380
    1 0x4d0
    20 0x5a827999
    6 0x600
    20 0x6ed9eba1
    20 0x70e44324
    1 0x8000000

    Anyway, in the Firefox binary slightly more than 0.1% of the
    instructions have offsets outside the signed 8-bit range. Still does
    not seem essential to me.


    I'm not sure you are picking up all the offsets with your grep.

    For one of my applications:

    5 0x100
    3 0x110
    5 0x1170
    4 0x160
    3 0x16b0
    14 0x170
    5 0x1720
    1 0x1723
    5 0x1724
    5 0x1728
    6 0x18a0
    1 0x198
    13 0x1a0
    3 0x1b0
    3 0x1d0
    5 0x200
    2 0x230
    20 0x28f8
    20 0x2900
    3 0x2f0
    8 0x3308
    4 0x350
    2 0x3528
    5 0x3748
    2 0x40a0
    5 0x54ed6
    2 0x54ef48
    9 0x5559f0
    5 0x555a41
    5 0x55f50
    2 0x800
    1 0x8b0
    1 0x9a0
    1 0xe6438
    1 0xe6590
    1 0xe728c
    1 0xe7668
    6 0xe7990
    3 0xe7a60

    232294: 48 89 85 08 d7 ff ff mov %rax,-0x28f8(%rbp)
    23229b: 48 8d 95 08 d7 ff ff lea -0x28f8(%rbp),%rdx
    2322a2: 48 8b 85 70 cd ff ff mov -0x3290(%rbp),%rax

    Why isn't 0x3290 in the output of the grep?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Jan 3 17:05:59 2024
    BGB wrote:

    On 1/1/2024 2:12 PM, Paul A. Clayton wrote:


    I wish there were world enough and time for everyone (especially
    experts) to publish their experience and wisdom and everyone to
    interact with that wisdom, but I can intellectually (if not
    emotionally) recognize that recording history is often not as
    critical as making history.


    Might make sense if Mitch put his specifications and other stuff up on
    GitHub or something?...

    If someone could explain how I could do this, I would.

    At least, assuming it is meant to be open.

    Then, it becomes something one can look at, at their leisure.




    My main concern with PRED is that it seems like it will involve some
    amount of implicit architectural state which would need to be dealt with somehow in interrupt handlers (and "pipeline state" is extra hairy).

    PRED state is 8-bits in thread-header.

    Well, and also "make hardware do all of this stuff" isn't really part of
    my philosophy. Or, effectively, any state that may exist, the interrupt handler needs to make sure it can save/restore it correctly.

    Note: HW is responsible for saving and restoring state in My 66000, not SW.

    I agree with Mitch Alsup that having to paste constants together
    in software (or load them as if variable data) is suboptimal
    generally. (There may be some cases where the importance of static
    text size [or working set] justifies the extra effort of a level
    of indirection, but such would generally seem to be a performance
    loser.)


    Yeah, I can also agree with this.

    Though it seems a point of disagreement that I consider jumbo-prefixes
    and (occasionally) dropping constants into temporary registers, to be acceptable.

    The jumbo-prefix scheme does effectively still break the constant into pieces, but, at least all the pieces get reassembled within a single clock-cycle (unlike the multi-instruction case).

    Does still have the annoyance of needing to have relocs for these cases
    (and it is also desirable to try to limit the number of reloc types).


    I disagree, where some things seem to be corner cutting in areas
    where doing so is a foot gun, and other areas being needlessly
    expensive (and some things in the reaches of "extensions land"
    being just kinda absurd).
    In some ways, it is (as I see it) better to define some things and
    leave them as optional, rather than define little, and leave
    everyone else to make an incoherent mess of things.

    One of the benefits of such is being able to approach elegance;
    nonce extensions have difficulty appropriating synergy.

    I do not really understand the hostility to subsetting.


    Yeah.

    Though, I sometimes wonder if defining everything up-front, and then
    allowing for implementations to use subsets, may make the ISA spec seem
    more threatening.

    This is my plan ! And it makes the ISA way cleaner than "anyone can add an extension" RISC-V model.

    Say, "Look at all this stuff, all this complexity", when someone doing a minimal implementation can safely ignore "most of it".

    As long as you don't violate the ISA specs of the things you implement
    you are OK.

    Then again, likely there is disagreements as to what sorts of
    features seem meaningful, wasteful, or needless extravagance.

    This is as it should be. Special purpose or experimental features
    should be viewed as "wasteful" when the target of those features
    is not shared. The contention also concerns the limited space for
    standardized extensions within a single encoding space.
    Standardized extensions can avoid redundant effort and some
    incompatibility, but without modes to break-up the encoding space
    the more extensions there are, the less free encoding space remains.

    This also introduces the argument about extensions, coprocessors,
    and accelerators. Accelerators are obviously least tied to the ISA
    interface, but changing an accelerator can be effectively as
    incompatible as an ISA change. (Of course, microarchitecture
    changes can break software performance.)


    Yeah.

    Then there may also be things like putting devices in MMIO, but then
    needing some way to detect if the device, or certain functionality is present.

    Options like, "well, write this magic bit pattern to this MMIO register,
    read it back, and see how the bits are set" is a little tacky.

    Cores are devices and have a configuration page in configuration space
    you can directly read core capabilities from here. L2s are similar.
    So, CPUID is merely a LD to config space.
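    A minimal C sketch of that idea follows; the base address, register
    offsets, and field names are invented for illustration and are not the
    actual My 66000 configuration-page layout.

        #include <stdint.h>

        /* Hypothetical configuration page for one core; the address and
           offsets below are made up for the example. */
        #define CORE_CFG_BASE  0xFFFF0000UL   /* assumed config-space address */
        #define CFG_CORE_ID    0x00           /* vendor/model identification  */
        #define CFG_CORE_CAPS  0x08           /* capability bits              */

        static inline uint64_t cfg_read(uintptr_t base, uintptr_t off)
        {
            /* "CPUID" is just an ordinary load from the configuration page. */
            return *(volatile uint64_t *)(base + off);
        }

        int core_has_feature(uint64_t feature_bit)
        {
            return (cfg_read(CORE_CFG_BASE, CFG_CORE_CAPS) & feature_bit) != 0;
        }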

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Jan 3 17:15:46 2024
    EricP wrote:

    Quadibloc wrote:

    Given that David Patterson, one of the designers of MIPS, was on the
    RISC-V design team, though, I can quite understand if many people
    expect the RISC-V design to be a paragon of excellence - even before
    they had looked at it.

    Hennessy was Stanford MIPS in 1981, Patterson was RISC-1 at Berkeley in 1981.

    Stanford MIPS became MIPS the company
    Berkeley RISC-1 became Sun Microsystems and named SPARC

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Wed Jan 3 17:16:29 2024
    Scott Lurndal wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    You might try EMBench.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Wed Jan 3 17:42:22 2024
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have general-purpose
    registers, while IA-32 does. And the 68000 then grew baroque extensions
    in the 68020, at a time when the rest of the world already knew that
    such things are more hindrance than help.

    It is true that in addition to eight general registers, the 68000 also
    had eight _address_ registers. But in the addressing modes that used
    address registers, there was a bit to use a general register instead,
    so I don't think one can say that the 68000 didn't have general registers.

    That used the _contents_ of the register, not where it was pointing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Wed Jan 3 17:44:57 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?
    ...
    232294: 48 89 85 08 d7 ff ff mov %rax,-0x28f8(%rbp)
    23229b: 48 8d 95 08 d7 ff ff lea -0x28f8(%rbp),%rdx
    2322a2: 48 8b 85 70 cd ff ff mov -0x3290(%rbp),%rax

    Why isn't 0x3290 in the output of the grep?

    Because the grep is intended to pick up only reg+reg+disp addressing
    (with optional scaling), not reg+disp addressing. So it works as
    intended.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Wed Jan 3 18:58:20 2024
    On Wed, 03 Jan 2024 15:17:16 +0000, Anton Ertl wrote:

    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Every time I access an array element!

    Because presumably the array will be somewhere in a 64K byte chunk of
    memory with an associated USING statement, so I need base register + 16
    bit displacement to specify the start of the array, and an index register
    to point to the element within the array.

    Otherwise, I would need to use an extra instruction prior to the array
    access to add two things together, and put the result in an index
    register.

    As for your memory: another post here explained what I missed. The bit
    which I thought indicated using a data register used its contents.

    However, that doesn't make sense to me for an instruction which also has
    a *displacement*, since then the displacement must be ignored. Unless
    it's an immediate add to the value...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Jan 3 20:55:23 2024
    BGB wrote:

    On 1/3/2024 11:05 AM, MitchAlsup wrote:


    My main concern with PRED is that it seems like it will involve some
    amount of implicit architectural state which would need to be dealt
    with somehow in interrupt handlers (and "pipeline state" is extra hairy).
    PRED state is 8-bits in thread-header.


    Yeah, but presumably it is a mask that shifts 1 bit per every
    instruction in the pipeline. If an interrupt occurs, then whatever state
    gets captured needs to be correct WRT the pipeline stage that the
    interrupt is captured off of.
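
    As a rough model of the presumed behaviour, in C: the low bit of an
    8-bit mask gates each following instruction, and the mask itself is
    saved and restored alongside the PC. The encoding and field names here
    are assumptions, not the actual definition.

        #include <stdbool.h>
        #include <stdint.h>

        /* Assumed saved state: the 8-bit predicate mask travels with the PC,
           so an interrupt between predicated instructions loses nothing. */
        struct saved_state {
            uint64_t pc;
            uint8_t  pred;   /* one bit per following instruction: 1 = execute */
        };

        /* Called once per instruction in the predicated region. */
        static bool pred_take(struct saved_state *ss)
        {
            bool execute = ss->pred & 1;   /* does this instruction run?   */
            ss->pred >>= 1;                /* consume one bit of the mask  */
            return execute;
        }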

    Granted, I guess this isn't really too much different (in premise) than needing to get PC / SR / registers into a coherent state.

    It is no harder than getting IP correct at an interrupt.

    Well, and also "make hardware do all of this stuff" isn't really part
    of my philosophy. Or, effectively, any state that may exist, the
    interrupt handler needs to make sure it can save/restore it correctly.

    Note: HW is responsible for saving and restoring state in My 66000, not SW.

    I did it full software in my case, but mostly to try to save cost on a mechanism that is used comparably infrequently.

    I used the same mechanism for prologue and epilogue sequences, so it gets
    used often.

    Like, need to try to find the cheapest possible mechanism that still
    allows state to be saved/restored well enough that the program doesn't
    just explode whenever an interrupt occurs.

    Doing it in HW eliminates the need for "a couple of" control registers
    to access the "stack" when control arrives at exception or interrupt dispatcher. But realistically, thread-state is 5 cache lines of thread
    specific data with a known thread-specific virtual address--so this all
    looks like a cache with 5-contiguous lines of state which one can
    "remember" with a single physical address.......


    Though, I sometimes wonder if defining everything up-front, and then
    allowing for implementations to use subsets, may make the ISA spec
    seem more threatening.

    This is my plan ! And it makes the ISA way cleaner than "anyone can add
    an extension" RISC-V model.


    Yeah.

    Consistency, with the tradeoff that people now have to see a full ISA spec,
    rather than say:
    Integer ISA spec;
    FPU ISA spec;
    Privileged Mode spec;
    ...

    All as separate specification documents.

    I have an ISA specification document, how unprivileged SW uses ISA as a document, and how privileged SW uses ISA as a document; all with cross
    document pointers. Having separate documents allows the non-proprietary
    ISA to be distributed, allowing full access to the ISA but no knowledge of
    privileged state. {{There are no privileged instructions, but there
    is privileged state.}} I still have the privileged document under NDA.


    Options like, "well, write this magic bit pattern to this MMIO
    register, read it back, and see how the bits are set" is a little tacky.

    Cores are devices and have a configuration page in configuration space
    you can directly read core capabilities from here. L2s are similar.
    So, CPUID is merely a LD to config space.

    Each "block" around the chip contains 8 performance counters, and other
    control registers. The counters can be sampled en masse using LDM and
    reset en masse using STM. So, one has 8 performance counters in each of
    {CPU, L2, interconnect, L3, DRAM, Hostbridge, IOMMU}.
    The high resolution counter/timer is one of these counters.



    Traditional way configuration worked as I understood it on older systems
    was say:
    Attempt a read access to an I/O page, if read returns device is present
    if read times out no device
    Read kind, vendor, and device from IO page.
    Use these to access driver from table.
    then::
    Write values to IO ports;
    Read values back;
    See if response is what is expected (say, if you only get 00 or FF,
    assume hardware is absent or doesn't work);
    Hope that some other hardware isn't at that address which totally owns
    the PC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Wed Jan 3 22:42:17 2024
    Chris M. Thomasson wrote:

    On 1/3/2024 10:58 AM, Quadibloc wrote:
    On Wed, 03 Jan 2024 15:17:16 +0000, Anton Ertl wrote:

    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Compilers can go through multiple clever steps to hoist indexing out of
    loops {consuming registers} and get the need down under about 5%. However,
    if you have [base+index<<scale+displacement] it ends up getting used around
    8% of the time.


    Every time I access an array element!

    Because presumably the array will be somewhere in a 64K byte chunk of
    memory with an associated USING statement, so I need base register + 16
    bit displacement to specify the start of the array, and an index register
    to point to the element within the array.

    In My 66000 case, when you use scaled indexing, you have access to 32-bit
    and 64-bit displacements. So
    LDD R7,[R19,R5<<3]
    is 1 word, but:
    LDD R7,[R19,R5<<3,DISP32]
    is 2 words and 1 instruction, and:
    LDD R7,[R19,R5<<3,DISP64]
    is 3 words and 1 instruction.

    You also have the ability to do::
    STD #3.1415926535892145,[SP,16]
    as a 3 word instruction that stores 2 words on the stack as a single instruction. This form is used a lot, so while it is not "indexing"
    it is highly useful.

    Not sure if this is relevant. If the 64K byte chunk was aligned on a 64K
    byte boundary, then we can round a pointer to somewhere in the chunk
    down to the nearest 64K byte boundary. This gives us a pointer to the beginning of the chunk. I used this trick in some of my per-thread
    memory allocators. To free memory a thread would round the address down
    to the nearest chunk size and push the memory into a list. Memory
    allocations had to be at least the size of a word, or they would get
    rounded up to word size.
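
    For illustration, the rounding trick in C, assuming a 64 KB chunk that
    is 64 KB aligned:

        #include <stdint.h>

        #define CHUNK_SIZE 0x10000u    /* 64 KB, a power of two */

        /* Given any pointer into an aligned chunk, recover the chunk base. */
        static void *chunk_base(void *p)
        {
            return (void *)((uintptr_t)p & ~(uintptr_t)(CHUNK_SIZE - 1));
        }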

    It is best to avoid the 64KB limitations altogether; allowing .data
    to be "significantly" far away while still allowing single instruction
    access. {This is what universal constants brings to the party}
    In numerics code one sees:
    LDD R7,[IP,R3<<3,.LBB_002345_foo-.]
    where foo[] can reside within ±2GB of the LDD instruction, as a 2 word instruction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 01:15:40 2024
    On Wed, 03 Jan 2024 22:42:17 +0000, MitchAlsup wrote:

    It is best to avoid the 64KB limitations altogether; allowing .data to
    be "significantly" far away while still allowing single instruction
    access.

    I agree with that. However, my solution to that is a different
    one, which indeed is not so efficient.

    Immediates in my design are strictly for immediate mode operations,
    and can't also be used as absolute addresses, as you are doing.

    Instead, what I have is "array mode", which is a kind of post-indexed
    indirect addressing (array addresses are put in a short segment that
    a special base register points to). So the array address is referenced
    by a short displacement, but that means an extra memory access is
    needed, instead of the address being in the instruction stream.

    Can I modify my instruction format to allow for instead using your
    more efficient solution to this problem? There probably is room;
    change a 12-bit displacement to an 11-bit displacement, and 11
    bits is plenty when I only need six bits...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 01:33:07 2024
    Quadibloc wrote:

    On Wed, 03 Jan 2024 22:42:17 +0000, MitchAlsup wrote:

    It is best to avoid the 64KB limitations altogether; allowing .data to
    be "significantly" far away while still allowing single instruction
    access.

    I agree with that. However, my solution to that is a different
    one, which indeed is not so efficient.

    Immediates in my design are strictly for immediate mode operations,
    and can't also be used as absolute addresses, as you are doing.

    LDD R7,[IP,R3<<3,.L00BK123.foo - .]

    Is not an absolute address! IP is added as the base register and "-." ,
    as part of the displacement, subtracts that very same IP value. So,
    the displacement is not absolute, but a trick is used to make it smell
    as if it were.

    Instead, what I have is "array mode", which is a kind of post-indexed indirect addressing (array addresses are put in a short segment that
    a special base register points to). So the array address is referenced
    by a short displacement, but that means an extra memory access is
    needed, instead of the address being in the instruction stream.

    Can I modify my instruction format to allow for instead using your
    more efficient solution to this problem? There probably is room;
    change a 12-bit displacement to an 11-bit displacement, and 11
    bits is plenty when I only need six bits...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Jan 4 03:41:25 2024
    BGB wrote:

    On 1/3/2024 2:55 PM, MitchAlsup wrote:

    I did it full software in my case, but mostly to try to save cost on a
    mechanism that is used comparably infrequently.

    I used the same mechanism for prologue and epilogue sequences, so it gets
    used often.


    OK.

    Though, not having either is technically the cheapest option.

    You can buy chips with 64 GBOoO 4-to-6-wide cores on them, and you are
    worrying about a sequencer made of 100-odd gates !?


    All as separate specification documents.

    I have an ISA specification document, how unprivileged SW uses ISA as a
    document, and how privileged SW uses ISA as a document; all with cross
    document pointers. Having separate documents allows the non-proprietary
    ISA to be distributed, allowing full access to the ISA but no knowledge of
    privileged state. {{There are no privileged instructions, but there is
    privileged state.}} I still have the privileged document under NDA.


    Hmm...

    My stuff is all public (in my GitHub repository), had assumed that
    anyone that might want to do their own implementation would be free to
    do so.

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean record
    of my <potential> inventions identifying when they were first conceived.

    <snip>

    OK, I don't have any real performance counters at the ISA level.

    This is the advantage of define everything and subset certain things back
    out.

    The microsecond counter was mostly so that programs using functions like "clock()" wouldn't burn too much CPU time with system calls (for some
    types of programs, it is not uncommon to make rapid-fire calls trying to
    get the current time in milliseconds or microseconds).

    I have been in discussions as to whether a RNG is used to add white noise
    to the high precision timer to make side-channels harder to utilize......
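
    A hedged sketch of the idea in C, with rand() standing in for a
    hardware RNG and the jitter width chosen arbitrarily:

        #include <stdint.h>
        #include <stdlib.h>

        extern uint64_t read_hires_counter(void);   /* assumed raw HW counter */

        /* Return the counter with a few ticks of noise added, so another
           thread cannot time cache behaviour to the exact cycle.  The +/-8
           tick range and the use of rand() are arbitrary placeholders. */
        uint64_t read_fuzzed_counter(void)
        {
            int jitter = (rand() & 15) - 8;
            return read_hires_counter() + (uint64_t)(int64_t)jitter;
        }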

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 06:14:09 2024
    On Thu, 04 Jan 2024 01:33:07 +0000, MitchAlsup wrote:

    LDD R7,[IP,R3<<3,.L00BK123.foo - .]

    Is not an absolute address! IP is added as the base register and "-." ,
    as part of the displacement, subtracts that very same IP value. So, the displacement is not absolute, but a trick is used to make it smell as if
    it were.

    That would never have occurred to me.

    I do use program counter relative addressing in my instruction set -
    in the 16 bit instructions (which are now removed from the main
    instruction set, but still exist as 17-bit instructions in blocks
    of variable-length instructions) there are conditional branch
    instructions (inspired by the PDP-11 and TI 9900) with 8-bit
    signed program counter relative displacements.

    But that's it.

    The reason it would never have occured to me to make full-size
    addresses program counter relative instead of absolute is because
    now the linking loader would have to handle them differently. It
    couldn't just _ignore_ them because the relative positions of the
    code segment and data segment of a program aren't determined at
    compile time; the operating system needs to be free to allocate
    them separately.

    The loader can relocate programs by adding the value of the
    appropriate segment start location to a full-size address within
    the code. That might be an address constant in the data segment,
    or it could be something else. But I don't want to ask the loader
    to do *anything else* for purposes of relocation.
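
    A minimal sketch of that loader-side fixup in C, assuming a simple
    (hypothetical) relocation table of offsets where full-size absolute
    addresses live:

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Hypothetical relocation record: where in the image a full-size
           address sits, and which segment base gets added at load time. */
        struct reloc {
            size_t offset;     /* byte offset of the 64-bit address word   */
            int    is_data;    /* 1: add data-segment base, 0: code base   */
        };

        void apply_relocs(uint8_t *image, const struct reloc *r, size_t n,
                          uint64_t code_base, uint64_t data_base)
        {
            for (size_t i = 0; i < n; i++) {
                uint64_t v;
                memcpy(&v, image + r[i].offset, sizeof v);
                v += r[i].is_data ? data_base : code_base;  /* the only fixup */
                memcpy(image + r[i].offset, &v, sizeof v);
            }
        }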

    I have found opcode space - not the space I originally speculated
    about using - for this addition, so

    http://www.quadibloc.com/arch/cw01.htm

    has been revised.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 08:31:55 2024
    On Wed, 03 Jan 2024 17:07:43 +0000, MitchAlsup wrote:

    Architecture is as much about what to leave out as what to put in.

    This is very true, and of course the major flaw in Concertina II is that
    my choice is, in so far as it is at all possible, to leave nothing
    out - to do, in a single instruction, anything that almost any other
    computer ever was able to do in a single instruction.

    With a few exceptions in order to pretend to remain within reason.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 08:36:43 2024
    On Wed, 03 Jan 2024 02:47:11 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Right now, though, there's no real motive for people to go from x86 to
    ARM.

    You know, as much as I hate Intel and x86, I hate Apple even more.

    I can appreciate the sentiment, as the restrictions on Apple's App Store
    mean that iOS devices are simply not an option I can consider. And, of
    course, Macs tend not to be upgradeable, and this seems to be so that
    Apple can charge higher prices.

    But there's also Windows on ARM. And there's the whole smartphone
    ecosystem of Android. But all these things together don't provide an
    incentive to leave x86.

    PowerPC and SPARC also exist as RISC alternatives, besides ARM and
    RISC-V, but they've been forgotten, bypassed, sidelined, or whatever.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Thu Jan 4 09:19:41 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit the entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of Concertina
    II, and that turned out to be too much, and I had to drop them.

    But then, RISC-V was designed with little or no regard for code density,

    I think that the fact that they left 3/4ths of the encoding space to
    16-bit instructions shows that they care quite a bit for encoding
    size. If they did not, they would not have the C extension (the
    16-bit instructions) at all.

    How successful are they? Let's update my code-size measurements with
    current data.

    ARCHS="amd64 arm64 armel armhf i386 mips64el ppc64el riscv64 s390x"
    for i in $ARCHS; do
      wget http://ftp.at.debian.org/debian/pool/main/b/bash/bash_5.2.21-2_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/b/bash/bash_5.2.21-2+b1_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/g/grep/grep_3.11-4~exp1_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/g/gzip/gzip_1.12-1_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/g/gzip/gzip_1.12-1+b2_$i.deb
    done
    for i in $ARCHS; do
      for j in bash_5.2.21-2_$i.deb bash_5.2.21-2+b1_$i.deb grep_3.11-4~exp1_$i.deb gzip_1.12-1_$i.deb gzip_1.12-1+b2_$i.deb; do
        if test -f $j; then
          binary=bin/${j%%_*}
          if test "$binary" = "bin/grep"; then
            binary=usr/bin/grep
          fi
          ar x $j; tar xfJ data.tar.xz ./$binary; objdump -h $binary|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
        fi
      done
      echo $i
    done|sort -nk1

    This produces:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    So RV64GC beats every other 64-bit instruction set in code density by
    a wide margin and the code density is similar to the 32-bit ARM
    A32/T32 instruction set. Given this evidence, it seems to me that
    RV64GC (and its basis RISC-V) was designed with a lot of
    consideration for code density.

    One difference between armhf and armel is that armhf uses T32/A32
    (Thumb2 instructions) while armel uses only A32 (fixed-width 32-bit instructions). This probably accounts for most of the size
    difference between armhf and armel.

    It's interesting that A32/T32 and RV64GC with their fixed-width base
    and compressed extension beat the variable-width AMD64, i386, and
    S390x by such a wide margin.

    In case of armhf vs i386, you cannot even make the legacy argument,
    because ARM A32 was designed at the same time as i386, and T32 only
    tacked on later; ok, you may consider i386 to be tacked on to the
    slightly older 8086 instruction set, but given that 8086 code does not
    work in an i386 binary unless you set some mode flags first, while A32
    code runs without setting a mode bit on an A32/T32-capable CPU, the
    situation is not quite the same.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Quadibloc on Thu Jan 4 04:32:42 2024
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood
    to be.

    My comments about the PDP-8 and RISC were not about what the
    meaning of RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described
    by the people who originally defined the term. Please see my
    longer response to John Levine.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Tim Rentsch on Thu Jan 4 14:26:40 2024
    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Thu Jan 4 10:09:06 2024
    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean record
    of my <potential> inventions identifying when they were first conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Quadibloc on Thu Jan 4 18:18:26 2024
    On Thu, 4 Jan 2024 14:26:40 -0000 (UTC)
    Quadibloc <quadibloc@servername.invalid> wrote:

    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood
    to be.

    My comments about the PDP-8 and RISC were not about what the
    meaning of RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by
    the people who originally defined the term. Please see my longer
    response to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    John Savard


    Current common understanding by whom?
    If you'd ask an average embedded programmer or engineer whether cores
    of his dear Cortex-M microcontrollers are RISC or not, then an absolute
    majority among those who would be able to understand your question
    (which by themselves will likely be in minority) will say "Yes, they
    are".
    As you know, the only instruction set supported by Cortex-M cores
    (except M0) has instructions of two lengths and 16 general-purpose
    registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Thu Jan 4 16:31:59 2024
    On Thu, 04 Jan 2024 09:19:41 +0000, Anton Ertl wrote:

    I think that the fact that they left 3/4ths of the encoding space to
    16-bit instructions shows that they care quite a bit for encoding size.
    If they did not, they would not have the C extension (the 16-bit instructions) at all.

    That is a good point. I think I confused efficient use of RAM with
    efficient use of opcode space.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Thu Jan 4 16:38:23 2024
    On Wed, 03 Jan 2024 14:41:51 +0000, Quadibloc wrote:

    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit the entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of
    Concertina II, and that turned out to be too much, and I had to drop
    them.

    And this, of course, highlights another flaw of Concertina II, especially
    when contrasted with MY 66000.

    Concertina II uses virtually every scrap of available opcode space within
    the 32-bit instruction word. Just recently, I came up with an ingenious
    way to add one bit to the available (non-prefix) portion of the
    zero-overhead instruction/header (which lets me sneak in an operate
    instruction using a pseudo-immediate without using a whole 32-bit
    instruction slot to provide the three bits needed to reserve space for
    the pseudo-immediate value)... which allowed the set of opcodes I
    wanted to provide, _and_ allowed me to do a zero-overhead version of
    the new extra-long absolute address instruction (only for loads and
    stores) as well.

    Another recent change to the architecture was including instructions
    longer than 32 bits as part of the basic 32-bit instruction set without
    headers (through "composed instructions")... because I knew I needed
    a larger opcode space desperately and couldn't just restrict its
    availability to where it could be implemented efficiently.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 18:47:06 2024
    Quadibloc wrote:

    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    Which precludes FP calculations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 18:56:11 2024
    Quadibloc wrote:

    On Wed, 03 Jan 2024 14:41:51 +0000, Quadibloc wrote:

    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit the entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of
    Concertina II, and that turned out to be too much, and I had to drop
    them.

    And this, of course, highlights another flaw of Concertina II, especially when contrasted with MY 66000.

    Concertina II uses virtually every scrap of available opcode space within
    the 32-bit instruction word.

    Whereas My 66000 has 21 slots freely available at the Major OpCode level.
    and 6 permanently reserved to prevent jumping into data and executing,
    out of the allocated 64 slots. {1,2,3}-Operand calculation instructions
    use 1 slot each. In essence I reserve 1/3rd of the OpCode space for the
    future and pre-reserved 1/10 of the OpCode Space for security.

    Just recently, I came up with an ingenious
    way to add one bit to the available (non-prefix) portion of the
    zero-overhead instruction/header (which lets me sneak in an operate instruction using a pseudo-immediate without using a whole 32-bit
    instruction slot to provide the three bits needed to reserve space for
    the pseudo-immediate value)... which allowed the set of opcodes I
    wanted to provide, _and_ allowed me to do a zero-overhead version of
    the new extra-long absolute address instruction (only for loads and
    stores) as well.

    Another recent change to the architecture was including instructions
    longer than 32 bits as part of the basic 32-bit instruction set without headers (through "composed instructions")... because I knew I needed
    a larger opcode space desperately and couldn't just restrict its
    availability to where it could be implemented efficiently.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 19:00:46 2024
    Quadibloc wrote:

    On Wed, 03 Jan 2024 02:47:11 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Right now, though, there's no real motive for people to go from x86 to
    ARM.

    You know, as much as I hate Intel and x86, I hate Apple even more.

    I can appreciate the sentiment, as the restrictions on Apple's App Store
    mean that iOS devices are simply not an option I can consider. And, of course, Macs tend not to be upgradeable, and this seems to be so that
    Apple can charge higher prices.

    But there's also Windows on ARM. And there's the whole smartphone
    ecosystem of Android. But all these things together don't provide an incentive to leave x86.

    I would really like MS to go back to windows 7 {last one I liked}.....

    PowerPC and SPARC also exist as RISC alternatives, besides ARM and
    RISC-V, but they've been forgotten, bypassed, sidelined, or whatever.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Thu Jan 4 19:16:32 2024
    EricP wrote:

    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean record
    of my <potential> inventions identifying when they were first conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"

    If you describe how something works it loses its patentability.
    If you describe what something does abstractly it does not.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Jan 4 19:20:02 2024
    BGB wrote:

    On 1/4/2024 9:09 AM, EricP wrote:
    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of
    patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean
    record of my <potential> inventions identifying when they were first
    conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"


    My concern was more with the possibility of lawyers being jerks...

    I can alleviate your concerns--they are.

    But, if one mostly sticks to design features that were already in use
    20-30 years ago; there isn't much the lawyers can do...

    And written in books or published in papers.

    Granted, one could argue that this does not cover every possible way in
    which these features could be combined, which is a possible area for
    concern.

    Though, for the most part, it seems that the "enforcement" is mostly
    used against either direct re-implementations of a patented technology,
    or against popular common-use technologies that can be "interpreted" to somehow infringe on a patent (even if the artifact described is often
    almost entirely different), rather than going after ex-nihilo hobby
    projects or similar.

    Also note: if you are not making money by using something claimed in their patent, they can sue but they cannot get any money. So, it is not worth
    their time.....

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Jan 5 02:43:43 2024
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Quadibloc on Fri Jan 5 14:25:27 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been a usable Windows release.....

    Unix forever! :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Scott Lurndal on Fri Jan 5 15:01:05 2024
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been a usable Windows release.....

    Unix forever! :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Jan 5 15:37:50 2024
    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    However, that can be cured. Since the feature is specifically
    *for* stuff like data acquisition programs that run straight on
    the hardware, treat it as an optional feature... which is *not
    included* on any virtual machine.

    Which is great, of course, unless you would like to virtualize
    some data acquisition software for purposes of debugging. So
    instead a more appropriate response is perhaps to _allow_
    including fast context switching through slow mode in virtual
    machines... with a warning in the manual that this is only
    to be done when necessary, as it comes with a huge performance
    hit.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Jan 5 15:18:46 2024
    On Fri, 24 Nov 2023 03:11:17 +0000, MitchAlsup wrote:

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Oh, dear. You've just given me an evil idea.

    On a System/360, context switching wasn't too bad. You just save
    and restore the 16 general registers and the floating-point
    registers.

    On a more recent CPU, you might have to save and restore the
    general registers, the floating-point registers, and the SIMD
    registers.

    On Concertina II, in addition to 32 integer registers, 32
    floating-point registers, 16 SIMD registers, there are also
    eight 64-element vector registers!

    On the Texas Instruments TI 9900, there were 16 general registers
    which were 16 bits long - but they were in memory, so context
    switching was _really_ fast, you just saved and restored the
    workspace pointer!

    So the evil idea is...

    while the CPU does have real registers in order to run at an
    acceptable speed, allow it to also run in "slow mode" with
    a workspace pointer and all the registers in RAM!
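
    A toy C model of the scheme: in slow mode every register access is a
    memory access through the workspace pointer, so a context switch is
    just a pointer change. The types and names here are illustrative only.

        #include <stdint.h>

        /* Slow-mode thread: all "registers" live in RAM behind the
           workspace pointer. */
        struct slow_thread {
            uint64_t *wp;    /* workspace pointer into (on-chip) RAM */
        };

        static uint64_t reg_read(struct slow_thread *t, unsigned r)
        {
            return t->wp[r];          /* a register read is a load   */
        }

        static void reg_write(struct slow_thread *t, unsigned r, uint64_t v)
        {
            t->wp[r] = v;             /* a register write is a store */
        }

        /* "Context switch": nothing to spill, just change the pointer. */
        static void switch_to(struct slow_thread **currentp, struct slow_thread *next)
        {
            *currentp = next;
        }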

    To make it slightly less evil, have the address in the
    workspace pointer point into an on-chip static RAM instead
    of external DRAM.

    And have a second bank of real registers, into which the
    register contents are gradually migrated as the program
    is running - I think the 990/10 or at least the 990/12
    actually used the technique of gradually migrating registers
    in RAM into real registers in the CPU for better performance,
    so that's not new.

    Of course, code that doesn't know it's running in slow mode
    will wastefully save and restore those in-memory registers,
    so the feature would be primarily recommended for use with
    special programs specifically designed for coping with things
    like a high frequency of interrupts.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 5 19:49:18 2024
    Quadibloc wrote:

    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer
    point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register
    can switch between user threads, and writing 2 control registers
    switches between GuestOSs,.....

    However, that can be cured.

    Yes, by placing the data in the right place at the beginning.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 5 19:46:51 2024
    Quadibloc wrote:

    On Fri, 24 Nov 2023 03:11:17 +0000, MitchAlsup wrote:

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Oh, dear. You've just given me an evil idea.

    On a System/360, context switching wasn't too bad. You just save
    and restore the 16 general registers and the floating-point
    registers.

    On a more recent CPU, you might have to save and restore the
    general registers, the floating-point registers, and the SIMD
    registers.

    On Concertina II, in addition to 32 integer registers, 32
    floating-point registers, 16 SIMD registers, there are also
    eight 64-element vector registers!

    One of the reasons My 66000 only has 32 GPRs is context switch time.
    5 cache lines go out, 5 cache lines come in, presto you are in an
    entirely different context--with no more smarts added than a cache.
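
    As a rough sketch in C, assuming 64-byte cache lines: 32 GPRs fill four
    lines and one more line holds control state. The field names beyond the
    GPRs are illustrative, not the actual thread-state format.

        #include <stdint.h>

        #define CACHE_LINE 64

        /* 5 x 64 bytes = 320 bytes of architected thread state. */
        struct thread_state {
            uint64_t gpr[32];      /* 4 cache lines of general registers */
            uint64_t ip;           /* 1 cache line of control state...   */
            uint64_t psw;
            uint64_t root_pointer;
            uint64_t reserved[5];
        };

        _Static_assert(sizeof(struct thread_state) == 5 * CACHE_LINE,
                       "thread state should span exactly five cache lines");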

    On the Texas Instruments TI 9900, there were 16 general registers
    which were 16 bits long - but they were in memory, so context
    switching was _really_ fast, you just saved and restored the
    workspace pointer!

    Remembering where those 5 cache lines came from means you can
    deposit the data where it belongs long term rather than on
    the system/kernel stack.

    So the evil idea is...

    while the CPU does have real registers in order to run at an
    acceptable speed, allow it to also run in "slow mode" with
    a workspace pointer and all the registers in RAM!

    Just treat the registers as if they were a cache from an area
    in memory no other thread will be accessing.

    To make it slightly less evil, have the address in the
    workspace pointer point into an on-chip static RAM instead
    of external DRAM.

    Unless you can get all levels of privilege in that RAM you
    just added complexity and complexity management to context
    switch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Jan 5 21:43:05 2024
    mitchalsup@aol.com (MitchAlsup) writes:
    Quadibloc wrote:

    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer
    point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register
    can switch between user threads, and writing 2 control registers
    switches between GuestOSs,.....

    Doesn't this cost at least one cache line in L1?

    Intel and AMD do this for the virtual machine state, but there's
    an access cost to read from dram. ARM64 keeps all the VM
    state in a small number of system registers that the HV can
    swap as necessary.


    However, that can be cured.

    Yes, by placing the data in the right place at the beginning.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Jan 5 23:44:35 2024
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    ARM64 keeps all the VM
    state in a small number of system registers that the HV can
    swap as necessary.

    My 66000 memory maps all control registers so even a remote CPU
    can diddle with stuff a local CPU will see instantaneously
    {mainly for debug of dead core}.

    ARM64 cores have a similar feature.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Jan 5 23:15:21 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Quadibloc wrote:

    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    I submit the proper place for memory resident register files and thread-state is in DRAM. Then, writing a single control register
    can switch between user threads, and writing 2 control registers
    switches between GuestOSs,.....

    Doesn't this cost at least one cache line in L1?

    No, because HW is doing the reads and writes, the data streams around
    the L1D. That is, it may have to pass by L1 on the way in, and it can
    pass by L1 on the way out, but it does not interact with the footprint of
    data or inst already in L1. {I am leaning on not storing in L2 on the
    way out but in L3}. Inbound accesses probe the caches, so if data is present
    it gets used. Outbound accesses probe the caches and are written on hits.

    One can in principle bypass the caches on the way in and on the way out.
    DRAM <-> core registers
    or even
    DRAM -> core registers -> DRAM
    where newly arriving registers push out the existing registers.

    You are not expecting the 5 to be needed any time soon.

    Intel and AMD do this for the virtual machine state, but there's
    an access cost to read from dram.

    The important point about using the word DRAM is that this 5-cache
    line structure has a fixed PA. It can be cached anywhere and that
    when that thread in not in control all its thread-state appears to
    be is in that PA.

    ARM64 keeps all the VM
    state in a small number of system registers that the HV can
    swap as necessary.

    My 66000 memory maps all control registers so even a remote CPU
    can diddle with stuff a local CPU will see instantaneously
    {mainly for debug of dead core}.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Jan 5 23:57:14 2024
    On Fri, 05 Jan 2024 19:49:18 +0000, MitchAlsup wrote:
    Quadibloc wrote:
    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace
    pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating system to
    another, you have to do a "big context switch" where you save and
    restore all the registers _and_ that on-chip static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register can
    switch between user threads, and writing 2 control registers switches
    between GuestOSs,.....

    That would certainly make my "evil idea" less evil.

    But, at first glance, that seems like something that
    couldn't possibly be true. Registers are in constant
    use by the processor, so accessing them should be very
    fast. DRAM is slow!

    Of course, though, a little bit of context shows that
    you're not as badly wrong as you might seem at first
    glance. Any computer these days with any pretensions
    to efficiency has cache.

    Oops: I missed reading "memory-resident" above; you did
    not claim that _all_ register files belong in RAM, just
    that my idea of having a special internal memory to allow
    putting registers in memory was a bad one (which I won't
    try to deny).

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Jan 6 00:18:23 2024
    On Fri, 05 Jan 2024 23:15:21 +0000, MitchAlsup wrote:

    My 66000 memory maps all control registers so even a remote CPU can
    diddle with stuff a local CPU will see instantaneously {mainly for debug
    of dead core}.

    Oh, darn. I was going to save money by not providing proper cache
    coherency hardware in implementations of Concertina II, but that
    means I couldn't provide this useful feature!

    Just kidding... sort of.

    Mapping control registers to RAM is something I would never have
    thought of, but I would indeed put pins on the package, the function
    of which would be openly documented, to allow accessing chip internals.

    My perverted purpose in doing so, though, was not so much for legitimate debugging as to permit my chips to be used in retrocomputing toys...

    A computer with a *real front panel* just like in the old days, not
    just one like on the Altair that only handles the external memory bus!

    As for cache coherency... well, of course that has to be supported
    for a computer to actually work the way it's supposed to without
    error. However, the way I would handle it is like this:

    The CPU only bothers about cache coherency for cached data from
    memory that has been _explicitly marked as shared_. So the
    CPUs connected to the same memory have a message bus between them;
    when one requests some memory to be shared, it sends a message
    out about that, and doesn't use that memory until it gets acknowledged;
    _then_ the CPUs that are sharing a certain area of memory notify
    each other when they write to that area of memory.

    The CPUs have to be told - they don't try to keep track of everything
    anyone else might be doing on the bus.
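    To make that handshake concrete, here is a rough sketch with invented
    message names (illustrative only, not a worked-out protocol):

    #include <stdint.h>

    enum coh_msg_type {
        COH_SHARE_REQUEST,  /* "please treat [base, base+len) as shared" */
        COH_SHARE_ACK,      /* "acknowledged; I will report my writes"   */
        COH_WRITE_NOTIFY    /* "I wrote to this address in a shared area" */
    };

    struct coh_msg {
        enum coh_msg_type type;
        uint64_t base;      /* start of the region, or the address written  */
        uint64_t len;       /* size of the region, or size of the write     */
        uint32_t sender;    /* CPU id of the originator                     */
    };

    /* The requesting CPU broadcasts COH_SHARE_REQUEST and waits for acks
       from its peers before touching the region; thereafter every CPU that
       writes the region broadcasts COH_WRITE_NOTIFY so the other sharers
       can invalidate or update their cached copies. */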

    However, I haven't really thought through this aspect of CPU chip
    design. Since a microprocessor needs to handle the full speed of the
    memory bus in order to talk to memory, possibly bus monitoring is
    simpler than a conversational protocol handling only the memory that
    "needs" to be monitored.

    Come to think of it, though, perhaps a CPU needs to be able to do this
    both ways - bus monitoring for normal multi-CPU motherboards, and a conversational protocol so the chips can also be used in NUMA systems.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sat Jan 6 01:35:02 2024
    Quadibloc wrote:

    On Fri, 05 Jan 2024 19:49:18 +0000, MitchAlsup wrote:
    Quadibloc wrote:
    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace
    pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating system to
    another, you have to do a "big context switch" where you save and
    restore all the registers _and_ that on-chip static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register can
    switch between user threads, and writing 2 control registers switches
    between GuestOSs,.....

    That would certainly make my "evil idea" less evil.

    But, at first glance, that seems like something that
    couldn't possibly be true. Registers are in constant
    use by the processor, so accessing them should be very
    fast. DRAM is slow!

    Normally you are not as dense as you display tonight.

    Registers have a PA but can be in a core or somewhere
    in the memory hierarchy {not config, not MMI/O}
    and normal caching rules COULD apply.

    Of course, though, a little bit of context shows that
    you're not as badly wrong as you might seem at first
    glance. Any computer these days with any pretensions
    to efficiency has cache.

    Oops: I missed reading "memory-resident" above; you did
    not claim that _all_ register files belong in RAM, just
    that my idea of having a special internal memory to allow
    putting registers in memory was a bad one (which I won't
    try to deny).

    All registers have a landing zone where they can be put back
    or brought forth defined by a PA. HW is responsible for
    obtaining new thread-state and of storing old thread-state.

    BUT BECAUSE thread-state is completely defined by a single
    PA, HW can change from one thread to another by writing
    the control register holding that context PA.
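    As a minimal illustrative sketch only (the register name and address
    below are invented for illustration, not taken from any My 66000
    documentation), a thread switch in such a scheme reduces to one store:

    #include <stdint.h>

    /* hypothetical memory-mapped control register holding the context PA */
    #define CONTEXT_PA_REG ((volatile uint64_t *)0xFFFF0000ULL)

    /* Point the hardware at the thread-state saved at new_ctx_pa; HW is
       assumed to spill the old thread-state back to its own PA and refill
       the register file from the new one. */
    static inline void switch_thread(uint64_t new_ctx_pa)
    {
        *CONTEXT_PA_REG = new_ctx_pa;
    }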

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sat Jan 6 01:41:42 2024
    Quadibloc wrote:

    On Fri, 05 Jan 2024 23:15:21 +0000, MitchAlsup wrote:

    My 66000 memory maps all control registers so even a remote CPU can
    diddle with stuff a local CPU will see instantaneously {mainly for debug
    of dead core}.

    Oh, darn. I was going to save money by not providing proper cache
    coherency hardware in implementations of Concertina II, but that
    means I couldn't provide this useful feature!

    Just kidding... sort of.

    Mapping control registers to RAM is something I would never have
    thought of, but I would indeed put pins on the package, the function
    of which would be openly documented, to allow accessing chip internals.

    My perverted purpose in doing so, though, was not so much for legitimate debugging as to permit my chips to be used in retrocomputing toys...

    A computer with a *real front panel* just like in the old days, not
    just one like on the Altair that only handles the external memory bus!

    As for cache coherency... well, of course that has to be supported
    for a computer to actually work the way it's supposed to without
    error. However, the way I would handle it is like this:

    The CPU only bothers about cache coherency for cached data from
    memory that has been _explicitly marked as shared_.

    So, shared instruction sections are marked exclusive ?!?
    So, thread-local-storage is marked shared if a pointer to its cats
    is constructed !?!
    Can a Hypervisor share code sections with Guest OS ??
    ,...

    Conversely, My 66000 allows one to map ROM (coherence and order free)
    onto DRAM, to provide relief from coherence traffic.

    So the
    CPUs connected to the same memory have a message bus between them;
    when one requests some memory to be shared, it sends a message
    out about that, and doesn't use that memory until it gets acknowledged; _then_ the CPUs that are sharing a certain area of memory notify
    each other when they write to that area of memory.

    The CPUs have to be told - they don't try to keep track of everything
    anyone else might be doing on the bus.

    But certainly, when writing a buffer in VA[k] to disk, the core caches
    have to be snooped so the disk gets the correct data.

    However, I haven't really thought through this aspect of CPU chip
    design. Since a microprocessor needs to handle the full speed of the
    memory bus in order to talk to memory, possibly bus monitoring is
    simpler than a conversational protocol handling only the memory that
    "needs" to be monitored.

    Come to think of it, though, perhaps a CPU needs to be able to do this
    both ways - bus monitoring for normal multi-CPU motherboards, and a conversational protocol so the chips can also be used in NUMA systems.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Jan 6 09:01:02 2024
    On Sat, 06 Jan 2024 01:41:42 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    The CPU only bothers about cache coherency for cached data from memory
    that has been _explicitly marked as shared_.

    So, shared instruction sections are marked exclusive ?!?
    So, thread-local-storage is marked shared if a pointer to its cats is constructed !?!
    Can a Hypervisor share code sections with Guest OS ??

    Presumably, when I get around to designing this part of the hardware,
    I would check on what industry-standard practice is. I may indeed
    have failed to properly think some things through.

    But I can still try to answer your questions, I think.

    The first question:

    Instruction sections aren't normally writeable. So cache coherency
    becomes a given; it's only lost when you write. Presumably, then,
    a shared instruction section would be...

    part of the OS,

    a shared library,

    a permanently resident popular program (i.e. a FORTRAN compiler on
    an ancient mainframe)

    and these areas would have only been written to by the operating system.

    So an OS thread would be its "owner", but other threads could read it.

    Your second question:

    The pointer can exist; the memory has to be readable only if the pointer
    is actually used. And it's marked shared if it's used for writing as well
    as reading.

    Your third question:

    At first, I thought that this was something you would never want to do.

    But actually, it's quite common: there might be multiple instances of
    one particular guest OS running, and so one might as well start them off
    with all permanently resident parts of the OS loaded - and that memory
    might as well be shared by all the instances (and, initially, at least,
    by the parent hypervisor as well) to avoid duplication.

    Stuff that is only shared for reading isn't a coherency issue.

    The CPUs have to be told - they don't try to keep track of everything
    anyone else might be doing on the bus.

    But certainly, when writing a buffer in VA[k] to disk, the core caches
    have to be snooped so the disk gets the correct data.

    If you've designated an area in memory to be a buffer for DMA...

    then you need to treat it like video memory inside a video card.
    Mark it non-cacheable. So I do _not_ expect DMA controllers to
    have a cache snoop capability; as for the CPUs, I was thinking
    in terms of them always broadcasting any changes to shared memory,
    so it's always "push" and never a "pull" so that snoop would be
    needed. But cache snoop is common, so I guess it reduces message
    traffic for cache coherency, which means I'll need to study how
    this is done some more.

    But you have reminded me of something I had forgotten. I thought
    that the CPUs, because they have to work with the memory bus
    at its full speed, could monitor every write to memory, and so
    maintain cache coherency that way, as an option, even if that
    wasn't my preferred option.

    But unless you always and only have write-through caches, the
    actual value of a location in memory can change before a hint
    of that gets out to the bus. In the case where the CPUs talk
    directly to each other about everything that happens in shared
    memory, that isn't a problem - but if they were to just monitor
    the bus without direct communication, they would miss recent
    updates to shared memory.

    Actually, even _with_ a write-through cache, there would still
    be a certain slight delay of a few cycles in a write, which
    would be entirely sufficient to cause occasional problems.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Sat Jan 6 10:16:14 2024
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    Load-store architecture.
    Relatively large register file (32 or more registers)

    ... and the 801, the original ARM v2 (without Thumb) weren't,
    either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sat Jan 6 11:12:37 2024
    On Sat, 06 Jan 2024 09:01:02 +0000, Quadibloc wrote:

    Stuff that is only shared for reading isn't a coherency issue.

    Ouch. Stuff that is only being read isn't a coherency issue.

    But if even one CPU writes to an area of memory, with all the
    other CPUs to which it is shared only reading, clearly when
    those CPUs read, they may need to be sure of reading up-to-date
    information when they read it.

    Of course, though, if the read appears to have taken place
    earlier than it actually did, something else would have to
    have happened that contradicts that for there to be a real
    inconsistency, but the additional interaction that could
    lead to that could also be in the form of a read in the same
    direction rather than a write in the other direction.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sat Jan 6 12:42:30 2024
    MitchAlsup <mitchalsup@aol.com> schrieb:
    BGB wrote:

    On 1/4/2024 9:09 AM, EricP wrote:
    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of
    patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few of my patents knowing their expiration dates. I also have a clean
    record of my <potential> inventions identifying when they were first
    conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"


    My concern was more with the possibility of lawyers being jerks...

    I can alleviate your concerns--they are.

    But, if one mostly sticks to design features that were already in use
    20-30 years ago; there isn't much the lawyers can do...

    And written in books or published in papers.

    Granted, one could argue that this does not cover every possible way in
    which these features could be combined, which is a possible area for
    concern.

    Though, for the most part, it seems that the "enforcement" is mostly
    used against either direct re-implementations of a patented technology,
    or against popular common-use technologies that can be "interpreted" to
    somehow infringe on a patent (even if the artifact described is often
    almost entirely different), rather than going after ex-nihilo hobby
    projects or similar.

    Also note: if you are not making money by using something claimed in their patent, they can sue but they cannot get any money. So, it is not worth
    their time.....

    At least in Germany, there are exceptions to patent protection,
    among them using a patent privately for non-commercial purposes
    and doing research (commercial or otherwise) on the subject of
    the patent (§ 11 Patentgesetz). The latter is very important if,
    for example, people want to try out if what is claimed in the
    patent actually works.

    Not sure what the situation in the US is.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Jan 6 16:21:35 2024
    On Mon, 04 Dec 2023 20:03:47 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Since out-of-order is so expensive in power and transistors, though, if
    mitigations do exact a performance cost, then going to a simple CPU
    that is not out-of-order might be a way to accept a loss of
    performance, but gain big savings in power and die size, whereas
    mitigations make those worse.

    18 years ago, when I quit building CPUs professionally, GBOoO
    performance was 2× what a 1-wide IO could deliver. In those 18 years
    the CPU makers have gone from 2× to 3× performance while the execution window has grown from 48 to 300 instructions.
    Clearly an unsustainable µArchitectural direction.

    Yes, the law of diminishing returns means that even if Moore's Law
    still lives on, they can't go _much_ further in that direction.

    But do they have any other directions they can go in to get more
    performance?

    We have heard of a few:

    1) Switch to a new, faster, semiconductor material if it becomes
    possible.

    2) Add new instructions, so as to make some additional operations
    faster. Better yet, put something like an FPGA in the CPU, so
    the chip can do anything quickly!

    3) If we can't make the processors faster, provide more of them.
    This is being done - first they put two CPUs on a chip, then four,
    and now we're seeing quite a few.

    Since we don't yet _have_ a new, faster semiconductor material
    we can use, and since single-thread performance is what is most
    ardently desired because software tends to be largely serial...
    taking out-of-order to extreme lengths, despite diminishing returns,
    has continued to be the most attractive option. Yes, that will have
    to come to an end, but before it does, it may go at least a little
    further, to a point which will seem even more like wretched excess
    to you and many others.

    And this brings me to

    4) Adopt a new ISA, based on a design that does much of what OoO
    does without OoO, based on DSP designs using VLIW and so on. Then,
    with that as a base, also apply OoO, and one should need _less
    extreme_ OoO for the same performance. And get more performance
    when reaching the same level of wretched excess as was tolerated
    before.

    My Concertina II, with its VLIW features, and even (optional)
    instructions to use banks of 128 registers is an attempt to
    show what such an ISA might look like. Or how about an OoO
    implementation of the Itanium? Or even, after the Mill becomes
    popular, a way might be figured out to apply OoO techniques
    to implementing that design, however revolting the thought may
    be to Ivan Godard and its other designers!

    Thanks to the end of Dennard scaling, until a new semiconductor
    material comes along, the pressure to find some way to increase
    performance still more is likely to lead to many novel designs,
    at least some of which will be weird and grotesque.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Jan 6 17:15:01 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    RISC-V without the C extension would be, but the C extension would
    make it non-RISC. Likewise ARM A32 would be, but A32/T32 would not.

    Power has instructions that are not 32 bits in size? Since when?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Quadibloc on Sat Jan 6 09:35:58 2024
    Quadibloc <quadibloc@servername.invalid> writes:

    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:

    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    It seems you are talking about the definition of an early
    RISC processor.

    What I'm talking about is the orginal description of the
    RISC concept.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jan 6 17:49:21 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    RISC-V without the C extension would be, but the C extension would
    make it non-RISC. Likewise ARM A32 would be, but A32/T32 would not.

    Power has instructions that are not 32 bits in size? Since when?

    Since version 3.1 of the ISA (vulgo Power10), they have the prefixed instructions, which take up two 32-bit words. An example:

    [tkoenig@cfarm120 ~]$ cat add.c
    unsigned long int foo(unsigned long x)
    {
    return x + 0xdeadbeef;
    }
    [tkoenig@cfarm120 ~]$ gcc -c -O3 -mcpu=power10 add.c
    [tkoenig@cfarm120 ~]$ objdump -d add.o

    add.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: ad de 00 06 paddi r3,r3,3735928559
    4: ef be 63 38
    8: 20 00 80 4e blr

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to BGB on Sat Jan 6 17:50:18 2024
    BGB <cr88192@gmail.com> schrieb:
    On 1/5/2024 9:01 AM, Quadibloc wrote:
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever! :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.


    Windows merits:
    More software support;
    Has nearly all of the games;

    Steam doesn't do too badly with Linux.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sat Jan 6 19:52:13 2024
    On Sat, 06 Jan 2024 17:15:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    RISC-V without the C extension would be, but the C extension would
    make it non-RISC. Likewise ARM A32 would be, but A32/T32 would not.

    Power has instructions that are not 32 bits in size? Since when?

    - anton

    A32 wouldn't be, even without T2. Too few registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 18:50:46 2024
    BGB wrote:

    On 1/5/2024 9:01 AM, Quadibloc wrote:
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever! :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.


    Windows merits:
    More software support;
    Has nearly all of the games;
    No endless fights with trying to get the GPU and sound hardware working;
    Much less needing to fight with hardware driver issues in general;
    ....

    Linux merits:
    You can mount nearly anything anywhere;
    Can do low-level HDD copies, have more freedom for how to partition and format drives, more available filesystems, ...

    You can back the whole thing up such that recovery is but a DD away.

    Accessing files on Linux is generally significantly faster (though, allegedly, this isn't so much because of the filesystem itself, but
    rather because antivirus software and Windows Defender tend to hook the filesystem access and scan every file that is being read/written, ...).

    Though, in a Windows style environment, it is generally preferable to
    have a small number of comparably large files, than a large number of
    small files.


    General coding experience is not that much different either way.
    If one sticks to mainstream languages and writes code in a portable way,
    they can use mostly similar code on either (apart from code dealing with
    the parts that differ).

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 21:00:38 2024
    BGB wrote:

    On 1/6/2024 12:50 PM, MitchAlsup wrote:
    BGB wrote:

    On 1/5/2024 9:01 AM, Quadibloc wrote:
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I
    liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever!  :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.


    Windows merits:
    More software support;
    Has nearly all of the games;
    No endless fights with trying to get the GPU and sound hardware working;
    Much less needing to fight with hardware driver issues in general;
    ....

    Linux merits:
    You can mount nearly anything anywhere;
    Can do low-level HDD copies, have more freedom for how to partition
    and format drives, more available filesystems, ...

    You can back the whole thing up such that recovery is but a DD away.


    I often use Linux + DD to do low level copies of HDDs, which mostly
    works (and can often get an OK drive copy), except in cases where people ignored the drive failing for long enough that it is basically entirely failed, and then this is turned into a massive pain (modern Linux seems
    to drop drives about as soon as it encounters an irrecoverable IO
    error, which is super annoying for data recovery tasks).

    Consider that the alternative is a 4+ hour process (reloading and configuring W11); then reloading all your applications, passwords--and it never ends up "like it was".

    For my main PC, mostly still running Windows.
    For the most part, "everything just works", except when MS is doing
    something annoying.

    May or may not "jump ship" at some point though unless MS backs off on
    some of the stuff they pulled with Win11 (if/when Win10 starts to get unusable).

    Jumping ship, to me, is a dual system {1 Linux, 1 W <as low a number as possible>}
    connected by ethernet chassis to chassis.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 21:12:05 2024
    BGB wrote:

    On 1/6/2024 10:21 AM, Quadibloc wrote:
    On Mon, 04 Dec 2023 20:03:47 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Since out-of-order is so expensive in power and transistors, though, if mitigations do exact a performance cost, then going to a simple CPU
    that is not out-of-order might be a way to accept a loss of
    performance, but gain big savings in power and die size, whereas
    mitigations make those worse.

    18 years ago, when I quit building CPUs professionally, GBOoO
    performance was 2× what an 1-wide IO could deliver. In those 18 years
    the CPU makers have gone from 2× to 3× performance while the execution >>> window has grown from 48 to 300 instructions.
    Clearly an unsustainable µArchitectural direction.

    Yes, the law of diminishing returns means that even if Moore's Law
    still lives on, they can't go _much_ further in that direction.


    Yes.

    And, even then, 2x .. 3x vs a 1-wide isn't THAT big of an advantage,
    given the GBOoO is going to use a lot more die area and power.


    But do they have any other directions they can go in to get more
    performance?

    We have heard of a few:

    1) Switch to a new, faster, semiconductor material if it becomes
    possible.

    <snip>
    3) If we can't make the processors faster, provide more of them.
    This is being done - first they put two CPUs on a chip, then four,
    and now we're seeing quite a few.

    Software continues to tell us that they cannot use 100+ cores, and
    the 3,4,5,6 they can use need to be as fast as one can figure out
    how to do. It is easily possible to put 256+ R3000 cores (plus FP)
    on a single die all of them running 3GHz+.

    This is where I had assumed small static scheduled CPUs could have merit.

    OoO costs roughly 3× In Order power and provides 1.4× performance (hand waving accuracy). GBOoO, on the other hand, costs roughly 4× and provides
    1.4× performance. So, overall, the last factor of 2× in performance costs 12× in area and power, and such cores are generally surrounded with larger caches to
    keep up with the larger throughput, raising the area (but not so much the
    power) again.
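    (One way to read those figures so that they compose, stated here as an
    assumption rather than something given above: 3× × 4× = 12× in power and
    area, while 1.4× × 1.4× ≈ 2× in performance, which is where "the last
    factor of 2× costs 12×" comes from.)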

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 22:43:58 2024
    BGB wrote:


    This is where I had assumed small static scheduled CPUs could have merit.
    OoO costs roughly 3× In Order power and provides 1.4× performance (hand
    waving accuracy). GB, on the other hand, costs roughly 4× and provides
    1.4× performance. So, overall, the last factor of 2× in performance
    costs 12× in area and power and are generally surrounded with larger
    caches to
    keep up with the larger throughput raising the area (but not so much the
    power) again.

    OK.

    I guess the question is, say, the cost/benefit tradeoffs between OoO vs static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar to
    an in-order superscalar, except possibly a little cheaper since it can
    leave out one of the expensive parts of an in-order superscalar...).

    *: Say, designed for a maximum of 2 or 3 instructions/clock, with
    explicit tagging for parallel execution (where the 'V' in 'VLIW'
    seemingly tends to also imply wider execution often with an absence of
    useful things like interlock handling or register forwarding...).

    Also, assuming that one has a "doesn't suck" compiler for it...

    Is there a question here ??

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Jan 7 06:07:21 2024
    On Sat, 06 Jan 2024 22:43:58 +0000, MitchAlsup wrote:
    BGB wrote:

    I guess the question is, say, the cost/benefit tradeoffs between OoO vs
    static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar
    to an in-order superscalar, except possibly a little cheaper since it
    can leave out one of the expensive parts of an in-order
    superscalar...).

    *: Say, designed for a maximum of 2 or 3 instructions/clock, with
    explicit tagging for parallel execution (where the 'V' in 'VLIW'
    seemingly tends to also imply wider execution often with an absence of
    useful things like interlock handling or register forwarding...).

    Also, assuming that one has a "doesn't suck" compiler for it...

    Is there a question here ??

    I think it's clear what the _answer_ is:

    "You just described the Itanium. It failed big time, so your answer
    is no."

    Now, if you don't know the question, but you do have the answer, if it's something as enigmatic as "42", and you only have a vague description of
    the question: "The great question of life, the Universe, and everything",
    then the process of recovering the actual wording of the question can be
    very convoluted, involving pan-dimensional beings disguising themselves
    as white mice.

    However, in this case, I don't think it's that difficult.

    To be, or not to be, that is the question.

    Whether 'tis nobler in the mind to suffer the thermal issues and
    excessive power consumption resulting from the outrageous transistor
    counts of Great Big Out-of-Order microarchitectures,

    or to oppose them with an ISA which directly handles the pipeline in
    VLIW or even RISC fashion, and by opposing them, end them...

    I recall that I derived the following understanding of _your_
    answer to this question some time ago, but I may have misunderstood
    what you were writing:

    (begin my description of what I think your answer is)
    VLIW-style ISAs have failed to serve as a replacement for OoO
    execution.

    But that does not mean we are without hope of finding something
    better. The problem is that the standard textbooks have failed to
    properly represent what OoO is _for_.

    The scoreboard in the Control Data 6600 is just briefly mentioned,
    and then it's noted that it couldn't solve all the hazards related
    to RAW and WAR and so on, and then the Tomasulo came along for the
    IBM System/360 Model 91, and did it _right_.

    That misses the fact that register hazards aren't the only thing
    that OoO execution helps with. It also helps with *cache misses*.

    And the 6600-style scoreboard is adequate to deal with cache misses.

    Therefore, if you want to make a computer that replaces today's
    bloated GBOoO designs, without the transistor bloat, but which
    offers performance that competes with them, what you need to do
    is indeed take care of the register hazards the way RISC
    architectures have done... but then, instead of abolishing OoO
    from your design after you've done that, keep the basic and
    reasonable 6600-style scoreboard so that cache misses don't
    kill your performance.
    (end description)

    I may have gotten it badly wrong, as I pieced it together from
    little things you wrote here and there on various occasions.

    But at least now we have a straw man to point at and debate.
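    As a toy illustration of the scoreboard half of that straw man (purely
    invented code, not taken from any real design), the issue check can be as
    simple as per-register "result in flight" bits; an instruction whose
    operands are ready issues even while an earlier load that missed the
    cache is still outstanding:

    #include <stdbool.h>

    #define NREGS 32

    typedef struct {
        bool busy[NREGS];      /* true while a result is still in flight */
    } scoreboard;

    typedef struct {
        int dst, src1, src2;   /* register numbers, -1 if unused */
    } insn;

    /* Stall only on a RAW hazard (a source still in flight) or a simple
       WAW rule (the destination already has a result in flight). */
    static bool can_issue(const scoreboard *sb, const insn *i)
    {
        if (i->src1 >= 0 && sb->busy[i->src1]) return false;
        if (i->src2 >= 0 && sb->busy[i->src2]) return false;
        if (i->dst  >= 0 && sb->busy[i->dst])  return false;
        return true;
    }

    static void issue(scoreboard *sb, const insn *i)
    {
        if (i->dst >= 0) sb->busy[i->dst] = true;
    }

    /* Called when a result arrives, e.g. a load finally returning after a
       cache miss; younger independent instructions were never blocked. */
    static void complete(scoreboard *sb, int dst)
    {
        if (dst >= 0) sb->busy[dst] = false;
    }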

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Sun Jan 7 09:21:47 2024
    On Sun, 07 Jan 2024 01:14:52 -0600, BGB wrote:

    What if the goal isn't "fastest single-thread performance", but instead,
    best performance relative to die area and per watt?...

    If _that_ were the case, we would _already_ be using in-order CPUs, and
    the wasteful nature of out-of-order execution would have precluded its
    adoption entirely.

    As you've pointed out, where that _is_ the goal, things like Cortex A53
    cores are still doing just fine.

    But when it comes even to the humble low-end laptop, Intel found it
    necessary to redesign their Atom processor to be a lightweight OoO
    chip, instead of the in-order design it originally had.

    As the saying goes, nine women working together can't have a baby in
    one month. Most computational problems aren't "embarassingly parallel",
    so they don't scale well enough to avoid the situation we're in today:
    people want their programs to run as fast as the current state of the
    art in technology allows, and to attain that, they need the maximum single-thread performance attainable.

    The path to that which we currently have available involves out-of-order execution.

    I have no quarrel with OoO as a useful tool, but I also acknowledge that,
    as Mitch has pointed out, today's desktop microprocessors have taken it
    to the point of wretched excess.

    Humanity could survive in a world where video games had to be written to
    run acceptably on computers with a clock speed no higher than a single gigahertz!

    And OoO isn't the _only_ wretchedly excessive thing about today's microprocessors. The small feature sizes that allow a single die to
    contain eight complete CPUs with a great big out-of-order design
    are attained by means of chip fabs that cost billions of dollars to
    build. Couldn't we have just stopped at, say, 33nm or something?


    The competitive demands Intel and AMD face - the desires of us as
    consumers - are what prevents this from happening, and I see no hope
    for the world to change to what might be seen as the path of virtue in
    this area.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sun Jan 7 09:30:31 2024
    BGB <cr88192@gmail.com> writes:
    I can also note that I am still using a cellphone running on
    in-order Cortex A53 cores...


    Like, seemingly ARM has gone one direction, moving to primarily OoO
    cores for newer designs, but then a lot of cellphones are seemingly
    like, "Meh, whatever, we will just stick with a 8x Cortex-A53 chip from >MediaTek..."


    But, if OoO were clearly superior, presumably people would have stopped
    using the Cortex-A53 ...

    But, there were still chips being released in 2023 using exclusively A53 cores (and they appear to still be popular in cellphones).

    Say, for example: https://en.wikipedia.org/wiki/Moto_E7
    (Though, this is a model from 2020/2021).

    A Cortex-A53 is cheap, in both area and licensing fees to ARM. And
    the smartphones that use SoCs with these cores usually are cheap, too.
    If they were the same price, people would probably go for a smartphone
    with the Mediatek Dimensity 9300 with 4 OoO Cortex-X4 and 4 OoO
    Cortex-A720 or with an Apple A11 or later (not sure about A9 and A10,
    but A7 and A8 also used only OoO cores), neither of them with any of
    those in-order cores that you get with Qualcomm offerings.

    And if the users do not need the increased performance of the OoO
    cores, why should they pay more to get it?

    So, rather than (V)LIW competing against OoO, maybe it can compete
    against in-order superscalar? ...

    Not in smartphones, where software compatibility is a required
    feature.

    Or, with the higher end of the microcontroller space?...

    Even there, the benefits of a common platform means that the industry
    is consolidating on ARM; e.g., Philips (now NXP) made the Trimedia
    processors (VLIW), but terminated development in 2010. Some users,
    such as WD defecting to RISC-V to avoid the ARM tax, but RISC-V still
    provides a common platform. Are you (or anyone else) able to provide
    a VLIW platform that outcompetes ARM and RISC-V?

    My thinking is not so much that one should have an ISA that mandates
    VLIW, but instead, focuses on avoiding a few of the expensive parts of in-order superscalar (namely the logic for figuring out whether
    instructions can be executed in parallel).

    Apparently that logic is not as expensive as you think.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to BGB on Sun Jan 7 10:30:00 2024
    In article <undj1h$10fej$1@dont-email.me>, cr88192@gmail.com (BGB) wrote:

    Like, seemingly ARM has gone one direction, moving to primarily OoO
    cores for newer designs, but then a lot of cellphones are seemingly
    like, "Meh, whatever, we will just stick with a 8x Cortex-A53 chip
    from MediaTek..."

    But, if OoO were clearly superior, presumably people would have
    stopped using the Cortex-A53 ...

    OoO is currently superior for achieving high performance per clock, but in-order allows better performance per watt. The Cortex-A53 has
    successors, in Cortex-A55, Cortex-A510, and Cortex-A520, which have ISA upgrades, better power efficiency and options for bigger caches.
    Interestingly, they run at lower clock speeds for better performance.

    <https://en.wikipedia.org/wiki/ARM_Cortex-A520#Architecture_comparison>

    However, the -A53 seems to be cheaper to license, so it still gets used.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Sun Jan 7 14:30:53 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Sun, 07 Jan 2024 01:14:52 -0600, BGB wrote:

    What if the goal isn't "fastest single-thread performance", but instead,
    best performance relative to die area and per watt?...

    If _that_ were the case, we would _already_ be using in-order CPUs, and
    the wasteful nature of out-of-order execution would have precluded its adoption entirely.

    Reality check: There are areas where parallelism is embarrassing, or
    at least abundant, such as supercomputing and whatever people are
    using the 192-core AmpereOne, the 128-Core Bergamo, and the upcoming
    288-core Sierra Forest for. And yet Intel switched from the in-order
    Knights Corner to the OoO Knights Landing and eventually replaced
    this line with AVX-512-enhanced mainline Xeons (wide OoO). And they
    also use the OoO Gracemont (or its successor) for Sierra Forest rather
    than building something that has a larger number of in-order cores.

    My guess is that the overhead of a shared-memory interface is so big
    that it does not pay to replace one such interface and a medium to big
    OoO core with, say, two such interfaces and two tiny in-order
    cores, because the in-order cores are slower by more than a factor of
    2. And the fact that the in-order core itself is only 1/12 the size
    of the OoO core (or whatever number) does not really help because the
    core plus the shared-memory interface are not that much smaller.

    And OoO isn't the _only_ wretchedly excessive thing about today's microprocessors. The small feature sizes that allow a single die to
    contain eight complete CPUs with a great big out-of-order design
    are attained by means of chip fabs that cost billions of dollars to
    build. Couldn't we have just stopped at, say, 33nm or something.

    That would be a wretched excess. Intel uses the denser processes to
    reduce its production costs. Admittedly, with increasing wafer
    processing costs of recent processes that may no longer work (or maybe
    the wafter costs we read about just reflect the fact that TSMC now has
    a monopoly on the densest processes.

    The competitive demands Intel and AMD face - the desires of us as
    consumers - are what prevents this from happening, and I see no hope
    for the world to change to what might be seen as the path of virtue in
    this area.

    Nobody forces you to replace your CPU with one with a denser process.
    If you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or
    you can get a Raspi 3, where the SoC is made in 40nm (according to <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Jan 7 17:52:59 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Power has instructions that are not 32 bits in size? Since when?

    Since version 3.1 of the ISA (vulgo Power10), they have the prefixed instructions, which take up two 32-bit words. An example:

    [tkoenig@cfarm120 ~]$ cat add.c
    unsigned long int foo(unsigned long x)
    {
    return x + 0xdeadbeef;
    }
    [tkoenig@cfarm120 ~]$ gcc -c -O3 -mcpu=power10 add.c
    [tkoenig@cfarm120 ~]$ objdump -d add.o

    add.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: ad de 00 06 paddi r3,r3,3735928559
    4: ef be 63 38
    8: 20 00 80 4e blr

    Interesting. Maybe somebody read the long-constant advocacy in this
    group.

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    Ouch. This means that Power with prefixed instructions is the second instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed, while for MIPS with its multiple extravaganzas (apart from the load delay slots, the jump and call
    encoding is problematic) our solution was to just disable all
    optimizations based on this concatenation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andreas Eder@21:1/5 to Scott Lurndal on Sun Jan 7 20:10:06 2024
    On Fr 05 Jan 2024 at 14:25, scott@slp53.sl.home (Scott Lurndal) wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever! :-)

    +1

    'Andreas

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Jan 7 19:21:20 2024
    Quadibloc wrote:

    On Sat, 06 Jan 2024 22:43:58 +0000, MitchAlsup wrote:
    BGB wrote:

    I guess the question is, say, the cost/benefit tradeoffs between OoO vs
    static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar
    to an in-order superscalar, except possibly a little cheaper since it
    can leave out one of the expensive parts of an in-order
    superscalar...).

    *: Say, designed for a maximum of 2 or 3 instructions/clock, with
    explicit tagging for parallel execution (where the 'V' in 'VLIW'
    seemingly tends to also imply wider execution often with an absence of
    useful things like interlock handling or register forwarding...).

    Also, assuming that one has a "doesn't suck" compiler for it...

    Is there a question here ??

    I think it's clear what the _answer_ is:

    "You just described the Itanium. It failed big time, so your answer
    is no."

    Now, if you don't know the question, but you do have the answer, if it's something as enigmatic as "42", and you only have a vague description of
    the question: "The great question of life, the Universe, and everything", then the process of recovering the actual working of the question can be
    very convoluted, involving pan-dimensional beings disguising themselves
    as white mice.

    However, in this case, I don't think it's that difficult.

    To be, or not to be, that is the question.

    Whether 'tis nobler in the mind to suffer the thermal issues and
    excessive power consumption resulting from the outrageous transistor
    counts of Great Big Out-of-Order microarchitectures,

    or to oppose them with an ISA which directly handles the pipeline in
    VLIW or even RISC fashion, and by opposing them, end them...

    I recall that I derived the following understanding of _your_
    answer to this question some time ago, but I may have misunderstood
    what you were writing:

    (begin my description of what I think your answer is)
    VLIW-style ISAs have failed to serve as a replacement for OoO
    execution.

    But that does not mean we are without hope of finding something
    better. The problem is that the standard textbooks have failed to
    properly represent what OoO is _for_.

    The scoreboard in the Control Data 6600 is just briefly mentioned,
    and then it's noted that it couldn't solve all the hazards related
    to RAW and WAR and so on, and then the Tomasulo came along for the
    IBM System/360 Model 91, and did it _right_.

    Thornton SB for CDC 6600 is 11,000 gates for the whole thing.
    Tomasulo RS for IBM 360/91 is 11,000 gates per entry.

    That misses the fact that register hazards aren't the only thing
    that OoO execution helps with. It also helps with *cache misses*.

    One CAN solve the other hazards with another SB, should one choose.

    And the 6600-style scoreboard is adequate to deal with cache misses.

    Therefore, if you want to make a computer that replaces today's
    bloated GBOoO designs, without the transistor bloat, but which
    offers performance that competes with them, what you need to do
    is indeed take care of the register hazards the way RISC
    architectures have done... but then, instead of abolishing OoO
    from your design after you've done that, keep the basic and
    reasonable 6600-style scoreboard so that cache misses don't
    kill your performance.
    (end description)

    I may have gotten it badly wrong, as I pieced it together from
    little things you wrote here and there on various occasions.

    But at least now we have a straw man to point at and debate.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Jan 7 20:39:37 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    Ouch. This means that Power with prefixed instructions is the second instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed,

    Or by inserting NOPs in the right places; otherwise you lose the
    functionality for Power10.

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
    .LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Mon Jan 8 00:13:49 2024
    On Sun, 07 Jan 2024 14:30:53 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> tried to write:

    The competitive demands Intel and AMD face - the desires of us as
consumers - are what prevents this from happening, and I see no hope for the world to change to what might be seen as the path of virtue in this area.

    Nobody forces you to replace your CPU with one with a denser process. If
    you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
    can get a Raspi 3, where the SoC is made in 40nm (according to <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    What I wrote didn't contradict what you are saying in your response.

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us
    to buy newer and faster microprocessors, by refusing to continue
    issuing security updates for Windows 7, or, for that matter,
    Windows XP, Windows 98, or even Windows 3.1. Then I would be
    disagreeing with you, but I wasn't getting into that part of
    the issue.)

    Instead, what I wrote said that we, as consumers, are so greedy
    for ever faster computers that we are the ones to blame for forcing
    Intel and AMD to resort to techniques that require expensive fabs
    to make the chips, and that require the chips to have enormous
    numbers of transistors for each individual core.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Jan 8 00:23:28 2024
    On Sun, 07 Jan 2024 19:21:20 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    That misses the fact that register hazards aren't the only thing that
    OoO execution helps with. It also helps with *cache misses*.

One CAN solve the other hazards with another SB, should one choose.

    Now that is something I did not know.

    In fact, if I am understanding what you are saying here correctly:

    It is possible to design an out-of-order CPU which addresses all the
    basic types of register hazard, just as those designed using the
    Tomasulo algorithm or those which equivalently use register renaming
    instead, by using a modified form of the scoreboard of the Control
    Data 6600.

    Doing so would be more efficient, as the transistor count would be significantly lower.

...then, of course, my question is why isn't this what everyone is
    doing already?

    I mean, the answer *could* be that:

    Only I, Mitch Alsup, know how this can be done. The world will have
    to await my patent filing to find out how...

    which is, in fact, a fair answer; you deserve to be paid for such
    a valuable invention...

    but if that _isn't_ the answer, then what the answer could possibly
    be that could explain such counter-productive behavior evades me
    completely.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Mon Jan 8 00:29:36 2024
    On Mon, 08 Jan 2024 00:23:28 +0000, Quadibloc wrote:

    but if that _isn't_ the answer, then what the answer could possibly be
    that could explain such counter-productive behavior evades me
    completely.

    Further reflection allowed me to recognize that there _was_ another
    possible answer:

    I wasn't saying that there was any free lunch here. Remember, there is
    simple basic out-of-order execution, and Great Big out-of-order
    execution.

A basic out-of-order functional unit designed around a second 6600-style scoreboard wouldn't actually be all that much different from one that
    uses register renaming. In particular, they would be similar in the
    speedup achieved, and in their transistor counts.

    If _that_ is the answer, then my initial response resulted from me
    having misunderstood what you had written.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Mon Jan 8 00:40:16 2024
    Thomas Koenig wrote:

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
.LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    How did 17 adds become 8 ??

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Jan 8 00:44:23 2024
    Quadibloc wrote:

    On Sun, 07 Jan 2024 14:30:53 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> tried to write:

The competitive demands Intel and AMD face - the desires of us as consumers - are what prevents this from happening, and I see no hope for the world to change to what might be seen as the path of virtue in this area.

    Nobody forces you to replace your CPU with one with a denser process. If
    you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
    can get a Raspi 3, where the SoC is made in 40nm (according to
    <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    What I wrote didn't contradict what you are saying in your response.

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us
    to buy newer and faster microprocessors, by refusing to continue
    issuing security updates for Windows 7, or, for that matter,
    Windows XP, Windows 98, or even Windows 3.1. Then I would be
    disagreeing with you, but I wasn't getting into that part of
    the issue.)

    I am calling Strawman on this::

I am of the opinion that the SW that arrives with a box/laptop should be
the same over the lifetime of the product. I turn all updating of
SW off and remove power when I am not using the device, to prevent
    MS from updating things I DON'T want updated--this includes security
    patches.

    Instead, what I wrote said that we, as consumers, are so greedy
    for ever faster computers that we are the ones to blame for forcing
    Intel and AMD to resort to techniques that require expensive fabs
    to make the chips, and that require the chips to have enormous
    numbers of transistors for each individual core.

    In 07 I bought a W7 machine which worked for 9+years and died of
    a power transistor blowing out. I would still be using that machine
    today if it had not blown up. I reached the end of "chasing performance"
    more than a decade ago.....

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Jan 8 01:12:32 2024
    On Mon, 08 Jan 2024 00:44:23 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us to
    buy newer and faster microprocessors, by refusing to continue issuing
    security updates for Windows 7, or, for that matter, Windows XP,
    Windows 98, or even Windows 3.1. Then I would be disagreeing with you,
    but I wasn't getting into that part of the issue.)

    I am calling Strawman on this::

    I am of the opinion that the SW that arrives with a box/laptop be the
    same over the lifetime of the product. I turn all updating of SW off and remove power at time I am not using the device to prevent MS from
    updating things I DON'T want updated--this includes security patches.

    The fact that I am seeing your posts here means that you _have_ tried connecting a computer to the Internet. Which invalidates the first
    response to that which comes to mind.

Perhaps you use Linux or something. But as Windows users know very well
through sad experience, if you don't keep your computer up-to-date,
patching vulnerabilities in Windows as soon as they are discovered...
    your computer could end up infected within minutes of being connected to
    the Internet.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Jan 8 00:58:24 2024
    Quadibloc wrote:

    On Sun, 07 Jan 2024 19:21:20 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    That misses the fact that register hazards aren't the only thing that
    OoO execution helps with. It also helps with *cache misses*.

One CAN solve the other hazards with another SB, should one choose.

    Now that is something I did not know.

    In fact, if I am understanding what you are saying here correctly:

    It is possible to design an out-of-order CPU which addresses all the
    basic types of register hazard, just as those designed using the
    Tomasulo algorithm or those which equivalently use register renaming
    instead, by using a modified form of the scoreboard of the Control
    Data 6600.

    Yes, and you can include timing such that you can forward results as
operands, too. The only thing a SB mandates you do is to read the RF
after launch (which most GBOoO machines do today anyway).

    Doing so would be more efficient, as the transistor count would be significantly lower.

One has to be careful, as a SB has a quadratic component where
Tomasulo has a (heavy-weight) linear component.

CDC 6600 SB partitioned the registers into 3 files of 8 each,
and had unpipelined (but concurrent) function units. These
led to the small number of instructions waiting for launch.

...then, of course, my question is why isn't this what everyone is
    doing already?

The std textbook (H&P) basically says SB == bad, use Tomasulo.

    I mean, the answer *could* be that:

    Only I, Mitch Alsup, know how this can be done. The world will have
    to await my patent filing to find out how...

I gave it away when Luke asked for it.

    which is, in fact, a fair answer; you deserve to be paid for such
    a valuable invention...

    I did it for fun, actually, because I wanted to really know.

    but if that _isn't_ the answer, then what the answer could possibly
    be that could explain such counter-productive behavior evades me
    completely.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Jan 8 01:18:06 2024
    On Mon, 08 Jan 2024 00:58:24 +0000, MitchAlsup wrote:

    One has to be careful as a SB has a quadratic component where Tomasulo
    has (heavy weight) linear component.

    Ah. Given that OoO as currently used in desktop processors is of the GBOoO variety, that quadratic component would loom large, and thus this is a
big part of the answer to why everyone isn't doing it that way.

    But as you noted, this was a scheme unique to you - I had begun to
    speculate that perhaps register renaming, instead of being Tomasulo
    in disguise, could have been scoreboard-based at its very outset, and
    was doing a literature search for more information to see if that was
    how it went. But no, it wasn't.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Mon Jan 8 02:49:13 2024
    On Mon, 08 Jan 2024 01:12:32 +0000, Quadibloc wrote:
    But as Windows users know very well
    through sad experience is that if you don't keep your computer
    up-to-date,
    to patch vulnerabilities in Windows as soon as they are discovered...
    your computer could end up infected within minutes of being connected to
    the Internet.

    I was working in a call centre when MS 08-067 struck.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to BGB on Sun Jan 7 23:41:53 2024
    On Sat, 6 Jan 2024 11:48:40 -0600, BGB <cr88192@gmail.com> wrote:

    Windows merits:
    More software support;
    Has nearly all of the games;
    No endless fights with trying to get the GPU and sound hardware working;
    Much less needing to fight with hardware driver issues in general;

    - DLLs can have private heaps

    ...

    Windows demerits:

    - essentially no POSIX compliance. all the function is there but with
    different APIs

    - possible multiple instances of a DLL's code in memory
    [much less likely with 64-bit, but still possible]


    Linux merits:
    You can mount nearly anything anywhere;
Can do low-level HDD copies, have more freedom for how to partition and format drives, more available filesystems, ...

    - only one instance of any DLL's code in memory


    Linux demerits:

    - more difficult to give a DLL a private heap

    - mmap/mprotect/madvise are a crappy 1-button interface

- no easy way to monitor VMM pages for writes (e.g., for GC); one common workaround is sketched below


    for contrast see
    https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/
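
For what it's worth, the usual Linux workaround for the write-monitoring
point above is to write-protect the pages and catch the fault. A rough
sketch follows (names like tracked_base and on_write are invented here,
and calling mprotect() from a signal handler is the customary but not
strictly portable trick); Windows exposes the same facility directly via
VirtualAlloc(MEM_WRITE_WATCH) and GetWriteWatch().

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE  4096
    #define NPAGE 1024

    static char *tracked_base;            /* start of the watched region */
    static volatile char dirty[NPAGE];    /* one flag per watched page   */

    static void on_write(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t a = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1);
        uintptr_t b = (uintptr_t)tracked_base;
        if (a >= b && a < b + (uintptr_t)NPAGE * PAGE) {
            dirty[(a - b) / PAGE] = 1;     /* remember the dirty page */
            mprotect((void *)a, PAGE, PROT_READ | PROT_WRITE);
        }                                  /* a real handler re-raises otherwise */
    }

    static void start_watching(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_write;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        /* read-only mapping: the first write to each page faults once */
        tracked_base = mmap(NULL, (size_t)NPAGE * PAGE, PROT_READ,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }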



    Though, in a Windows style environment, it is generally preferable to
have a small number of comparatively large files, rather than a large number of
    small files.

    Depends on the cache configuration:

    Workstations default to what essentially is a (small) private cache
    per process: if a second process opens the same file, it gets its own
    cache copy. Even if lots of memory is available, once a process fills
    up its own little cache, it starts to thrash.

    Servers default to a single combined cache for all processes.

    It is possible to change the cache sizes, to run workstations with a
    single combined cache, or to run servers with per process private
caches ... in each case you just have to know what to diddle in the
    registry.



    General coding experience is not that much different either way.
    If one sticks to mainstream languages and writes code in a portable way,
    they can use mostly similar code on either (apart from code dealing with
    the parts that differ).

    ...

    The problem is that you are quite limited in what you can do without
using Windows' own APIs. Although it can be (and has been) done, it is difficult for a simple abstraction to paper over the differences
    between POSIX and Windows.
    [Asynch IO in particular is completely different.]
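
To make the point concrete, here is a hedged sketch (not from George's
post; map_ro is an invented name, and error handling is omitted) of how
even something as small as "map a file read-only" already needs two code
paths, before asynch I/O enters the picture at all:

    /* Two code paths for one tiny operation. */
    #ifdef _WIN32
    #include <windows.h>
    void *map_ro(const char *path, size_t *len)
    {
        HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        LARGE_INTEGER sz; GetFileSizeEx(f, &sz);
        HANDLE m = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
        *len = (size_t)sz.QuadPart;
        return MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    }
    #else
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    void *map_ro(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        struct stat st; fstat(fd, &st);
        *len = (size_t)st.st_size;
        void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        return p;
    }
    #endif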


    YMMV.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Jan 8 09:59:22 2024
    BGB <cr88192@gmail.com> writes:
    On 1/7/2024 3:30 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    So, rather than (V)LIW competing against OoO, maybe it can compete
    against in-order superscalar? ...

    Not in smartphones, where software compatibility is a required
    feature.


In smartphones, the program is typically being AOT'ed from a VM (such as Dalvik), rather than distributing binaries as native ARM code.

    From the POV of a Dalvik style VM, it shouldn't really matter that much.

    If all programs used just Dalvik, yes, you would "just" need to write
a Dalvik implementation for your VLIW. But the reality is that there are
enough programs that are written or have components distributed as
native code to make your non-ARM architecture uncompetitive, even with
a working binary translator.

Even there, the benefits of a common platform mean that the industry
is consolidating on ARM; e.g., Philips (now NXP) made the Trimedia
processors (VLIW), but terminated development in 2010. Some users,
such as WD, are defecting to RISC-V to avoid the ARM tax, but RISC-V still
    provides a common platform. Are you (or anyone else) able to provide
    a VLIW platform that outcompetes ARM and RISC-V?


    Trimedia (and the TMS320C6x) line differ partly in that they were true
    VLIW, rather than "LIW". So, in this case, I was imagining something
    more similar to the ESP32 (LX6) or Qualcomm Hexagon or similar.

    If even VLIW could not compete with ARM in a certain embedded niche,
    why should LIW?

    But, if RISC-V is run with similar restrictions on the pipeline, for
    some of the programs tested (such as Doom), it seems to require
    executing around twice as many instructions for a similar amount of work
    (*).

The design philosophy of RISC-V favours having simple instructions and
    combining them in the decoder over providing combined instructions, so
    one would expect more executed instructions for RV64G(C) than for ARM
    A64, which favours a fixed 32-bit format with instructions that do as
    much as fits in 32 bits (i.e., precombined instructions). But if
    combining in the decoder works, that does not mean that the programs
    take longer to execute with a similarly capable back-end.

    Though, this is not true of Dhrystone, where seemingly RISC-V executes
    fewer instructions.

    Than what?

    Here's the instruction counts I get for gforth-fast onebench.fs:

    2244358492 AMD64
    1897389481 ARM A64
    2170142765 rv64gc

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Mon Jan 8 10:21:21 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Sun, 07 Jan 2024 14:30:53 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> tried to write:

The competitive demands Intel and AMD face - the desires of us as consumers - are what prevents this from happening, and I see no hope for the world to change to what might be seen as the path of virtue in this area.

    Nobody forces you to replace your CPU with one with a denser process. If
    you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
    can get a Raspi 3, where the SoC is made in 40nm (according to
    <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    What I wrote didn't contradict what you are saying in your response.

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us
    to buy newer and faster microprocessors, by refusing to continue
    issuing security updates for Windows 7, or, for that matter,
    Windows XP, Windows 98, or even Windows 3.1.

    Windows 10 works on pretty old hardware. However, for Windows 11
tricks are required to make it run on anything but relatively recent hardware.

    As for "us", speak for yourself. If Microsoft does not support the
    hardware I own on any supported Windows, I certainly won't buy new
    hardware for it. It's only the game operating system for me.

    Instead, what I wrote said that we, as consumers, are so greedy
    for ever faster computers that we are the ones to blame for forcing
    Intel and AMD to resort to techniques that require expensive fabs
    to make the chips, and that require the chips to have enormous
    numbers of transistors for each individual core.

    There is certainly something to that, because the highest-performing
    CPUs are bought at a big premium compared to slightly slower ones.
    And in particular, you can buy cheap CPUs (like the Ryzen 5600G) or
    cheap systems like the GIGABYTE Brix GB-BMCE-4500C. But guess what,
    these also use TSMC 7nm or Intel 7 processes, so going for these
    processes does not seem that excessive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Mon Jan 8 14:31:00 2024
    mitchalsup@aol.com (MitchAlsup) writes:
    Thomas Koenig wrote:

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
.LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    How did 17 adds become 8 ??

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    It didn't. Thomas only showed the first cache line.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Mon Jan 8 14:39:26 2024
    BGB <cr88192@gmail.com> writes:
    On 1/7/2024 10:41 PM, George Neuner wrote:
    On Sat, 6 Jan 2024 11:48:40 -0600, BGB <cr88192@gmail.com> wrote:

    Windows merits:
    More software support;
    Has nearly all of the games;
No endless fights with trying to get the GPU and sound hardware working;
Much less needing to fight with hardware driver issues in general;

    - DLLs can have private heaps


    Pros/cons it seems. I would have considered this a con.

    Indeed. A huge con.

    FWIW, it's not that difficult to implement a private heap in unix
    if needed (e.g. a pool allocator built on brk() or mmap()).
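
As a rough illustration of the sort of thing Scott means (nothing he
posted; priv_heap_create/priv_alloc are made-up names, and this is only
a bump allocator on top of mmap()):

    #include <stddef.h>
    #include <sys/mman.h>

    struct priv_heap { char *base; size_t size; size_t used; };

    /* Reserve a private arena for one library with an anonymous mapping. */
    static int priv_heap_create(struct priv_heap *h, size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return -1;
        h->base = p; h->size = size; h->used = 0;
        return 0;
    }

    /* Bump allocation; a real pool would add a free list and growth. */
    static void *priv_alloc(struct priv_heap *h, size_t n)
    {
        n = (n + 15) & ~(size_t)15;          /* keep 16-byte alignment */
        if (n > h->size - h->used)
            return NULL;
        void *p = h->base + h->used;
        h->used += n;
        return p;
    }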

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Mon Jan 8 18:54:51 2024
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
    ..LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    How did 17 adds become 8 ??

    I didn't paste the rest because I felt it was irrelevant to the
    main point illustrated: The assembler will insert nops as
    required.

    (You might also note the lack of the final BLR).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Mon Jan 8 21:00:05 2024
    BGB wrote:

    On 1/8/2024 3:59 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:

    If even VLIW could not compete with ARM in a certain embedded niche,
    why should LIW?


    Because:
    True VLIW relies heavily on being able to extract a good amount of ILP,
    but falls on its face if not enough ILP is available;

    Note:: GBOoO machines can extract parallelism with instructions hundreds
of instructions apart, whereas VLIW compilers cannot. Most of this
parallelism is causal, not absolute (2 potentially aliasing addresses did not actually alias this iteration).

    A vaguely RISC-like LIW design (similar to ESP32 or similar), can still
    be performance competitive even with fairly meager ILP (where
    effectively it functions like a normal RISC just with explicit tagging
    rather than a superscalar fetch).

    But, if RISC-V is run with similar restrictions on the pipeline, for
    some of the programs tested (such as Doom), it seems to require
    executing around twice as many instructions for a similar amount of work >>> (*).

    The design philosphy of RISC-V favours having simple instructions and
    combining them in the decoder over providing combined instructions, so
    one would expect more executed instructions for RV64G(C) than for ARM
    A64, which favours a fixed 32-bit format with instructions that do as
    much as fits in 32 bits (i.e., precombined instructions). But if
    combining in the decoder works, that does not mean that the programs
    take longer to execute with a similarly capable back-end.


    Combining stuff in the decoder is expensive though...

    Only in power and area.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Mon Jan 8 21:22:43 2024
    I have made another change to the instruction formats, despite
    feeling that they are now largely finished, so that I'm ready
    to start defining the opcodes for all the instructions.

Earlier, since it was noted that my 15-bit short instructions
would be difficult for a compiler to work with, given the
    restriction that the source and destination registers belong to the
    same group of eight registers within a bank of 32 registers, I
    replaced them with 17-bit short instructions, only available within
    blocks of variable-length instructions.

    I've brought back the 15-bit short instructions, but now only within
    an alternate or supplementary set of 32-bit instructions. This way,
    the main benefit of getting rid of the 15-bit instructions from my
    perspective - avoiding any address mode restrictions on the basic
    load-store memory-reference instructions - is retained.

I felt that, as the restriction on the 15-bit instructions was a good
fit to the ISA having VLIW capabilities - implying that under some
circumstances a coding style suited to dealing with an exposed pipeline
would be used - they were still useful. Having short instructions
available within code composed of 32-bit instructions, rather than code
with variable-length instructions, would also promote compact code, and
would complement the earlier feature of composed instructions, which makes
instructions longer than 32 bits available without going to
variable-length instructions.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Jan 9 00:38:35 2024
    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within an alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths
    of wretched excess to be plunged into, I have now added two additional alternate sets of 32-bit instructions, for a total of three.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to quadibloc@servername.invalid on Mon Jan 8 21:55:30 2024
    On Mon, 8 Jan 2024 01:12:32 -0000 (UTC), Quadibloc <quadibloc@servername.invalid> wrote:

    On Mon, 08 Jan 2024 00:44:23 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us to
    buy newer and faster microprocessors, by refusing to continue issuing
    security updates for Windows 7, or, for that matter, Windows XP,
    Windows 98, or even Windows 3.1. Then I would be disagreeing with you,
    but I wasn't getting into that part of the issue.)

    I am calling Strawman on this::

    I am of the opinion that the SW that arrives with a box/laptop be the
    same over the lifetime of the product. I turn all updating of SW off and
    remove power at time I am not using the device to prevent MS from
    updating things I DON'T want updated--this includes security patches.

The fact that I am seeing your posts here means that you _have_ tried connecting a computer to the Internet. Which invalidates the first
    response to that which comes to mind.

Perhaps you use Linux or something. But as Windows users know very well through sad experience is that if you don't keep your computer up-to-date,
    to patch vulnerabilities in Windows as soon as they are discovered...
    your computer could end up infected within minutes of being connected to
    the Internet.

    John Savard

Worse than that ... if you don't keep Windows up to date, sooner or
    later you find some application software won't update, or won't work
    after it does update. And new software that won't install, or worse
    installs but won't run, because it depends on some feature introduced
    by a "minor" update.

    And if you do keep Windows up to date, you're likely to find devices
    that stop working.


    None of this requires an OS major *upgrade* - just an update.


    Recall the fun with NT4 SP1? How about SP3 or SP6a?
    2K with SP2?
    XP with SP1 and SP3?
    Win7 following the April 2015 service stack update?
    Win10 "versions" 1709 and 2004?

    [Didn't use Win8.x and not likely to touch Win11.]

    YMMV,
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jan 9 06:50:00 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    Ouch. This means that Power with prefixed instructions is the second
    instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed,

Or by inserting NOPs in the right places; otherwise you lose the functionality for Power10.

    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be. And the benefit we get
    from code-block copying and everything that builds on it far exceeds
    what the prefix instructions are likely to buy. E.g., on Power 10
    (numbers are times in seconds):

            sieve  bubble  matrix    fib    fft
            0.075   0.099   0.042  0.110  0.032   with code-block copying
            0.181   0.184   0.123  0.230  0.119   without code-block copying

    Fortunately, the assembler will do this for you:

    It does not, because we copy (binary) machine-code blocks.

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    Yes, we copy machine code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Tue Jan 9 07:01:12 2024
    BGB <cr88192@gmail.com> writes:
    On 1/7/2024 3:21 AM, Quadibloc wrote:
    But when it comes even to the humble low-end laptop, Intel found it
    necessary to redesign their Atom processor to be a lightweight OoO
    chip, instead of the in-order design it originally had.


    Though, to be fair:
    Without OoO, x86 performance is effectively dog-crap.

    Without OoO, performance is much lower on all architectures; e.g.,
    from our LaTeX benchmark:

    Alpha:
    i 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5) 8.1
    o Compaq XP1000 21264 500MHz 4M L2 (a7) 5.5

    AMD64:
    i Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
    o AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
    o Celeron J1900 (Silvermont) 2416MHz (Shuttle XS35V4) Ubuntu16.10 1.052
    o Xeon W-1370P (=Core i7-11700K), 5200MHz, Debian 11 (64-bit) 0.175

    ARM A64:
    i Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105
    o Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
    o Apple M1 Firestorm 3000MHz Asahi Linux Debian pre12 0.27

    "i" stands for in-order, and I present the best in-order result for
    the respective architecture. "o" stands for OoO, and I present both
    results with OoO cores with width comparable to the fastest in-order
    core on the same architecture, and the fastest core available.

    The 21164 and 21264 both are 4-wide. The Atom 330, E-450, and
    Silvermont are all 2-wide. The Cortex-A55 and Cortex-A73 are both
    2-wide.

    For many other ISA's, like 64-bit ARM, the performance holds up a lot
    better, and the up-front performance gains from in-order to OoO seems to
    be comparably smaller.

    Really? If we believe these results, the Cortex-A55 and the Intel
    Atom 330 show exactly the same performance/MHz (caveat: the Debian 11
    version of LaTeX, especially with the recommended extensions, probably
    does more work). The Cortex-A73's speedup over the A55 is slightly
smaller than that of the E-450 or Silvermont over the Atom 330, but
    then the A73 is a slightly older architecture than the A55 (and
    manufactured in a less advanced process), while the E-450 and
    Silvermont are younger (and manufactured in a more advanced process)
    than the Atom 330.

    Since, throw a crappy codegen at an x86, and it will happily accept it
    and run at nearly the same speed as the better codegen;

    What makes you think so?

    but throw it at
    an A53, and one find that it seemingly performs 3x-5x worse than the
    code that GCC produces

    I have seen speed differences by a factor of 3 or more between gcc -O0
    and gcc -O on various IA-32 and AMD64 implementations, as well as on
    other architectures.

I have noticed, though, that Intel and AMD engineers have worked over
    the years to get rid of some of the performance kinks that exist in
    older implementations and that are more often seen in implementations
    of other architectures. One example that I can think of is the
    performance of unaligned accesses. But code that keeps variables in
    memory rather than registers still is slow; yes, zero-cycle
    store-to-load forwarding helps, but even a Golden Cove can only
    perform IIRC 2 loads and 2 stores per cycle, whereas it can perform
    at least 10 (architectural) register reads and 5 register writes per
    cycle.
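
A tiny example of the kind of code where this shows up (my illustration,
not anything Anton posted; exact code generation of course varies by
compiler version):

    /* At -O0 gcc typically keeps i and sum in stack slots, so each
       iteration does several loads and stores; at -O1 and above they
       live in registers and the loop body is essentially one load
       plus one add. */
    long sum_array(const long *a, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }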

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Tue Jan 9 07:56:21 2024
    BGB <cr88192@gmail.com> writes:
    On 1/8/2024 3:59 AM, Anton Ertl wrote:
    If all programs used just Dalvik, yes, you would "just" need to write
    a Dalvik implementation for your VLIW. But reality is, that there are
    enough programs that are written or have components distributed as
    native code to make your non-ARM architecture uncompetetive, even with
    a working binary translator.


    Possibly so.

But, there were things like Atom-based Android devices, and stuff still worked there as well, so...

    Yes, and at one point we got a report from someone who used such a
    tablet, and saw ARM code when he asked Gforth to show the code for a
    primitive. It turned out that, despite Android having fat binaries
    (which include code for multiple architectures) or somesuch, and us
    building them, for some reason the ARM code was run under emulation
    rather than the native code for Intel CPU. The emulation apparently
    provided the functionality just fine, but I doubt that the performance
    was good in the case of Gforth (lots of indirect branches, a worst
    case for binary translators).

    In any case, Intel found that they could not compete (with a profit)
    in that area and stopped developing new SoCs for smartphones and
    tablets. Which supports my claim.

    Trimedia (and the TMS320C6x) line differ partly in that they were true
    VLIW, rather than "LIW". So, in this case, I was imagining something
    more similar to the ESP32 (LX6) or Qualcomm Hexagon or similar.

    If even VLIW could not compete with ARM in a certain embedded niche,
    why should LIW?


    Because:
    True VLIW relies heavily on being able to extract a good amount of ILP,
    but falls on its face if not enough ILP is available;
    A vaguely RISC-like LIW design (similar to ESP32 or similar), can still
    be performance competitive even with fairly meager ILP (where
    effectively it functions like a normal RISC just with explicit tagging
    rather than a superscalar fetch).

    Ok, so in embedded systems there are too few widely-used application
    scenarios with wide ILP to make VLIW development profitable. That
    would not surprise me. I guess, that just like SIMD was the form of
    explicit parallelism that provided a large part of the benefits in the
    niche where IA-64 shone, but with less cost, the same happened with
    TriMedia.

    Still, if VLIW could not compete, why should LIW? The benefit of not
    having to check for register dependencies is small for 2-wide CPUs.

    It seems to me that the thing that sells the ESP32 and ESP32-S2/S3 was
    not the architecture of their core but the SoCs they are in, and
    especially the Wi-Fi capability.

    The design philosphy of RISC-V favours having simple instructions and
    combining them in the decoder over providing combined instructions, so
    one would expect more executed instructions for RV64G(C) than for ARM
    A64, which favours a fixed 32-bit format with instructions that do as
    much as fits in 32 bits (i.e., precombined instructions). But if
    combining in the decoder works, that does not mean that the programs
    take longer to execute with a similarly capable back-end.


    Combining stuff in the decoder is expensive though...

    Apparently cheap enough that the RISC-V people decided that that is
    preferable to having more instructions or more addressing modes.

But, what if one can have something that is at least a little more performance competitive, but also "free and open" like RISC-V?

Performance is a property of the implementation, not the architecture. Especially these days. I expect that one could even make a performance-competitive VAX these days. Not that RISC-V poses the
    same kind of hurdles to the implementor as VAX.

    But, still kinda "glass cannon" performance on the A53, it seems to
    behave like it does something like:
    Look at two instructions;
    Can we run these in parallel?
    If yes, do so.
    If no, execute each sequentially.
    With full latency penalties if you try to load something and then
    immediately do arithmetic on it, ...

    Sure, that's what you get with an in-order implementation. You can
    twist the pipeline like the i486 and Pentium and especially the
    Bonnell (Atom) have done: On the Bonnell you can load a value and have
    a zero-cycle latency to the computation. But if you make a
    computation and use the result as address in a load, that costs IIRC 4
    cycles on the Bonnell.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Jan 9 09:25:10 2024
    On Tue, 09 Jan 2024 00:38:35 +0000, Quadibloc wrote:

    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within an
    alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths of wretched excess to be plunged into, I have now added two additional
    alternate sets of 32-bit instructions, for a total of three.

    I have now done something more important: after showing the prefix
    bits which make for Composed Instructions, I now show the formats
    of those instructions themselves on the page

    http://www.quadibloc.com/arch/cw010201.htm

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Tue Jan 9 08:47:51 2024
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.
    Ouch. This means that Power with prefixed instructions is the second
    instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed,
    Or by inserting NOPs in the right places; otherwise you lose the
    functionality for Power10.

    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be. And the benefit we get
    from code-block copying and everything that builds on it far exceeds
    what the prefix instructions are likely to buy. E.g., on Power 10
    (numbers are times in seconds):

    sieve bubble matrix fib fft
    0.075 0.099 0.042 0.110 0.032 with code-block copying
    0.181 0.184 0.123 0.230 0.119 without code-block copying

    Fortunately, the assembler will do this for you:

    It does not, because we copy (binary) machine-code blocks.

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    Yes, we copy machine code.

    - anton

    How about for POWER10 prefixed instructions always emit them as

    prefix
    inst
    nop

Then, when you copy the code block, check the 64B boundary.
If the prefix and inst cross it, then move the nop up and the prefix,inst pair down:

    nop
    prefix
    inst

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Tue Jan 9 16:37:16 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    [Code-block copying]
    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be.

    How about for POWER10 prefixed instructions always emit them as

    prefix
    inst
    nop

    Then when you copy the code block check the 64B boundary.
    If the prefix and inst cross it then move the nop up and prefix,inst down

    nop
    prefix
    inst

    As mentioned, the code blocks are opaque to the copying technique; the
    program that copies knows nothing about the instructions in the code
    block, and in particular it would not know whether it contains a
    Power3.1 prefix instruction and where. It also does not know whether
    it ends in a MIPS load instruction (another problematic case).

    Fortunately it is easy to avoid the prefix instructions altogether, so
    that's what we have done. The MIPS case is harder, and MIPS also
    causes other trouble, so we just disabled code-copying there. Maybe
    for more mainstream architectures we would have gone to greater
    lengths, but they no longer are mainstream.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Tue Jan 9 19:19:46 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    [Code-block copying]
    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be.

    How about for POWER10 prefixed instructions always emit them as

    prefix
    inst
    nop

    Then when you copy the code block check the 64B boundary.
    If the prefix and inst cross it then move the nop up and prefix,inst down

    nop
    prefix
    inst

    As mentioned, the code blocks are opaque to the copying technique; the program that copies knows nothing about the instructions in the code
    block, and in particular it would not know whether it contains a
    Power3.1 prefix instruction and where.

    The difficulty of recognizing a Power Prefix instruction is low: It
    has major opcode 1.

    However, changing the position of instructions requires handling
    relocations in branches, which is probably not what you want to do.

    I have to say that your application is the first one I ever
    heard about that just pastes binary blobs of executables
    together. How do you manage branches which exceed the normal
    range (or is this something that cannot happen)?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jan 9 22:26:24 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The difficulty of recognizing a Power Prefix instruction is low: It
    has major opcode 1.

    The difficulty of using -mno-prefixed is lower:-)

    However, changing the position of instructions requires handling
    relocations in branches, which is probably not what you want to do.

    If we wanted to cater to prefixed instructions, the way to go would be
    to insert enough noops before a code block containing a prefixed
    instruction that none of the prefixed instructions in the code block
    would violate the 64-byte-boundary restriction; at worst this means
    inserting as many noops as needed to have it aligned in the same way
    as in its original place.
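
In other words, the padding only has to reproduce the block's original
position within a 64-byte line; a sketch of that calculation (my
illustration, assuming 4-byte nops and 4-byte-aligned addresses):

    #include <stddef.h>
    #include <stdint.h>

    /* Number of 4-byte nops to emit before the copied block so that it
       starts at the same offset within a 64-byte line as the original. */
    static size_t pad_words(uintptr_t orig_start, uintptr_t dest_start)
    {
        size_t orig_off = (orig_start & 63) / 4;   /* word offset in line */
        size_t dest_off = (dest_start & 63) / 4;
        return (orig_off - dest_off) & 15;         /* 0..15 nops          */
    }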

    Concerning relocation, we copy only relocatable blocks. That is
    checked by compiling the same source code for the block twice, in two functions, with one function having padding between the blocks. If
    the resulting code blocks contain the same bytes, they are relocatable
    and can be used for this technique. If not, one has to fall back to
    jumping to the original code block for this piece of code.
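
The check itself then reduces to a byte comparison of the two compiled
copies; a minimal sketch (block1, block2 and len stand for whatever the
build machinery located, not Gforth's actual names):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* If the same source fragment compiled at two different offsets
       produced identical bytes, nothing in it depends on its absolute
       position, so the block may be copied elsewhere. */
    static bool block_is_relocatable(const unsigned char *block1,
                                     const unsigned char *block2,
                                     size_t len)
    {
        return memcmp(block1, block2, len) == 0;
    }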

    I have to say that your application is the first one I ever
    heard about that just pastes binary blobs of executables
    together. How do you manage branches which exceed the normal
    range (or is this something that cannot happen)?

    A code block may have internal branches, but otherwise all control
    flow is performed through indirect branches. The branch targets, just
like any other virtual-machine-level immediate data, are accessed
    through a VM instruction pointer.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 04:12:44 2024
    On Tue, 09 Jan 2024 09:25:10 +0000, Quadibloc wrote:

    On Tue, 09 Jan 2024 00:38:35 +0000, Quadibloc wrote:

    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within
    an alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths of
    wretched excess to be plunged into, I have now added two additional
    alternate sets of 32-bit instructions, for a total of three.

    I have now done something more important: after showing the prefix bits
    which make for Composed Instructions, I now show the formats of those instructions themselves on the page

    http://www.quadibloc.com/arch/cw010201.htm

    And now I'm getting even more serious about completing the description
    of instruction formats, so as to move on to listing the opcodes. On the
    page

    http://www.quadibloc.com/arch/cw01.htm

    I have now shown the layouts of the 32-bit forms of the short vector and
    long vector instructions.

    To find opcode space for them, I've had to put them in the third alternate
    set of 32-bit instructions, so a block header is required to access them.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 05:08:19 2024
    On Wed, 10 Jan 2024 04:43:29 +0000, Quadibloc wrote:

    If the first bit of an instruction prefix is 1, it will be a leftward
    decoded prefix. Leftward decoded prefixes allow opcode space to be
    shared between prefixes that make different kinds of modifications to an instruction depending on what kind of instruction it is.

    One way of trying to make it clear what a leftward-decoded prefix
    is about is to describe how it would look in the book describing the processor's instruction set.

    Rightward-decoded prefixes could be mostly described in their own
    section of the manual. Either they do the same thing to all
    instructions, or they create their own entirely new instruction set.

    On the other hand, in the section on leftward-decoded prefixes, all
    one could really give is the bit patterns that define a 16-bit prefix,
    a 32-bit prefix, a 48-bit prefix, and so on (should prefixes longer than
    48 bits ever be desired!)

    Instead, under the description of *each instruction*, there would be
    a section saying "if a 16-bit prefix is applied to this instruction,
    it will have these fields in this order, and their functions will be..."
    and the same for any other prefix length that applies to the instruction.

    That way, the complexity of the instruction set doesn't have to be
    duplicated in additional bits in the prefixes themselves: the instruction prefixes the prefix before the now-defined prefix prefixes the instruction.

    Exactly in what way is instruction decoding supposed to be "simple"
    in Concertina II, you might ask. In this way: decoding is strictly
    linear. Not linear in the sense of a straight-line; no, the path of
    decoding may be truly labyrinthine. But when you fetch a 256-bit block,
    you start by checking for a header, and then you do exactly what it
    tells you to do... at each point, the decoder is told what to do, and
    where to look next.

    No backtracking, no speculation.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 04:43:29 2024
    On Tue, 09 Jan 2024 00:38:35 +0000, Quadibloc wrote:

    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within an
    alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths of wretched excess to be plunged into, I have now added two additional
    alternate sets of 32-bit instructions, for a total of three.

    As I've noted, the third alternate set of 32-bit instructions has
    turned out to be quite useful, as it finally provided the opcode
    space needed for short vector and long vector instructions!

    I also added a mechanism whereby these additional sets of 32-bit
    instructions may be used from within blocks of variable-length
    instructions. A set of fourteen *convert* bits indicate, when
    set to 1, that the corresponding 16-bit part of the block is the
    start of a 32-bit instruction - and the *prefix* bits of the
    portion of the header that made the block a block of variable-length instructions which correspond to that part of the block now indicate
    which 32-bit instruction set the instruction belongs to, instead
    of having their usual function.

    Having the convert bit equal to 1 and the prefix bits equal to 00
    would be a redundant way of indicating a regular 32-bit instruction,
    already indicated by the prefix bits equal to 10 when the convert
    bit is not used.

    Originally, I noted that this _could_ be used instead for an extra
    set of 32-bit instructions, unique to blocks of variable-length
    instructions. However, I also thought that this use would be kind of extravagant and silly.

    Now I've come up with a way to use this bit combination that is instead
    more specifically relevant to blocks of variable-length instructions.

    When the convert bit is 1, and the prefix bits are 00, let that indicate
    that the 16 bits referenced are the start of an _instruction prefix_.

    The instruction being prefixed will have to have its own prefix bits set
    to 11 all the way through, including at its first 16 bits, so that no
    attempt will be made to decode it without taking the instruction prefix
    into account.
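
    A small C sketch of that rule table as read from the description above;
    the names are invented for illustration, and only the two-bit prefix
    field values are taken from the text.

      enum Kind { PREFIX_BITS_USUAL, ALT_32BIT_INSN, INSN_PREFIX };

      /* convert = 0          -> the prefix bits keep their usual function
         convert = 1, p != 00 -> start of a 32-bit instruction from an alternate set
         convert = 1, p == 00 -> start of an instruction prefix                     */
      static enum Kind classify(int convert, unsigned prefix2 /* two bits */) {
          if (!convert)
              return PREFIX_BITS_USUAL;
          if (prefix2 == 0)
              return INSN_PREFIX;
          return ALT_32BIT_INSN;        /* prefix2 selects which alternate set */
      }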

    Because I want decoding to be extremely straightforward in Concertina II,
    aside from the complexity caused by the vast number of instruction formats,
    I have realized that I am going to need to define two general categories
    of instruction prefixes.

    If the first bit of an instruction prefix is 0, it will be a rightward
    decoded prefix. Such a prefix can have functions like: selecting a set
    of additional instructions completely unrelated to anything existing in
    the ISA, or just doing something extremely simple, like adding opcode bits
    to whatever instruction it follows.

    If the first bit of an instruction prefix is 1, it will be a leftward
    decoded prefix. Leftward decoded prefixes allow opcode space to be shared between prefixes that make different kinds of modifications to an
    instruction depending on what kind of instruction it is. Such prefixes
    are useful for dealing with the cases where I had to severely limit the addressing modes of a kind of instruction to fit it into 32 bits; instead
    of having bits in the prefix to indicate which of a large number of cases
    of this the prefix addresses, it can be indicated by the nature of the instruction being modified.
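
    As a one-line illustration of the split (assuming "first bit" means the
    most significant bit of the prefix's first 16 bits, which is a guess):

      #include <stdint.h>

      enum Dir { RIGHTWARD_DECODED, LEFTWARD_DECODED };

      static enum Dir prefix_direction(uint16_t first_halfword) {
          return (first_halfword & 0x8000u) ? LEFTWARD_DECODED : RIGHTWARD_DECODED;
      }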

    Both leftward-decoded and rightward-decoded prefixes may be longer than
    16 bits (in the case of the leftward decoded kind, this has to be
    decoded within the prefix before leftward decoding starts) and may act
    on instructions of lengths longer than 32 bits. But they may not act
    on 17-bit instructions, since the prefix field corresponding to the
    start of a prefixed instruction is forced to be 11, and thus is not
    available to indicate the presence of a 17-bit instruction (as well
    as its first bit).

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 07:48:57 2024
    On Wed, 10 Jan 2024 05:08:19 +0000, Quadibloc wrote:

    the
    instruction prefixes the prefix before the now-defined prefix prefixes
    the instruction.

    Which may make you think of this famous work of art:

    https://www.artchive.com/artwork/drawing-hands-maurits-cornelis-escher-1948/

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 14:56:17 2024
    On Wed, 10 Jan 2024 04:43:29 +0000, Quadibloc wrote:

    When the convert bit is 1, and the prefix bits are 00, let that indicate
    that the 16 bits referenced are the start of an _instruction prefix_.

    The instruction being prefixed will have to have its own prefix bits set
    to 11 all the way through, including at its first 16 bits, so that no
    attempt will be made to decode it without taking the instruction prefix
    into account.

    I have decided to indeed add instruction prefixes for use with
    variable-length instructions to the instruction set, but *not* to
    require the use of the additional header with the "convert" bit
    for them.

    Instead, instruction prefixes will use some of the unused space at the
    end of the opcode space of the 17-bit short instructions.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 17:02:20 2024
    On Wed, 10 Jan 2024 14:56:17 +0000, Quadibloc wrote:

    I have decided to indeed add instruction prefixes for use with variable-length instructions to the instruction set, but *not* to
    require the use of the additional header with the "convert" bit
    for them.

    Instead, instruction prefixes will use some of the unused space at the
    end of the opcode space of the 17-bit short instructions.

    At least, in a minor, token victory for sanity, I decided that instruction prefixes longer than 16 bits (17 bits? 13 bits?) will not be entertained.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 16:57:38 2024
    I do not see any hope for ISA excellence.
    Why? MY 66000 exists, and it is excellent.
    Though, the real proof would be if it can be implemented effectively on
    a typical Spartan or Artix class FPGA and also deliver on some of the other claims while doing so (and at a decent clock speed).

    History has shown (RISC-vs-CISC being a prime example) that changes to
    the underlying technology affect which ISA performs best.
    I have the impression that My 66000 is probably not best suited for
    an FPGA implementation.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 17:17:10 2024
    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline

    Sorry, what's "Calk"?

    Oh, and what's "BR" (oh, wait, do you mean that the two "Label"s don't
    have to be the same, so you're talking about calling Label1 and setting
    the return address to Label2? Right, yes, that must be it, sorry for
    being dense).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Jan 10 22:32:50 2024
    Stefan Monnier wrote:

    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline

    Sorry, what's "Calk"?

    A calculation instruction {ADD, AND, ...}

    Oh, and what's "BR" (oh, wait, do you mean that the two "Label"s don't
    have to be the same, so you're talking about calling Label1 and setting
    the return address to Label2? Right, yes, that must be it, sorry for
    being dense).

    Yes, call somewhere and change the return address to that of the BR.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 18:10:12 2024
    In time, something surely will happen to change matters, and new
    computer architectures will rise up to prominence. Right now, though,
    signs of movement away from x86 to something else are few.

    Really? AFAIK x86 is mostly popular for "personal computers", but the
    21st century has moved back to "mainframes" (farms of servers, where x86
    is still common, but ARM is a serious competitor), accessed from
    "weak" devices (smartphones and tablets, mostly using ARM), via network
    devices (using a variety of dedicated hardware, where x86 doesn't seem particularly popular).

    The x86 is not about to disappear, but I think there is a clear movement
    away from it.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jan 11 08:02:18 2024
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    History has shown (RISC-vs-CISC being a prime example) that changes to
    the underlying technology affect which ISA performs best.

    Has it? RISCs were first (in general-purpose computing) in the 1980s
    with pipelining, first with in-order superscalar implementations in
    the early 1990s (SuperSPARC, 88110, 21064), but OOO was introduced in
    IA-32 (with the Pentium Pro) one day earlier than in HPPA (PA-8000).

    The underlying technology may have made the architectural advantages
    of RISCs smaller, but I think that economic and project management
    aspects caused IA-32 to gain the performance advantage that they have
    had between about 2000-2020. The economic advantage was that there
    was more revenue in IA-32 than especially in the split landscape of
    the RISCs.

    Concerning project management, the early RISC implementations could be
    designed by small teams quickly. Designing superscalar and OoO
    implementations meant that teams had to become larger, and the
    projects longer, and project delays showed the teething problems that
    some of the teams had.

    And given these problems, several of the companies were just too happy to
    jump ship when the supposed saviour in the form of IA-64 appeared.

    The funny thing is that HP, MIPS and DEC already had OoO CPUs with
    SIMD instructions (the technologies that outcompeted IA-64). Maybe if
    Intel had made a cleaned-up 64-bit i960 instead of IA-64 as the one architecture to rule them all, and then did an OoO implementation of
    that, we all would be using i960-64 nowadays; but I guess that IA-64
    had the better roadmaps, and that it was easier to convince people in
    HP, MIPS and DEC with IA-64 with its promising new features, while
    just another RISC would have caused the reaction "we already have a
    fine 64-bit RISC, why switch to i960-64"? OTOH, switching to another
    RISC worked for Motorola.

    The irony is that the i960 team was redirected in 1990 to design the
    P6 (Pentium Pro).

    Anyway, back to the original question: Their economic model caused ARM
    T32 (and later A64) to become *the* smartphone architecture, and the
    demands of smartphone apps caused economic pressure for higher
    performance at low power, and despite Intel's attempts to break into
    that market, they failed to make substantial inroads and eventually
    gave up; this may have been due to network effects, but they also seem
    to have problems reaching comparable performance at mobile power
    points.

    And if you compare the performance of the Apple Firestorm (ARM A64) to
    Intel and AMD P-cores at comparable power points, Firestorm looks
    pretty good. Even comparing to CPUs with a desktop power budget, the
    Firestorm does not fall far behind <https://images.anandtech.com/doci/16192/spec2006_A14_575px.png> (from <https://www.anandtech.com/show/16192/the-iphone-12-review/2>; I think
    Andrei Frumusanu provided other measurements where the distance was
    even smaller; ah, there we are: <https://images.anandtech.com/doci/16983/SPECint-energy_575px.png>
    from <https://www.anandtech.com/print/16983/the-apple-a15-soc-performance-review-faster-more-efficient>;
    note how he does not give Joule results for the AMD64
    implementations). Maybe the people who designed the Firestorm are
    more capable than those who designed Skylake and Zen3, or maybe the
    advantages of the ARM A64 architecture allow them to provide more
    performance than AMD64 at the same power.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Thu Jan 11 19:42:11 2024
    BGB <cr88192@gmail.com> writes:
    On 1/11/2024 2:02 AM, Anton Ertl wrote:


    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline. It is also comparably infrequent, as "*ptr++" isn't
    used *that* often.

    There are certainly far more use cases for post increment than
    *ptr++ (which is actually a fairly common construct in library code
    like the string functions or a naive memcpy).

    ARM64, for example uses it to push the frame pointer and link
    register in a function prologue.

    401008: a9b87bfd stp x29, x30, [sp,#-128]!

    Other examples (from the compiled elf for the coremark program):

    40edf0: a9c4382d ldp x13, x14, [x1,#64]!
    40ee9c: f85f8c23 ldr x3, [x1,#-8]!
    40eea0: f81f8cc3 str x3, [x6,#-8]!
    40eea8: b85fcc23 ldr w3, [x1,#-4]!
    40eeac: b81fccc3 str w3, [x6,#-4]!
    40eeb4: 785fec23 ldrh w3, [x1,#-2]!
    40eeb8: 781fecc3 strh w3, [x6,#-2]!
    40eedc: f85f8c23 ldr x3, [x1,#-8]!
    40eee0: f81f8cc3 str x3, [x6,#-8]!
    40eee8: b85fcc23 ldr w3, [x1,#-4]!
    40eeec: b81fccc3 str w3, [x6,#-4]!
    40eef4: 785fec23 ldrh w3, [x1,#-2]!
    40eef8: 781fecc3 strh w3, [x6,#-2]!
    40ef00: 385ffc23 ldrb w3, [x1,#-1]!
    40ef04: 381ffcc3 strb w3, [x6,#-1]!
    40ef18: a9fc2027 ldp x7, x8, [x1,#-64]!
    40ef28: a9bc20c7 stp x7, x8, [x6,#-64]!
    40ef8c: a9fc382d ldp x13, x14, [x1,#-64]!
    40efa8: a9bc38cd stp x13, x14, [x6,#-64]!
    40efac: a9fc382d ldp x13, x14, [x1,#-64]!
    40efc4: a9bc38cd stp x13, x14, [x6,#-64]!
    40f110: a9c3382d ldp x13, x14, [x1,#48]!
    40f12c: a98438cd stp x13, x14, [x6,#64]!
    40f130: a9c4382d ldp x13, x14, [x1,#64]!
    40f234: a9841d07 stp x7, x7, [x8,#64]!
    40f350: a9bd7bfd stp x29, x30, [sp,#-48]!
    40f410: a9bb7bfd stp x29, x30, [sp,#-80]!
    40f4f8: a9b97bfd stp x29, x30, [sp,#-112]!
    40f748: a9bd7bfd stp x29, x30, [sp,#-48]!
    40f778: a9bb7bfd stp x29, x30, [sp,#-80]!
    40f8c8: b8404cc5 ldr w5, [x6,#4]!
    40f918: b85fcc22 ldr w2, [x1,#-4]!
    40f940: a9bb7bfd stp x29, x30, [sp,#-80]!
    40fa50: a9ba7bfd stp x29, x30, [sp,#-96]!
    40fbe0: b85fcc43 ldr w3, [x2,#-4]!
    40fbe4: b85fcc24 ldr w4, [x1,#-4]!
    40fc10: a9bb7bfd stp x29, x30, [sp,#-80]!
    40fd20: b85fcc81 ldr w1, [x4,#-4]!
    40fdd8: a9ba7bfd stp x29, x30, [sp,#-96]!
    40ff20: a9ba7bfd stp x29, x30, [sp,#-96]!
    40ff6c: b81f8c14 str w20, [x0,#-8]!
    410050: a9bb7bfd stp x29, x30, [sp,#-80]!
    4101e8: b85fcc20 ldr w0, [x1,#-4]!
    410248: a9b87bfd stp x29, x30, [sp,#-128]!
    4103c0: f8410c60 ldr x0, [x3,#16]!
    410510: f8410c60 ldr x0, [x3,#16]!
    410644: f8410ec0 ldr x0, [x22,#16]!
    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Terje Mathisen on Fri Jan 12 04:09:30 2024
    On Mon, 13 Nov 2023 16:10:20 +0100, Terje Mathisen wrote:
    MitchAlsup wrote:
    Chris M. Thomasson wrote:

    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.

    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
    container. Only aligned containers possess ATOMIC-smelling properties.

    This is so obviously correct that you should not have needed to mention
    it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
    updates is something that should only ever be done for testing purposes.

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a
    single-byte operand, to avoid the issue.

    However, qualifications are needed to make the statement "obviously
    correct". Basically, one should never attempt an atomic operation on
    an unaligned value in memory... on a machine that does paging. Because
    the unaligned value _might_ cross a page boundary.

    Otherwise, there's no problem. And a computer certainly _could_ be
    aware that precautions are needed for atomic instructions, and
    proceed with their execution only after all the memory pages involved
    were brought into memory, and locked there. That would still mean
    the computer would be slowed unnecessarily, but error-free operation
    can be guaranteed.

    So if someone wanted, they could design a computer which didn't mind
    atomic operations on unaligned values all that much.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Fri Jan 12 08:11:06 2024
    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in
    these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these
    concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline.

    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and instructions that read four registers.

    It is also comparably infrequent, as "*ptr++" isn't
    used *that* often.

    Out of 192106 instructions in /bin/bash (in Debian 11), there are 1829 instructions with pre-increment "]!"; most of them are stp (store
    pair) instructions, and the increment is usually negative and often
    smaller than the size of the two registers, the address register is
    usually sp. So the usual use seems to be for saving caller-saved or callee-saved registers.

    Out of these 195815 instructions, 3002 use post-increment "],"; most
    of them are ldp (load pair) instructions, and the increment is usually positive, and the address register is usually sp (2688 cases). So
    most of these cases seem to be due to loading caller-saved registers
    after the call or callee-saved registers before the return.

    Overall, there are 25197 loads and stores that use sp as address
    register, out of 61387 loads and stores.

    [a76:~:536] objdump -d /bin/bash|grep "^ "|wc -l
    192106
    [a76:~:537] objdump -d /bin/bash|grep ']!'|wc -l
    1829
    [a76:~:538] objdump -d /bin/bash|grep '[[][a-z].*],'|wc -l
    3002
    [a76:~:539] objdump -d /bin/bash|grep 'sp],'|wc -l
    2688
    [a76:~:540] objdump -d /bin/bash|grep '[[]sp'|wc -l
    25197
    [a76:~:541] objdump -d /bin/bash|grep '[[][a-z]'|wc -l
    61387

    ALU status flags:
    The flags themselves are fairly rarely used in practice,

    Conditional branches tend to be quite frequent.

    but the cost of
    keeping these sorts of flags consistent in the pipeline is not so cheap.

    Intel uses as many physical flags registers as physical integer
    registers (280 each on Tigerlake and Golden Cove), ARM somewhat less
    than the integer registers. And the register renamer needs to keep
    track of them separately (for AMD64 it needs to keep track of C, O,
    and NZP separately; I expect that A64 is better in this respect).
    Yes, not cheap, but obviously manageable.

    And, possibly the cost difference between a 1-bit status flag and, say,
    4 or 5 flag bits, isn't that large. In either case, may make sense to
    limit which instructions may update flags (unlike x86)

    Actually, updating all flags in every instruction would make the
    implementation easier, too: every flag-using instruction would only be
    able to use the result of the previous instruction, no need to store
    flags longer. However, I guess it might be harder to program with
    such a model.

    My take is that GPRs should have additional carry and overflow flags
    (which are not stored and loaded with the usual store and load
    instructions); they have the information of the N and Z flags already.
    This makes tracking the flags easy, and also allows programs to deal
    with multiple live carry flags, as needed for multi-precision
    multiplication.
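
    A sketch of why several live carries help, in plain C. This is generic
    bignum code, not anything specific to a particular ISA; the __int128
    multiply is a common compiler extension standing in for a widening
    multiply instruction.

      #include <stdint.h>

      /* Accumulate one 64x64->128 partial product into a 3-limb accumulator.
         Two independent carries (c0 out of limb 0, c1 out of limb 1) are live
         at the same time, which a single architectural carry flag cannot hold. */
      static void acc_partial(uint64_t a, uint64_t b, uint64_t acc[3]) {
          unsigned __int128 p = (unsigned __int128)a * b;
          uint64_t lo = (uint64_t)p, hi = (uint64_t)(p >> 64);
          uint64_t c0, c1;
          acc[0] += lo;   c0 = acc[0] < lo;       /* carry out of limb 0 */
          acc[1] += hi;   c1 = acc[1] < hi;       /* carry out of limb 1 */
          acc[1] += c0;   c1 |= acc[1] < c0;      /* fold the limb-0 carry in */
          acc[2] += c1;
      }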

    and possibly only
    allow them in "lane 1" or whatever the equivalent is (the secondary ALUs
    only doing non-flags-updating forms).

    That's an implementation issue. Do ARM A64 implementations have such restrictions?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Fri Jan 12 15:14:00 2024
    On Fri, 12 Jan 2024 08:11:06 +0000, Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    As even the PDP-8 had auto-increment addressing, certainly
    its cost must be bearable, if cost is thought of as the
    number of transistors required to implement it. Although
    that feature seems odd for a RISC design even to me.

    As for ALU status flag bits, I think they're a feature that
    should be kept. But one concern with them relates to the
    same reason that some early RISC architectures had branch
    delay slots.

    So a common way in which this concern is mitigated is for
    RISC architectures to include, in instructions that can
    affect the condition codes, a bit that controls whether or
    not they do so. That way, other operate instructions can
    be placed between an instruction that sets the condition
    codes and the branch instruction that tests them.

    The PowerPC architecture went further, also perhaps
    addressing another concern with ALU status bits, by
    having multiple sets of condition codes, so that the
    condition codes would behave more like registers,
    rather than being a unique resource.

    To my mind, ALU status bits are at least essential
    for things like add-with-carry for multiple-precision
    arithmetic. Otherwise, one would need multiple
    awkward instructions to perform the same function.

    And since RISC typically has only load and store
    memory-reference instructions, thus limiting each
    instruction to one basic action, a design that
    apparently forces operate instructions to be
    combined with conditional branch instructions
    seems to be the opposite of RISC. I presume that
    they _don't_ solve the problem by including a
    conditional skip in operate instructions, that
    can skip over a jump instruction that follows
    them (sort of like a PDP-8!)... clearly, I'll
    need to take another look at the MIPS and/or
    the Alpha to see what it is that they _are_
    doing, to understand how it fits into the
    RISC philosophy.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Jan 12 15:19:11 2024
    On Fri, 12 Jan 2024 15:14:00 +0000, Quadibloc wrote:

    clearly, I'll
    need to take another look at the MIPS and/or
    the Alpha to see what it is that they _are_
    doing, to understand how it fits into the
    RISC philosophy.

    Oh, silly me. I remembered shortly after: what
    is combined with a branch to make a conditional
    branch is not an operate instruction, but a
    test of the contents of a specified register.

    That's clearly basic enough to fit with RISC,
    but if the carry out from an operation is what
    you want to test, then awkwardness ensues.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jan 12 15:53:37 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline.

    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Fri Jan 12 17:08:14 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    Oh, silly me. I remembered shortly after: what
    is combined with a branch to make a conditional
    branch is not an operate instruction, but a
    test of the contents of a specified register.

    MIPS, Alpha and RISC-V all have slightly different answers here:

    RISC-V has compare-and-branch instructions, and also compare
    instructions slt and sltu that produce 0 or 1.

    Alpha has compare instructions cmpeq cmple cmplt (slt on RISC-V)
    cmpule cmpult (sltu on RISC-V) that produce 0 or 1 and branch
    instructions that compare with 0.

    MIPS has slt and sltu, and a compare-equal-and-branch instruction.

    That's clearly basic enough to fit with RISC,
    but if the carry out from an operation is what
    you want to test, then awkwardness ensues.

    carry = sum<operand1

    That's one sltu/cmpult instruction.

    However, an add with carry-in carry-out is five instructions on these architectures.
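
    Spelled out in C, with the comments counting the instructions a
    MIPS/Alpha/RISC-V style compiler would emit; a sketch, assuming the
    operands are already in registers, using the same sum<operand1 trick.

      #include <stdint.h>

      /* 64-bit add with carry-in and carry-out, flag-free style */
      static uint64_t adc64(uint64_t a, uint64_t b, uint64_t cin, uint64_t *cout) {
          uint64_t t   = a + b;        /* 1: add                         */
          uint64_t c1  = t < a;        /* 2: sltu / cmpult               */
          uint64_t sum = t + cin;      /* 3: add                         */
          uint64_t c2  = sum < t;      /* 4: sltu / cmpult               */
          *cout = c1 | c2;             /* 5: or (at most one can be set) */
          return sum;
      }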

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 12 18:11:17 2024
    Quadibloc wrote:

    On Mon, 13 Nov 2023 16:10:20 +0100, Terje Mathisen wrote:
    MitchAlsup wrote:
    Chris M. Thomasson wrote:

    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.

    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
    container. Only aligned containers possess ATOMIC-smelling properties.

    This is so obviously correct that you should not have needed to mention
    it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
    updates is something that should only ever be done for testing purposes.

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a single-byte operand, to avoid the issue.

    However, qualifications are needed to make the statement "obviously
    correct". Basically, one should never attempt an atomic operation on
    an unaligned value in memory... on a machine that does paging. Because
    the unaligned value _might_ cross a page boundary.

    Even crossing a line boundary exposes interested 3rd parties to
    intermediate state. Consider a line-spanning access to 0x1234567E while at
    the same time there is an access to 0x12345680. The first half of the
    access to 0x1234567E takes place in parallel with the access to 0x12345680
    and then the second half of the access to 0x1234567E is performed. This
    is not ATOMIC in any sense of the word.
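
    The arithmetic behind that example, assuming a 64-byte line (the line
    size is an assumption; the addresses are the ones above):

      #include <stdint.h>
      #include <stdio.h>

      static int straddles(uint64_t addr, unsigned size, unsigned line) {
          return (addr / line) != ((addr + size - 1) / line);
      }

      int main(void) {
          /* an 8-byte access at 0x1234567E touches two 64-byte lines,
             so another agent can slip in between the two halves */
          printf("%d\n", straddles(0x1234567EULL, 8, 64));   /* prints 1 */
          printf("%d\n", straddles(0x12345680ULL, 8, 64));   /* prints 0 */
          return 0;
      }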

    Otherwise, there's no problem. And a computer certainly _could_ be
    aware that precautions are needed for atomic instructions, and
    proceed with their execution only after all the memory pages involved
    were brought into memory, and locked there. That would still mean
    the computer would be slowed unnecessarily, but error-free operation
    can be guaranteed.

    So if someone wanted, they could design a computer which didn't mind
    atomic operations on unaligned values all that much.

    Adding lots of complexity for things that should never happen.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Fri Jan 12 17:57:17 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    As even the PDP-8 had auto-increment addressing, certainly
    its cost must be bearable, if cost is thought of as the
    number of transistors required to implement it.

    What may be cheap in a 10CPI=0.1IPC implementation may be expensive in
    an implementation that tries to support up to 10IPC (as in Cortex-X4).
    In particular, widely ported register files are expensive. But
    apparently ARM, Apple, and others have found ways to implement
    auto-increment at bearable cost.

    Although
    that feature seens odd for a RISC design even to me.

    Not particularly: ARM A32, HPPA, Power, and ARM A64 have
    auto-increment.

    To my mind, ALU status bits are at least essential
    for things like add-with-carry for multiple-precision
    arithmetic.

    It's not essential to have them separate from the rest of the results
    of the computation. And if you attach carry and overflow bits to the
    register that contains the rest of the results of the computation,
    that has advantages: Tasks like multi-precision multiplication that
    benefit from having several carry bits become easier to write. And
    you can easier deal with these bits in compilers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Fri Jan 12 18:19:58 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >>instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    Yes, I think there are crypto instructions or somesuch that handle a
    lot of registers in one instruction, and I guess that they take
    multiple cycles, so accessing many registers can be distributed across
    multiple cycles.

    What I had in mind were store-pair with [reg+reg] addressing (4
    reads), and load-pair with auto-increment (three writes), and I expect
    these instructions to be fast, as in: if the microarchitecture allows
    two stores per cycle, I expect to be able to do two store-pair with
    [reg+reg], and likewise for load-pair on a microarchitecture that
    allows three loads per cycle.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Jan 12 18:24:39 2024
    Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline.

    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and instructions that read four registers.

    It is also comparably infrequent, as "*ptr++" isn't
    used *that* often.

    Out of 192106 instructions in /bin/bash (in Debian 11), there are 1829

    1%

    instructions with pre-increment "]!"; most of them are stp (store
    pair) instructions, and the increment is usually negative and often
    smaller than the size of the two registers, the address register is
    usually sp. So the usual use seems to be for saving caller-saved or callee-saved registers.

    Out of these 195815 instructions, 3002 use post-increment "],"; most

    1½%

    of them are ldp (load pair) instructions, and the increment is usually positive, and the address register is usually sp (2688 cases). So
    most of these cases seem to be due to loading caller-saved registers
    after the call or callee-saved registers before the return.

    Overall, there are 25197 loads and stores that use sp as address
    register, out of 61387 loads and stores.

    So we have a use case where the designers were willing to <ahem>
    burden their ISA with pre/post-increment/decrement for a 2½% use
    while simultaneously screwing with their register porting, and
    creating an artificial data-dependency on future access to the
    stack (sp).

    I will gently suggest one can do better.....

    [a76:~:536] objdump -d /bin/bash|grep "^ "|wc -l
    192106
    [a76:~:537] objdump -d /bin/bash|grep ']!'|wc -l
    1829
    [a76:~:538] objdump -d /bin/bash|grep '[[][a-z].*],'|wc -l
    3002
    [a76:~:539] objdump -d /bin/bash|grep 'sp],'|wc -l
    2688
    [a76:~:540] objdump -d /bin/bash|grep '[[]sp'|wc -l
    25197
    [a76:~:541] objdump -d /bin/bash|grep '[[][a-z]'|wc -l
    61387

    ALU status flags:
    The flags themselves are fairly rarely used in practice,

    Conditional branches tend to be quite frequent.

    15%-ish with 3% BR and 2% CALL/RET and 1% JMP (switches)

    but the cost of
    keeping these sorts of flags consistent in the pipeline is not so cheap.

    SPARC showed the way. If you are going to have CCs, have instructions that
    do not set the CCs distinct from those that do set CC.

    x86 showed what not to do:: C-O-ZAPS must be tracked separately in the
    pipeline for efficient use of CC and conditional branching.

    Intel uses as many physical flags registers as physical integer
    registers (280 each on Tigerlake and Golden Cove), ARM somewhat less
    than the integer registers. And the register renamer needs to keep
    track of them separately (for AMD64 it needs to keep track of C, O,
    and NZP separately; I expect that A64 is better in this respect).
    Yes, not cheap, but obviously manageable.

    In AMD's versions there is more reservation station logic used to track
    CC as 3 independent containers than is used to track the operand
    registers themselves.

    And, possibly the cost difference between a 1-bit status flag and, say,
    4 or 5 flag bits, isn't that large. In either case, may make sense to
    limit which instructions may update flags (unlike x86)

    Actually, updating all flags in every instruction would make the implementation easier, too: every flag-using instruction would only be
    able to use the result of the previous instruction, no need to store
    flags longer. However, I guess it might be harder to program with
    such a model.

    This is what SPARC showed.

    My take is that GPRs should have additional carry and overflow flags
    (which are not stored and loaded with the usual store and load
    instructions); they have the information of the N and Z flags already.
    This makes tracking the flags easy, and also allows programs to deal
    with multiple live carry flags, as needed for multi-precision
    multiplication.

    Completely unnecessary and insufficient at the same time.

    and possibly only
    allow them in "lane 1" or whatever the equivalent is (the secondary ALUs >>only doing non-flags-updating forms).

    That's an implementation issue. Do ARM A64 implementations have such restrictions?

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 12 18:55:38 2024
    Quadibloc wrote:



    As for ALU status flag bits, I think they're a feature that
    should be kept. But one concern with them relates to the
    same reason that some early RISC architectures had branch
    delay slots.

    So a common way in which this concern is mitigated is for
    RISC architectures to include, in instructions that can
    affect the condition codes, a bit that controls whether or
    not they do so. That way, other operate instructions can
    be placed between an instruction that sets the condition
    codes and the branch instruction that tests them.

    Having condition codes in GPRs gets you as many codes as
    you can every use and for free !! You make CMP instructions
    return a bit-vector to a GPR and you have branch on bit
    instructions--and you still don't need the cruft of CCs
    in your ISA.
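
    A rough C model of that style. The predicate bit assignments below are
    made up for illustration; My 66000's actual CMP result layout may differ.

      #include <stdint.h>

      enum { BIT_EQ = 0, BIT_LT = 1, BIT_LTU = 2, BIT_GT = 3 };  /* invented layout */

      /* CMP: one instruction; the result is an ordinary GPR full of predicates */
      static uint64_t cmp_bits(int64_t a, int64_t b) {
          uint64_t r = 0;
          r |= (uint64_t)(a == b)                    << BIT_EQ;
          r |= (uint64_t)(a <  b)                    << BIT_LT;
          r |= (uint64_t)((uint64_t)a < (uint64_t)b) << BIT_LTU;
          r |= (uint64_t)(a >  b)                    << BIT_GT;
          return r;
      }

      /* BBit Rt,label then amounts to: if (r & (1ull << BIT_LT)) goto label;
         and any number of such results can be live in GPRs at once. */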

    The PowerPC architecture went further, also perhaps
    addressing another concern with ALU status bits, by
    having multiple sets of condition codes, so that the
    condition codes would behave more like registers,
    rather than being a unique resource.

    CCs in GPRs means you have as many CC sets as your compiler
    can use.

    To my mind, ALU status bits are at least essential
    for things like add-with-carry for multiple-precision
    arithmetic. Otherwise, one would need multiple
    awkward instructions to perform the same function.

    Or a CARRY instruction-modifier.

    And since RISC typically has only load and store
    memory-reference instructions, thus limiting each
    instruction to one basic action, a design that
    apparently forces operate instructions to be
    combined with conditional branch instructions
    seems to be the opposite of RISC. I presume that
    they _don't_ solve the problem by including a
    conditional skip in operate instructions, that
    can skip over a jump instruction that follows
    them (sort of like a PDP-8!)... clearly, I'll
    need to take another look at the MIPS and/or
    the Alpha to see what it is that they _are_
    doing, to understand how it fits into the
    RISC philosophy.

    Many RISCs use the CMP-BC style as 1 instruction.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Jan 12 19:50:14 2024
    Scott Lurndal wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >>>>instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    Yes, I think there are crypto instructions or somesuch that handle a
    lot of registers in one instruction, and I guess that they take
    multiple cycles, so accessing many registers can be distributed across multiple cycles.

    These aren't crypto. They are intended to allow atomic
    64-byte transactions initiated by the cpu, generally to
    on-chip coprocessors (it's called FEAT_LS64).

    They use eight consecutive registers.

    Only consecutive before renaming.......

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jan 12 19:45:36 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >>>instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    Yes, I think there are crypto instructions or somesuch that handle a
    lot of registers in one instruction, and I guess that they take
    multiple cycles, so accessing many registers can be distributed across multiple cycles.

    These aren't crypto. They are intended to allow atomic
    64-byte transactions initiated by the cpu, generally to
    on-chip coprocessors (it's called FEAT_LS64).

    They use eight consecutive registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Fri Jan 12 22:01:19 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    But
    apparently ARM, Apple, and others have found ways to implement
    auto-increment at bearable cost.

    It is certainly possible to crack such an instruction into two
    micro-ops, liker POWER does with ldu and friends. If you have a
    mechanism for cracking into micro-ops, that cost certainly looks
    bearable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sat Jan 13 04:38:22 2024
    On Tue, 19 Dec 2023 17:47:25 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 07:22:10 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:

    I changed where, in the opcode space, the supplementary
    memory-reference instructions were located. This allowed me to have a
    few more bits available for them.

    I've moved them again, making even more space available... because in my
    last change, I made the mistake of using the opcode space that I was
    already using for block headers. I couldn't reduce the amount of
    information in a block header by two bits, by using a combination of ten
    bits instead of eight to indicate a block header, so I had to do my
    rearranging in this place instead.

    And now, with what I've learned from this experience, I've made further changes. I've increased the length of the opcode field in the supplementary memory-reference instructions that were moved to be among the other memory-reference instructions, so as to have enough for the different
    sizes of the various types to be supported.

    But in addition, I have now engaged in what some may see as an act of
    pure evil.

    Once again there are supplementary memory-reference instructions among
    the operate instructions as well. *These*, however, provide, for the conventional integer and floating-point types, CISC-style memory-to-
    register operate instructions! So even within the basic 32-bit instruction set, although _these_ instructions are highly restricted in register use
    and addressing modes, the pretense of being a load-store architecture
    has been dropped!

    I have made further changes to both of these types of supplementary memory-reference instructions.

    The ones that provide load-store memory-reference instructions have had
    the restrictions on their addressing modes slightly relaxed by means of shrinking the opcode field by one bit.

    The ones that provide memory to register operate instructions of the most common types have had the restrictions on their addressing modes slightly relaxed by restricting them to aligned operands.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jan 16 16:46:13 2024
    MitchAlsup [2024-01-10 22:32:50] wrote:
    Stefan Monnier wrote:
    MitchAlsup [2024-01-10 22:32:50] wrote:
    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline
    Sorry, what's "Calk"?
    A calculation instruction {ADD, AND, ...}

    Hmmm, what's the benefit of co-issuing an ST with a Calk?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Jan 17 00:18:37 2024
    Stefan Monnier wrote:

    MitchAlsup [2024-01-10 22:32:50] wrote:
    Stefan Monnier wrote:
    MitchAlsup [2024-01-10 22:32:50] wrote:
    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline
    Sorry, what's "Calk"?
    A calculation instruction {ADD, AND, ...}

    Hmmm, what's the benefit of co-issuing an ST with a Calk?

    The opportunity is that there are register file ports available.

    CoIssuing ST with Calk allows a 1-wide machine to perform 2 inst
    in a single beat down the pipeline.

    Not all STs can CoIssue with all Calks {this is a register counting
    problem.}
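
    A toy illustration of that register-counting problem; the port numbers
    below are invented, not My 66150's actual register-file configuration.

      /* Can two instructions share one issue beat without exceeding the
         register file's read/write ports?                                  */
      struct need { int reads, writes; };

      static int can_coissue(struct need a, struct need b,
                             int read_ports, int write_ports) {
          return a.reads  + b.reads  <= read_ports &&
                 a.writes + b.writes <= write_ports;
      }

      /* e.g. ST Rd,[Rb+disp] needs 2 reads, 0 writes; ADD Rd,Rs1,Rs2 needs
         2 reads, 1 write: together they fit a 4-read, 1-write file, but an
         ST with [Rb+Ri] indexing (3 reads) plus that ADD would not. */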


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Paul A. Clayton on Sat Jan 20 18:41:35 2024
    On Fri, 19 Jan 2024 16:16:07 -0500, Paul A. Clayton wrote:

    RISC-V has enough inelegance that considering it a model of
    perfection implies, in my opinion, significant noobiness (or
    perhaps what I might consider poor taste).

    I feel assured that my efforts with regard to Concertina II
    are so obscure that there is no real danger that those who
    see RISC-V as a model of perfection _compared to it_ will
    wield considerable influence in its favor...

    Courage often is doing the right thing even when it is (by all
    rational examination) pointless.

    The duty of the courageous soldier is to work towards
    achieving victory. Choose when to fight; avoid wasting
    energy, resources, and men in losing battles.

    Yes, the common soldier is a tool in the hands of his
    general, but when, as an individual, one is one's own
    general, then one bears the same responsibilities as
    a real general.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Quadibloc on Thu Feb 15 20:23:30 2024
    On Fri, 12 Jan 2024 04:09:30 -0000 (UTC), Quadibloc wrote:

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a single-byte operand, to avoid the issue.

    That assumes it doesn’t create a new issue. Like some 16-bit architecture
    I remember from the 1980s, that could not do single-byte bus cycles. So
    writing a byte involved a read-modify-write sequence of bus operations.
    Try doing that atomically ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Feb 15 20:58:55 2024
    Lawrence D'Oliveiro wrote:

    On Fri, 12 Jan 2024 04:09:30 -0000 (UTC), Quadibloc wrote:

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a
    single-byte operand, to avoid the issue.

    That assumes it doesn’t create a new issue. Like some 16-bit architecture
    I remember from the 1980s, that could not do single-byte bus cycles. So writing a byte involved a read-modify-write sequence of bus operations.
    Try doing that atomically ...

    Do not allow the arbiter to allow any other access between the Rd and the Wt.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Fri Feb 16 15:20:40 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 1/10/2024 7:03 PM, Chris M. Thomasson wrote:
    On 11/13/2023 7:10 AM, Terje Mathisen wrote:

    Actually, it was experimented with wrt artificially triggering a bus
    lock on Intel via unaligned access and dummy LOCK RMW (iirc) to implement a user space RCU wrt remote memory barriers. Dave Dice comes
    to mind. I am having trouble trying to find the god damn paper! I know I
    read it before.


    I need to point out that that unaligned access that would trigger an
    actual bus lock is when the access straddled a l2 cache line wrt the
    LOCK'ed RMW.

    You don't actually _need_ an unaligned access to trigger an actual
    bus lock - if you can arrange for sufficient contention to a single
    line, the processor may eventually grab the bus lock to make forward
    progress.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Fri Feb 16 21:39:08 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 2/16/2024 1:05 PM, Chris M. Thomasson wrote:
    On 2/16/2024 7:20 AM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 1/10/2024 7:03 PM, Chris M. Thomasson wrote:
    On 11/13/2023 7:10 AM, Terje Mathisen wrote:

    Actually, it was experimented with wrt artificially triggering a bus
    lock on Intel via unaligned access and dummy LOCK RMW (iirc) to
    implement a user space RCU wrt remote memory barriers. Dave Dice comes
    to mind. I am having trouble trying to find the god damn paper! I know I
    read it before.


    I need to point out that that unaligned access that would trigger an
    actual bus lock is when the access straddled a l2 cache line wrt the
    LOCK'ed RMW.

    You don't actually _need_ an unaligned access to trigger an actual
    bus lock - if you can arrange for sufficient contention to a single
    line, the processor may eventually grab the bus lock to make forward
    progress.

    True, but I think a LOCK'ed RMW on unaligned memory that straddles a
    cache line triggers one right off the bat? There was something called
    QPI that abused this to get remote memory barriers. I got a response a
    while back from my friend Dmitry Vyukov that we both read the paper but
    it seems to have been taken down. Dave Dice comes to mind.

    I remember the Q in QPI was for quiescence.

    QuickPath Interconnect. Like HT (Hyper Transport) but from Intel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)