Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
So what I had done was, after squeezing as much as I could into a basic instruction format, I provided for switching into alternate instruction formats which made different compromises by using the block headers.
This has now been dropped. Since I managed to get the normal (unaligned) memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises in the basic instruction set, it wasn't needed to have multiple instruction formats.
I had to change the instructions longer than 32 bits to get them in the
basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
On 11/9/2023 12:50 PM, Thomas Koenig wrote:
So, r1 = r2 + r3 + offset.
Oh, that is even worse than I understood it as, namely:
Three registers is 15 bits plus a 16-bit offset, which gives you 31
bits. You're left with one bit of opcode, one for load and one for
store.
LDx Rd, (Rs, Disp16)
...
But, yeah, 1 bit of opcode clearly wouldn't work...
On Thu, 09 Nov 2023 21:38:31 +0000, Quadibloc wrote:
I want to "have my cake and eat it too" - to have a computer that's
just as good as a Power PC or a 68000 or a System/360, even though they
have different, incompatible, strengths that conflict with a computer
being able to be good at what each of them is good at simultaneously.
Actually, it's worse than that, since I also want the virtues of
processors like the TMS320C2000 or the Itanium.
Quadibloc <quadibloc@servername.invalid> schrieb:
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
So, r1 = r2 + r3 + offset.
Three registers is 15 bits plus a 16-bit offset, which gives you
31 bits. You're left with one bit of opcode, one for load and
one for store.
The /360 had 12 bits for three registers plus 12 bits of offset, so
24 bits left eight bits for the opcode (the RX format).
So, if you want to do this kind of thing, why not go for a full 32-bit
offset in a second 32-bit word?
[...]
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Have you ever written an assembler for your ISA?
On Thu, 09 Nov 2023 15:36:12 -0600, BGB-Alt wrote:
On 11/9/2023 12:50 PM, Thomas Koenig wrote:
So, r1 = r2 + r3 + offset.
Oh, that is even worse than I understood it as, namely:
Three registers is 15 bits plus a 16-bit offset, which gives you 31
bits. You're left with one bit of opcode, one for load and one for
store.
LDx Rd, (Rs, Disp16)
...
But, yeah, 1 bit of opcode clearly wouldn't work...
And indeed, he is correct, that is what I'm trying to do.
But I easily solve _most_ of the problem.
I just use 3 bits each for the index register and the base register.
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
16-bit register-to-register instructions use eight bits to specify their source and destination registers, so both registers must be from the same group of eight registers.
This lends itself to writing code where four distinct threads are interleaved, helping pipelining in implementations too cheap to have out-of-order execution.
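To make the grouping concrete, here is a minimal sketch in C of how such an operand byte could expand (the bit layout is an illustration, not taken from the posted spec): two group bits select one of the four groups of eight, and two 3-bit fields pick the source and destination within it.
/* Hypothetical operand byte: gg sss ddd */
static void expand_pair(unsigned ops8, unsigned *src, unsigned *dst)
{
    unsigned group = (ops8 >> 6) & 0x3;      /* one of four groups of eight */
    *src = group * 8 + ((ops8 >> 3) & 0x7);  /* register within the group */
    *dst = group * 8 + (ops8 & 0x7);
}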
The index register can be one of registers 1 to 7 (0 means no indexing).
The base register can be one of registers 25 to 31. (24, or a 0 in the three-bit base register field, indicates a special addressing mode.)
This is sort of reminiscent of System/360 coding conventions.
The special addressing modes do stuff like using registers 17 to 23 as
base registers with a 12 bit displacement, so that additional short
segments can be accessed.
As I noted, shaving off two bits each from two fields gives me four more bits, and five bits is exactly what I need for the opcode field.
Unfortunately, I needed one more bit, because I also wanted 16-bit
instructions, and they take up too much space. That led me to some
interesting gyrations, but I finally found a compromise for saving those
bits that was acceptable enough that I could drop the option of using the
block header to switch to "full" instructions instead. Finally!
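Setting aside the extra bit mentioned above, the field arithmetic works out: 5 opcode + 5 destination + 3 index + 3 base + 16 displacement = 32 bits. A decode sketch in C, with hypothetical field positions rather than the ones in the actual spec:
#include <stdint.h>
/* Assumed layout: op[31:27] rd[26:22] ix[21:19] rb[18:16] disp[15:0].
   ix = 0 means no indexing, else r1..r7; rb = 0 selects a special
   addressing mode, else r25..r31. */
struct memref { unsigned op, rd, ix, rb; uint32_t disp; };
static struct memref decode_memref(uint32_t insn)
{
    struct memref m;
    m.op   = (insn >> 27) & 0x1F;
    m.rd   = (insn >> 22) & 0x1F;
    m.ix   = (insn >> 19) & 0x07;
    m.rb   = (insn >> 16) & 0x07;
    m.disp = insn & 0xFFFF;
    return m;
}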
In article <uijjoj$2dc2i$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:
Actually, it's worse than that, since I also want the virtues of
processors like the TMS320C2000 or the Itanium.
What do you consider the virtues of Itanium to be?
On 11/9/2023 3:51 PM, Quadibloc wrote:
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.
Quadibloc wrote:
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
My 66000 has all of this.
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
The simple/easy ones definitely, the ones with longer displacements no.
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
Block headers are simply consuming entropy.
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without
compromises in the basic instruction set, it wasn't needed to have
multiple instruction formats.
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
I had to change the instructions longer than 32 bits to get them in
the basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Yet, mine remains simple and compact.
John Savard
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
On 11/9/2023 3:51 PM, Quadibloc wrote:
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
It's only in the 16-bit operate instructions that this splitting of
registers is actively present as a constraint. It is needed to make
16-bit operate instructions possible.
So the cure is that if a compiler finds this too much trouble, it
doesn't have to use the 16-bit instructions.
Of course, if compilers can't use them, that raises the question of
whether 16-bit instructions are worth having. Without them, the
complications that I needed to be happy about my memory-reference instructions could have been entirely avoided.
Quadibloc <quadibloc@servername.invalid> writes:
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
On 11/9/2023 3:51 PM, Quadibloc wrote:
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
As soon as you make 'general purpose registers' not 'general'
you've significantly complicated register allocation in compilers
and likely caused additional memory accesses due to the need to
spill registers unnecessarily.
On 11/9/2023 7:11 PM, MitchAlsup wrote:
Quadibloc wrote:
Good to see you are back on here...
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
My 66000 has all of this.
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
The simple/easy ones definitely, the ones with longer displacements no.
Yes.
As noted a few times, as I see it, 9 .. 12 is sufficient.
Much less than 9 is "not enough", much more than 12 is wasting entropy,
at least for 32-bit encodings.
12u-scaled would be "pretty good", say, being able to handle 32K for
QWORD ops.
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
Block headers are simply consuming entropy.
Also yes.
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without
compromises in the basic instruction set, it wasn't needed to have
multiple instruction formats.
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
In my case, it is only for 128-bit load/store operations, which require
64-bit alignment.
Well, and an esoteric edge case:
if((PC&0xE)==0xE)
You can't use a 96-bit encoding, and will need to insert a NOP if one
needs to do so.
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
I had to change the instructions longer than 32 bits to get them in
the basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Yet, mine remains simple and compact.
Mostly similar.
Though, I guess some people could debate this in my case.
Granted, I specify the entire ISA in a single location, rather than
spreading it across a bunch of different documents (as was the case with RISC-V).
Well, and where there is a lot that is left up to the specific hardware implementations in terms of stuff that one would need to "actually have
an OS run on it", ...
John Savard
BGB wrote:
On 11/9/2023 7:11 PM, MitchAlsup wrote:
Quadibloc wrote:
Good to see you are back on here...
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
My 66000 has all of this.
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
The simple/easy ones definitely, the ones with longer displacements no.
Yes.
As noted a few times, as I see it, 9 .. 12 is sufficient.
Much less than 9 is "not enough", much more than 12 is wasting
entropy, at least for 32-bit encodings.
Can you suggest something I could have done by sacrificing 16-bits
down to 12-bits that would have improved "something" in my ISA ??
{{You see I did not have any trouble in having all 16-bits for MEM
references--just like having 16-bits for integer, logical, and branch
offsets.}}
12u-scaled would be "pretty good", say, being able to handle 32K for
QWORD ops.
IBM 360 found so, EMBench is replete with stack sizes and struct sizes
where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit...
Exactly the difference between 12-bits and 14-bits....
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
Block headers are simply consuming entropy.
Also yes.
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without
compromises in the basic instruction set, it wasn't needed to have
multiple instruction formats.
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
In my case, it is only for 128-bit load/store operations, which
require 64-bit alignment.
VVM does all the wide stuff without necessitating the wide stuff in
registers or instructions.
Well, and an esoteric edge case:
if((PC&0xE)==0xE)
You can't use a 96-bit encoding, and will need to insert a NOP if one
needs to do so.
Ehhhhh...
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
I had to change the instructions longer than 32 bits to get them in
the basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it
for a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Yet, mine remains simple and compact.
Mostly similar.
Though, I guess some people could debate this in my case.
Granted, I specify the entire ISA in a single location, rather than
spreading it across a bunch of different documents (as was the case
with RISC-V).
Well, and where there is a lot that is left up to the specific
hardware implementations in terms of stuff that one would need to
"actually have an OS run on it", ...
John Savard
On 11/10/2023 8:51 AM, Scott Lurndal wrote:
Quadibloc <quadibloc@servername.invalid> writes:
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
On 11/9/2023 3:51 PM, Quadibloc wrote:
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
As soon as you make 'general purpose registers' not 'general'
you've significantly complicated register allocation in compilers
and likely caused additional memory accesses due to the need to
spill registers unnecessarily.
Yeah.
Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather suck".
Or, even smaller cases, like, "most instructions can use all the
registers, but these ops only work on a subset" is kind of an annoyance
(this is a big part of why I bothered with the whole XG2 thing).
Much better to have a big flat register space.
Quadibloc <quadibloc@servername.invalid> schrieb:
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
On 11/9/2023 3:51 PM, Quadibloc wrote:
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
This breaks with the central tenet of the /360, the PDP-11,
the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.
This would make your ISA very un-S/360-like.
On 11/10/2023 12:22 PM, MitchAlsup wrote:
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
Wait, are you arguing for aligned-only memory ops here?...
But, yeah, for me, a major selling point for unaligned access is mostly
that I can copy blocks of memory around like:
v0=((uint64_t *)cs)[0];
v1=((uint64_t *)cs)[1];
v2=((uint64_t *)cs)[2];
v3=((uint64_t *)cs)[3];
((uint64_t *)ct)[0]=v0;
((uint64_t *)ct)[1]=v1;
((uint64_t *)ct)[2]=v2;
((uint64_t *)ct)[3]=v3;
cs+=32; ct+=32;
For Huffman, some of the fastest strategies to implement the bitstream reading/writing tend to be to casually make use of unaligned access (shifting in and loading bytes is slower in comparison).
Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive cores
and similar).
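For reference, the unaligned-window bit reader alluded to here looks roughly like the following sketch (assuming little-endian byte order and 8 bytes of readable slack past the end of the buffer; this is the standard branch-light refill idiom, not code from the thread):
#include <stdint.h>
#include <string.h>
typedef struct { const uint8_t *p; uint64_t win; unsigned nbits; } BitRdr;
static uint32_t get_bits(BitRdr *br, unsigned n)   /* n in 1..31 */
{
    if (br->nbits < n) {
        uint64_t v;
        memcpy(&v, br->p, sizeof v);      /* one unaligned 64-bit load */
        br->win |= v << br->nbits;        /* splice above the valid bits */
        br->p += (63 - br->nbits) >> 3;   /* whole bytes now consumed */
        br->nbits |= 56;                  /* window holds 56..63 bits */
    }
    uint32_t r = (uint32_t)br->win & ((1u << n) - 1u);
    br->win >>= n;
    br->nbits -= n;
    return r;
}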
BGB wrote:
On 11/10/2023 12:22 PM, MitchAlsup wrote:
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
Wait, are you arguing for aligned-only memory ops here?...
No, I am arguing that all memory references are inherently unaligned,
but where aligned references never suffer a stall penalty; and the
compiler does not need to understand if the reference is aligned or
unaligned.
But, yeah, for me, a major selling point for unaligned access is
mostly that I can copy blocks of memory around like:
v0=((uint64_t *)cs)[0];
v1=((uint64_t *)cs)[1];
v2=((uint64_t *)cs)[2];
v3=((uint64_t *)cs)[3];
((uint64_t *)ct)[0]=v0;
((uint64_t *)ct)[1]=v1;
((uint64_t *)ct)[2]=v2;
((uint64_t *)ct)[3]=v3;
cs+=32; ct+=32;
MM Rcs,Rct,#length // without the for loop
For Huffman, some of the fastest strategies to implement the bitstream
reading/writing, tend to be to casually make use of unaligned access
(shifting in and loading bytes is slower in comparison).
Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive
cores and similar).
Traps to perform unaligned accesses are so 1985... either don't allow
them at all (SIGSEGV) or treat them as first-class citizens. The former
fails in the market.
BGB <cr88192@gmail.com> writes:
On 11/10/2023 12:22 PM, MitchAlsup wrote:
BGB wrote:
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
Hashing
Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive cores
and similar).
Let's see what this SiFive U74 does:
[fedora-starfive:~/nfstmp/gforth-riscv:98397] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye "
Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye ':
469832112 instructions:u # 0.79 insn per cycle
591015904 cycles:u
0.609751748 seconds time elapsed
0.533195000 seconds user
0.061522000 seconds sys
[fedora-starfive:~/nfstmp/gforth-riscv:98398] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye "
Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye ':
53533370273 instructions:u # 0.77 insn per cycle
69304924487 cycles:u
69.368484169 seconds time elapsed
69.256290000 seconds user
0.049997000 seconds sys
So when we do aligned accesses (first command), the code performs 4.7 instructions and 5.9 cycles per load, while for unaligned accesses
(second command) the same code performs 535.3 instructions and 693.0
cycles per load. So apparently an unaligned load triggers >500
additional instructions, confirming your claim. Interestingly, all
that is attributed to user time; maybe the fixup is performed by a
user-level trap or microcode.
Still, the approach of having separate instructions for aligned and
unaligned accesses (typically with several instructions for the
unaligned case) has been tried and discarded. Software just does not
declare that some access will be unaligned.
A particularly strong evidence for this is that gas generated
non-working code for ustq (unaligned store quadword) on Alpha for
several years, and apparently nobody noticed until I gave an exercise
to my students where they should use ustq (so no production use,
either).
So, every general-purpose architecture, including RISC-V, the
spiritual descendant of MIPS and Alpha (which had the division),
settled on having memory access instructions that perform both aligned
and unaligned accesses (with performance advantages for aligned
accesses).
If RISC-V implementations want to perform well for code that uses
unaligned accesses for memory copying, compression/decompression, or
hashing, they will eventually have to implement unaligned accesses
more efficiently, but at least the code works, and aligned accesses
are fast.
Why would you not go the same way? It would also save on instruction encoding space.
- anton
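A related note: the portable way for software to express such an unaligned access in C is a small memcpy, which compilers turn into a single load on targets where the hardware allows it; a minimal sketch:
#include <stdint.h>
#include <string.h>
/* Compiles to one load on hardware with unaligned support,
   byte assembly elsewhere; no trap either way. */
static inline uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}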
In article <2023Nov11.082221@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
So apparently an unaligned load triggers >500 additional instructions,
confirming your claim.
Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
least obvious.
On 11/11/2023 1:22 AM, Anton Ertl wrote:
Hashing
Possibly true.
True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned
accesses, and then compiled by package maintainers (who often are
not that familiar with the software) on a lot of platforms, the end
result was that the kernel by default performed a fixup (and put a
message in the dmesg buffer) instead of delivering a SIGBUS.
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions. So
an alternative is to break the requirement that all register specifier
fields in the instruction be the same length. So, for example, allow
access to all registers from one source operand position, but say only
half from the other source operand position. So, for a system with 32 registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
as with commutative operations like adds, this doesn't hurt at all.
Yes, this makes register allocation in the compiler harder. And
occasionally you might need an extra instruction to copy a value to the
half size field, but on high end systems, this can be done in the rename stage without taking an execution slot.
A more extreme alternative is to only allow the destination field to
also be one bit smaller. Of course, this makes things even harder for
the compiler, and probably requires extra "copy" instructions more frequently, but sometimes you just gotta do what you gotta do. :-(
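A sketch of the encoding constraint described here, for a hypothetical 32-bit operate format with a 4-bit second source field (field positions invented for illustration):
#include <assert.h>
#include <stdint.h>
/* Assumed layout: op[31:14] rd[13:9] rs1[8:4] rs2[3:0].
   rs2 can only name r0..r15; for a commutative op such as ADD the
   operands can simply be swapped to satisfy the constraint. */
static uint32_t encode_add(unsigned op, unsigned rd,
                           unsigned rs1, unsigned rs2)
{
    if (rs2 > 15 && rs1 <= 15) {   /* swap into the short field */
        unsigned t = rs1; rs1 = rs2; rs2 = t;
    }
    assert(rd < 32 && rs1 < 32 && rs2 < 16);
    return ((uint32_t)op << 14) | (rd << 9) | (rs1 << 4) | rs2;
}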
Stephen Fuld wrote:
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length. So, for
example, allow
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
access to all registers from one source operand position, but say only
half from the other source operand position. So, for a system with 32
registers, you would need 5 plus 5 plus 4 bits. Much of the time,
such as with commutative operations like adds, this doesn't hurt at all.
Yes, this makes register allocation in the compiler harder. And
occasionally you might need an extra instruction to copy a value to
the half size field, but on high end systems, this can be done in the
rename stage without taking an execution slot.
A more extreme alternative is to only allow the destination field to
also be one bit smaller. Of course, this makes things even harder for
the compiler, and probably requires extra "copy" instructions more
frequently, but sometimes you just gotta do what you gotta do. :-(
In article <2023Nov11.112254@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned
accesses, and then compiled by package maintainers (who often are
not that familiar with the software) on a lot of platforms, the end
result was that the kernel by default performed a fixup (and put a
message in the dmesg buffer) instead of delivering a SIGBUS.
Yup. The software I work on is meant, in itself, to work on platforms
that enforce alignment, and it was a useful catcher for some kinds of bug.
However, I'm now down to one that actually enforces it, in SPARC Solaris,
and that isn't long for this world.
I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
On 11/10/2023 11:22 PM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:[...]
On 11/10/2023 12:22 PM, MitchAlsup wrote:
BGB wrote:
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
Hashing
Fwiw, proper alignment is very important wrt a programmer to gain some
of the benefits of, basically, virtually "any" target architecture. For instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
the programmer can set up an array that is aligned on a cache line
boundary and pad each element of said array up to the size of a L2 cache line.
Two steps... Align your memory on a proper cache line boundary, and pad
the size of each element up to the size of a single cache line.
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
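The two steps translate directly into C11; a sketch assuming a 64-byte line (and that the reservation granule is no larger):
#include <stdalign.h>
#include <stdint.h>
#define LINE 64   /* assumed cache-line / reservation-granule size */
/* Line-aligned and padded to a full line, so adjacent elements never
   share a line -- and never share an LL/SC reservation granule. */
struct padded_counter {
    alignas(LINE) uint64_t value;
    char pad[LINE - sizeof(uint64_t)];
};
static struct padded_counter counters[16];  /* array starts line-aligned */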
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:
On 11/10/2023 11:22 PM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:[...]
On 11/10/2023 12:22 PM, MitchAlsup wrote:
BGB wrote:
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
Hashing
Fwiw, proper alignment is very important wrt a programmer to gain some
of the benefits of, basically, virtually "any" target architecture. For
instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
the programmer can set up an array that is aligned on a cache line
boundary and pad each element of said array up to the size of a L2 cache
line.
Two steps... Align your memory on a proper cache line boundary, and pad
the size of each element up to the size of a single cache line.
For elements smaller than a cache line, that makes little
sense as written. I think there is an unwritten assumption
"for elements larger than a cache line" there, or we would all
be using 64-byte bools.
jgd@cix.co.uk (John Dallman) writes:
I dug into what it would take to have x86-64 Linux work with
alignment enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
What do you consider the virtues of Itanium to be?
Well, I think that superscalar operation of microprocessors is a good
thing. Explicitly indicating which instructions may execute in parallel
is one way to facilitate that. Even if the Itanium was an unsuccessful
implementation of that principle.
I gather it is still useful for embedded or realtime applications which
are fairly regular and for cost or power reasons you want to minimize
the number of transistors.
jgd@cix.co.uk (John Dallman) writes:
I dug into what it would take to have x86-64 Linux work with
alignment enforcement turned on, and it's a huge job.
I did a first attempt in the IA-32 days, and there I found that the
alignment requirements of the hardware were incompatible with the
ABI (which required 4-byte alignment for 8-byte FP numbers).
My second attempt was with AMD64, and there I found that gcc
produced misaligned 16-bit memory accesses for stuff like
strcpy(buf, "a"). I did not try to disable this with a flag
at the time, but maybe -fno-tree-vectorize would help. But
even if I use that for my code, I would also have to recompile
all the libraries with that flag.
I would be surprised if ARM A64 did not have the same problems
(except the idiotic incompatibility between Intel ABI and Intel
hardware).
In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home (Scott
Lurndal) wrote:
jgd@cix.co.uk (John Dallman) writes:
I dug into what it would take to have x86-64 Linux work with
alignment enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
I'll take a look, but I doubt glibc on Aarch64 is built to be run with
alignment trapping. Should it be EL0 for usermode?
jgd@cix.co.uk (John Dallman) writes:
In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home
(Scott Lurndal) wrote:
jgd@cix.co.uk (John Dallman) writes:
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
I'll take a look, but I doubt glibc on Aarch64 is built to be run
with alignment trapping. Should it be EL0 for usermode?
The EL1 in the register name describes the minimum exception level
allowed to access the register. SCTLR_EL1 includes control bits
for both EL1 and EL0.
Chris M. Thomasson wrote:
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
container. Only aligned containers possess ATOMIC-smelling properties.
Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.
Unless, maybe, registers were being treated like a stack, but even then,
this is still gonna suck.
Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
On 11/9/2023 10:37 PM, Quadibloc wrote:
For performance optimized cases, I am starting to suspect 16-bit ops are
not worth it.
For size optimization, they make sense; but size optimization also means mostly confining register allocation to R0..R15 in my case, with
heuristics for when to enable additional registers, where enabling the
higher registers effectively hinders the use of 16-bit instructions.
The other option I have found is that, rather than optimizing for
smaller instructions (as in an ISA with 16 bit instructions), one can
instead optimize for doing stuff in as few instructions as it is
reasonable to do so, which in turn further goes against the use of
16-bit instructions.
On Fri, 10 Nov 2023 01:11:13 +0000, MitchAlsup wrote:
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
Since I have a complete set of memory-reference instructions for which unaligned memory-reference instructions are supported, the problem isn't
that I think unaligned fetches and stores take too many gates.
Rather, being able to only specify aligned accesses saves *opcode space*,
which lets me fit in one complete set of memory-reference instructions that can use all the base registers, all the index registers, and always use all the registers as destination registers.
While the unaligned-capable instructions, that offer also important additional addressing modes, had to have certain restrictions to fit in.
So they use six out of the seven index registers, they can use only half
the registers as destination registers on indexed accesses, and they use
four out of the seven base registers.
Having 16-bit instructions for the possibility of more compact code meant that I had to have at least one of the two restrictions noted above -
having both restrictions meant that I could offer the alternative of aligned-only instructions with neither restriction, which may be far less painful for some.
John Savard
On 11/10/2023 8:51 AM, Scott Lurndal wrote:
As for register arguments:
* Probably 8 or 16.
** 8 makes the most sense with 32 GPRs.
*** 16 is asking too much.
*** 8 deals with around 98% of functions.
** 16 makes sense with 64 GPRs.
*** Nearly all functions can use exclusively register arguments.
*** Gain is small though, if it only benefits 2% of functions.
*** It is almost a "shoo-in", except for cost of fixed spill space
*** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
*** Though, an ABI could decide to not have a spill space in this way.
Though, admittedly, for a lot of my programs I had still ended up going
with 8 register arguments with 64 GPRs, mostly as the gains of 16
arguments is small, relative of the cost of spending an additional 64
bytes in nearly every stack frame (and also there are still some
unresolved bugs when using 16 argument mode).
....
Current leaning is also that:
32-bit primary instruction size;
32/64/96 bit for variable-length instructions;
Is "pretty good".
In performance-oriented use cases, 16-bit encodings "aren't really worth
it".
In cases where you need a 32 or 64 bit value, being able to encode them
or load them quickly into a register is ideal. Spending multiple
instructions to glue a value together isn't ideal, nor is needing to
load it from memory (this particularly sucks from the compiler POV).
As for addressing modes:
(Rb, Disp) : ~ 66-75%
(Rb, Ri) : ~ 25-33%
Can address the vast majority of cases.
Displacements are most effective when scaled by the size of the element
type, as unaligned displacements are exceedingly rare. The vast majority
of displacements are also positive.
Not having a register-indexed mode is shooting oneself in the foot, as
these are "not exactly rare".
Most other possible addressing modes can be mostly ignored.
Auto-increment becomes moot if one has superscalar or VLIW;
(Rb, Ri, Disp) is only really applicable in niche cases
Eg, array inside struct, etc.
...
RISC-V did sort of shoot itself in the foot in several of these areas,
albeit with some workarounds in "Bitmanip":
SHnADD, can mimic a LEA, allowing array access in fewer ops.
PACK, allows an inline 64-bit constant load in 5 instructions...
LUI+ADD+LUI+ADD+PACK
...
Still not ideal...
An extra cycle for memory access is not ideal for a close second place addressing mode; nor are 64-bit constants rare enough that one
necessarily wants to spend 5 or so clock cycles on them.
But, still better than the situation where one does not have these instructions.
....
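To make the SHnADD point concrete: indexing an array of 64-bit elements is one instruction with a scaled (Rb, Ri) mode, two with Zba, three without (the sequences below are the standard RISC-V idioms; register choices are arbitrary):
#include <stdint.h>
/* arr[i]:
     with Zba:     sh3add t0, a1, a0   # t0 = (a1 << 3) + a0
                   ld     a0, 0(t0)
     without Zba:  slli   t0, a1, 3
                   add    t0, t0, a0
                   ld     a0, 0(t0)   */
uint64_t get(const uint64_t *arr, long i) { return arr[i]; }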
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:...
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
...Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.
In article <FlS3N.25739$_Oab.3565@fx15.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
jgd@cix.co.uk (John Dallman) writes:
In article <2023Nov11.112254@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned
accesses, and then compiled by package maintainers (who often are
not that familiar with the software) on a lot of platforms, the end
result was that the kernel by default performed a fixup (and put a
message in the dmesg buffer) instead of delivering a SIGBUS.
Yup. The software I work on is meant, in itself, to work on platforms
that enforce alignment, and it was a useful catcher for some kinds of bug.
However, I'm now down to one that actually enforces it, in SPARC Solaris,
and that isn't long for this world.
I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
On Aarch64, with GCC at least, you also need to specify "-mstrict-align"
when compiling all source code, to prevent the compiler from assuming it
can access structure fields in an unaligned way, even if all of your
code accesses are fully aligned. GCC can mess around behind your back, changing ptr->array32[1] = 0 and ptr->array32[2] = 0 into a single
64-bit write of ptr->array32[1] = 0, among other things. If the offset
of array32[1] wasn't 64-bit aligned, it's an alignment trap if
SCTLR_EL1.A=1.
On all Arm systems, Device memory accesses must always be aligned. User code in general does not get access to Device memory, so this does not affect regular users.
Kent
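A sketch of the merge described above, assuming a layout where array32[1] falls at offset 4 (4-byte but not 8-byte aligned):
#include <stdint.h>
struct s { uint32_t array32[4]; };
void clear_two(struct s *ptr)
{
    /* GCC may fuse these into one 64-bit store at offset 4, which
       faults under SCTLR_EL1.A=1 unless built with -mstrict-align. */
    ptr->array32[1] = 0;
    ptr->array32[2] = 0;
}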
I am not buying this. Which takes more opcode space::
a) an ISA with unaligned-only LDs and STs (11)
or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
(another 11)
Quadibloc <quadibloc@servername.invalid> writes:
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:....
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
....Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.
It works for the RISC-V C (compressed) extension. Some of these
compressed instructions use registers 8-15 (others use all 32
registers, but have other restrictions). But it works fine exactly
because, if your register usage does not fit the limitations of the
16-bit encoding, you just use the 32-bit version of the instruction.
It seems that they designed the ABI such that registers 8-15 occur
often in the code. Maybe the gcc maintainer also put some work into preferring these registers.
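Per the RISC-V spec, the compressed formats' 3-bit register fields (rd', rs1', rs2') expand by adding 8, which fits with an ABI that keeps hot values in x8..x15:
/* RVC 3-bit register field -> architectural register number. */
static unsigned expand_creg(unsigned r3)
{
    return 8 + (r3 & 0x7);
}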
OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
instruction sets with their A32/T32 instruction set(s), designed their
A64 instruction set to strictly use 32-bit instructions.
So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
32-bit instructions, why make your task harder by also implementing
short instructions? Of course, if that is your goal or you have fun
with this, why not? But if you want to make progress, it seems to be something that can be skipped.
- anton
On Sun, 12 Nov 2023 21:25:20 +0000, MitchAlsup wrote:
I am not buying this. Which takes more opcode space::
a) an ISA with unaligned only LDs and STs (11)
or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
(another 11)
That is true, *other things being equal*.
However, what I had was:
An ISA with unaligned loads and stores, that could use all 32 destination registers, and all 8 index and base registers. (Call this A)
That took up too much opcode space to allow 16-bit instructions.
So I made various compromises to shave one bit off the loads and stores,
and then I could have 16 bit instructions. (Call this B)
But I didn't like the compromises.
So I made _more_ compromises, to shave _another_ bit off the loads and stores. This way, I had enough opcode space to add aligned-only loads
and stores... that could use all 32 destination registers, and all 8
index and base registers. (Call this C)
Since other things _were not equal_, it was perfectly possible for C
to use less opcode space than A, and about the same amount of opcode
space as B. So I got to use 16-bit instructions AND have a set of loads
and stores that used all 32 destnation registers, and all 8 index and
base registers.
The compromises on the _unaligned_ loads and stores were painful, but
they were chosen so that code using them wouldn't have to be
significantly less efficient than code with the set of loads and stores
in A.
John Savard
Does your compiler agree with this assertion ??
BGB wrote:
On 11/10/2023 12:22 PM, MitchAlsup wrote:
BGB wrote:
One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
Wait, are you arguing for aligned-only memory ops here?...
I have not argued for aligned memory references since about 2000 (maybe
as early as 1991).
A poorly chosen starting point (dark alley)
Back out of the dark alley, and start from first principles again.
On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:
A poorly chosen starting point (dark alley)
Back out of the dark alley, and start from first principles again.
By the way, I think you mean a _blind_ alley.
A dark alley is just a dangerous place, since robbers can attack you
there without being seen.
A _blind_ alley is one that had no exit, one that is a dead end. That
seems to better fit the context of your remarks.
John Savard
BGB wrote:
On 11/10/2023 8:51 AM, Scott Lurndal wrote:
As for register arguments:
* Probably 8 or 16.
** 8 makes the most sense with 32 GPRs.
*** 16 is asking too much.
*** 8 deals with around 98% of functions.
** 16 makes sense with 64 GPRs.
*** Nearly all functions can use exclusively register arguments.
*** Gain is small though, if it only benefits 2% of functions.
*** It is almost a "shoo-in", except for cost of fixed spill space
*** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
*** Though, an ABI could decide to not have a spill space in this way.
For the reasons stated above (some clipped) I agree with this whole
block of statements.
Since My 66000 has 32 registers, I went with up to 8 arguments in
registers, up to 8 results in registers, with the 9th of either on the
stack, in such a way that if the callee is vararg the argument registers
can be pushed on the stack to form a memory-resident vector of arguments
{{just perfect for printf().}}
With 8 registers covering the 98%-ile of calls, there is too little left
by making this boundary 12-16, both of which ARE still possible.
Though, admittedly, for a lot of my programs I had still ended up
going with 8 register arguments with 64 GPRs, mostly as the gains of
16 arguments is small, relative of the cost of spending an additional
64 bytes in nearly every stack frame (and also there are still some
unresolved bugs when using 16 argument mode).
It is a delicate balance and it is easy to make the code look better
while actually running slower.
....
Current leaning is also that:
32-bit primary instruction size;
32/64/96 bit for variable-length instructions;
Is "pretty good".
In performance-oriented use cases, 16-bit encodings "aren't really
worth it".
In cases where you need a 32 or 64 bit value, being able to encode
them or load them quickly into a register is ideal. Spending multiple
instructions to glue a value together isn't ideal, nor is needing to
load it from memory (this particularly sucks from the compiler POV).
As for addressing modes:
(Rb, Disp) : ~ 66-75%
(Rb, Ri) : ~ 25-33%
Can address the vast majority of cases.
Displacements are most effective when scaled by the size of the
element type, as unaligned displacements are exceedingly rare. The
vast majority of displacements are also positive.
Not having a register-indexed mode is shooting oneself in the foot, as
these are "not exactly rare".
Most other possible addressing modes can be mostly ignored.
Auto-increment becomes moot if one has superscalar or VLIW;
(Rb, Ri, Disp) is only really applicable in niche cases;
Eg, array inside struct, etc. (see the sketch below).
...
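As a rough illustration (a hedged sketch; the mode annotations are
assumptions about how a typical compiler would lower each access, not a
claim about any particular ISA):

  #include <stdint.h>

  struct obj { int64_t hdr; int64_t tab[16]; };

  int64_t demo(struct obj *p, int64_t *a, int i) {
      int64_t x = p->hdr;     /* (Rb, Disp):     base + small offset  */
      int64_t y = a[i];       /* (Rb, Ri):       base + scaled index  */
      int64_t z = p->tab[i];  /* (Rb, Ri, Disp): array inside struct; */
                              /* without it, an extra add is needed   */
      return x + y + z;
  }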
RISC-V did sort of shoot itself in the foot in several of these areas,
albeit with some workarounds in "Bitmanip":
SHnADD, can mimic a LEA, allowing array access in fewer ops.
PACK, allows an inline 64-bit constant load in 5 instructions...
LUI+ADD+LUI+ADD+PACK
...
Still not ideal...
An extra cycle for memory access is not ideal for a close second place
addressing mode; nor are 64-bit constants rare enough that one
necessarily wants to spend 5 or so clock cycles on them.
But, still better than the situation where one does not have these
instructions.
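To make the cost concrete, here is the dataflow that 5-op sequence
computes, rendered in C (a sketch only; the actual RISC-V encodings are
not reproduced, and the constant is arbitrary):

  #include <stdint.h>

  /* PACK concatenates the low 32 bits of two sources; each 32-bit
     half itself costs a LUI+ADDI-style pair, hence 5 ops total. */
  static uint64_t pack64(uint32_t lo, uint32_t hi) {
      return ((uint64_t)hi << 32) | lo;
  }

  int main(void) {
      uint64_t k = pack64(0x89ABCDEFu, 0x01234567u);
      return k == 0x0123456789ABCDEFull ? 0 : 1;
  }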
....
Chris M. Thomasson wrote:
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned container. Only aligned containers possess ATOMIC-smelling properties.
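A minimal C11 sketch of that rule (the 64-byte granule size is an
assumption -- reservation granules vary by implementation): keep the
atomic container naturally aligned and alone on its granule, so a
CAS/LL-SC loop never straddles two granules and neighboring data
cannot false-share the granule:

  #include <stdatomic.h>
  #include <stdint.h>

  struct counter {
      _Alignas(64) _Atomic uint64_t value;    /* owns its granule */
      char pad[64 - sizeof(_Atomic uint64_t)];
  };

  static uint64_t bump(struct counter *c) {
      uint64_t old = atomic_load_explicit(&c->value, memory_order_relaxed);
      /* compare-exchange maps to an LL/SC loop on many RISCs */
      while (!atomic_compare_exchange_weak_explicit(
              &c->value, &old, old + 1,
              memory_order_acq_rel, memory_order_relaxed))
          ;                                   /* retry on SC failure */
      return old + 1;
  }

  int main(void) {
      struct counter c = { 0 };
      return bump(&c) == 1 ? 0 : 1;
  }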
In article <PQ74N.100$ayBd.39@fx07.iad>, scott@slp53.sl.home (Scott
Lurndal) wrote:
jgd@cix.co.uk (John Dallman) writes:
In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
jgd@cix.co.uk (John Dallman) writes:
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
I'll take a look, but I doubt glibc on AArch64 is built to be run
with alignment trapping. Should it be EL0 for usermode?
The EL1 in the register name describes the minimum exception level
allowed to access the register. SCTLR_EL1 includes control bits
for both EL1 and EL0.
Aha. It's harder for ARM64: I'd have to be in supervisor mode to set
that bit, and the stuff I work on is strictly application code.
Quadibloc <quadibloc@servername.invalid> writes:
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:...
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
...Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.
It works for the RISC-V C (compressed) extension. Some of these
compressed instructions use registers 8-15 (others use all 32
registers, but have other restrictions). But it works fine exactly
because, if your register usage does not fit the limitations of the
16-bit encoding, you just use the 32-bit version of the instruction.
It seems that they designed the ABI such that registers 8-15 occur
often in the code. Maybe the gcc maintainer also put some work into preferring these registers.
OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
instruction sets with their A32/T32 instruction set(s), designed their
A64 instruction set to strictly use 32-bit instructions.
So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
32-bit instructions, why make your task harder by also implementing
short instructions? Of course, if that is your goal or you have fun
with this, why not? But if you want to make progress, it seems to be something that can be skipped.
You might want to try and get fancy in your short instructions by "randomizing" the subset of registers they can access.
E.g. allow both your short LD and ST instruction access 16 registers but
not exactly the same 16.
Or allow your arithmetic instructions to access only 8 registers for
their input and output args but not exactly the same 8 for the two
inputs and/or for the output.
I suspect that if done well, it could give benefits similar to the skewed-associative caches. The other upside is that it makes register allocation *really* interesting, thus opening up opportunities to spend
a few more years working on that subproblem :-)
To up the ante, you could make the set of registers reachable from each instruction depend not just on the opcode but also on the instruction's address, so you can sometimes avoid a spill by swapping two
instructions. This would allow the register allocation to interact in
even more interesting ways with instruction scheduling.
There could be a few more PhDs worth of research there.
Stephen Fuld wrote:
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length. So, for
example, allow
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
On 11/11/2023 10:11 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length. So, for
example, allow
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the prefix instruction specify which register to use instead of the one specified
in the reduced register specifier for whichever instructions in its
shadow have the bit set in the prefix.
Worst case, this is the same as my original proposal - one extra, not
really executed, instruction (prefix versus register to register move)
for one where you need to use it, but this idea might, by allowing the
prefix to specify multiple instructions, save more than one extra
"instruction". The only downside is it requires an additional op code.
Stephen Fuld wrote:
On 11/11/2023 10:11 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length. So, for
example, allow
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the prefix
instruction specify which register to use instead of the one specified
in the reduced register specifier for whichever instructions in its
shadow have the bit set in the prefix.
You could have the prefix instruction supply the missing bits of all shortened register specifiers.
Worst case, this is the same as
my original proposal - one extra, not really executed, instruction
Which is why I use the term instruction-modifier.
(prefix versus register to register move) for one where you need to
use it, but this idea might, by allowing the prefix to specify
multiple instructions, save more than one extra "instruction". The
only downside is it requires an additional op code.
But by having an instruction-modifier that can add bits to several
succeeding instructions, you can avoid cluttering up the ISA with things
like ADC, SBC, IMULD, DDIV, ... So, in the end, you save OpCode
enumeration space rather than consuming it.
On 11/15/2023 11:02 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/11/2023 10:11 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length. So, for
example, allow
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the prefix
instruction specify which register to use instead of the one specified
in the reduced register specifier for whichever instructions in its
shadow have the bit set in the prefix.
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
I am not sure what you are proposing here. Can you show an example?
Worst case, this is the same as
my original proposal - one extra, not really executed, instruction
Which is why I use the term instruction-modifier.
Agreed.
(prefix versus register to register move) for one where you need to
use it, but this idea might, by allowing the prefix to specify
multiple instructions, save more than one extra "instruction". The
only downside is it requires an additional op code.
But by having an instruction-modifier that can add bits to several
succeeding instructions, you can avoid cluttering up the ISA with things
like ADC, SBC, IMULD, DDIV, ... So, in the end, you save OpCode
enumeration space rather than consuming it.
In the general case, I certainly agree. But here you need a different
op-code than CARRY, as this has different semantics; and I think the
new instruction modifier has no other use, hence it is an additional op
code, versus the original proposal of using what is essentially a
register-copy instruction, which already exists (i.e. a load with a
zero displacement and the source register as the address modifier).
Stephen Fuld wrote:
On 11/15/2023 11:02 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/11/2023 10:11 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the
instructions. So an alternative is to break the requirement that
all register specifier fields in the instruction be the same
length. So, for example, allow
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the
prefix instruction specify which register to use instead of the one
specified in the reduced register specifier for whichever
instructions in its shadow have the bit set in the prefix.
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
I am not sure what you are proposing here. Can you show an example?
Let us postulate a MoreBits instruction-modifier with a 16-bit
immediate field. Now each 16-bit instruction, that has access to only 8
registers, strips off 2 bits per specifier, so now all its register
specifiers are 5 bits. The immediate supplies the bits, and as bits are
stripped off, the Decoder shifts the field down by the consumed bits.
When the last bit has been stripped off you would need another MB
immediate to supply those bits. Since only 16-bit instructions are
"limited", one MB should last about a basic block or extended basic
block.
Note I don't care how the bits are apportioned, formatted, consumed, ...
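A hedged C sketch of that decode flow (the field layout, bit order, and
consumption order are my assumptions, per the "I don't care how the
bits are apportioned" note): the pending MoreBits payload widens each
3-bit specifier to 5 bits, shifting down as bits are consumed:

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t mb_imm;    /* remaining MoreBits payload */
  static int      mb_left;   /* bits not yet consumed      */

  /* widen one 3-bit specifier to 5 bits, eating 2 payload bits */
  static unsigned widen(unsigned spec3) {
      unsigned hi = 0;
      if (mb_left >= 2) {
          hi = mb_imm & 3;   /* take the next two bits        */
          mb_imm >>= 2;      /* Decoder shifts the field down */
          mb_left -= 2;
      }
      return (hi << 3) | (spec3 & 7);   /* 5-bit register number */
  }

  int main(void) {
      mb_imm = 0x2D31; mb_left = 16;    /* one MoreBits modifier */
      unsigned rd = widen(5);           /* specifiers from one   */
      unsigned rs = widen(2);           /* 16-bit instruction    */
      unsigned rt = widen(7);
      printf("r%u r%u r%u\n", rd, rs, rt);
      return 0;
  }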
On 11/15/2023 1:10 PM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/15/2023 11:02 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/11/2023 10:11 AM, MitchAlsup wrote:
Stephen Fuld wrote:
On 11/10/2023 10:24 AM, BGB wrote:
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the
instructions. So an alternative is to break the requirement that
all register specifier fields in the instruction be the same
length. So, for example, allow
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the
prefix instruction specify which register to use instead of the one
specified in the reduced register specifier for whichever
instructions in its shadow have the bit set in the prefix.
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
I am not sure what you are proposing here. Can you show an example?
Let us postulate a MoreBits instruction-modifier with a 16-bit
immediate field. Now each 16-bit instruction, that has access to only 8
registers, strips off 2 bits per specifier, so now all its register
specifiers are 5 bits.
The immediate supplies the bits and as bits are stripped off the Decoder
shifts the field down by the consumed bits. When the last bit has been
stripped off you would need another MB immediate to supply those bits.
Since only 16-bit instructions are "limited" one MB should last about a
basic block or extended basic block.
Note I don't care how the bits are apportioned, formatted, consumed, ...
Oh, so you have changed the meaning of the "immediate bit map" from specifying which of the following instructions it applies to (e.g.
CARRY) to the actual data. I like it!
If using 16 bit instructions, and if you only have one small register
field per instruction, I think it is better to make "MoreBits" a 16 bit instruction modifier itself, with say a five bit op code and an eleven
bit immediate, which supplies the extra bit for the next 11
instructions. More compact than a 32 bit instruction, and almost as
"far reaching". If you need more than 11 bits, even if you add a second
MB instruction modifier 11 instructions later, you are still no worse
off than an instruction modifier plus a 16 bit immediate.
Of course, if you need more than one extra bit per instruction, then
more "drastic" measures, such as your proposal, are needed.
On 11/20/2023 11:31 AM, Stephen Fuld wrote:
On 11/15/2023 1:10 PM, MitchAlsup wrote:
For some ops, the 3rd register (Ro) would instead operate as a 5-bit immediate/displacement field. Which was initially a similar idea, with
the 32-bit space mirroring the 16-bit space.
Where, say:
Thread: Logical thread of execution within some existing process;
Process: Distinct collection of 1 or more threads within a shared
address space and shared process identity (may have its own address
space, though as-of-yet, TestKern uses a shared global address space);
Task: Supergroup that includes Threads, Processes, and other thread-like
entities (such as call and method handlers), may be either thread-like
or process-like.
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
Though, will probably need to add more special case handling such that
the Syscall task can not yield or try to itself make a syscall (the only valid exit point for this task being where it transfers control back to
the caller and awaits the next syscall to arrive; and it is not valid
for this task to try to syscall back into itself).
BGB wrote:
On 11/20/2023 11:31 AM, Stephen Fuld wrote:
On 11/15/2023 1:10 PM, MitchAlsup wrote:
For some ops, the 3rd register (Ro) would instead operate as a 5-bit
immediate/displacement field. Which was initially a similar idea, with
the 32-bit space mirroring the 16-bit space.
Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit
register
specifier into a 5-bit immediate of either positive or negative integer value. This makes::
1<<n
~0<<n
container.bitfield = 7;
single instructions.
Where, say:
Thread: Logical thread of execution within some existing process;
Has a register file and a stack.
Process: Distinct collection of 1 or more threads within a shared
address space and shared process identity (may have its own address
space, though as-of-yet, TestKern uses a shared global address space);
Has a memory map, a heap, and a vector of threads.
Task: Supergroup that includes Threads, Processes, and other
thread-like entities (such as call and method handlers), may be either
thread-like or process-like.
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
We call these things:: dispatchers.
Though, will probably need to add more special case handling such that
the Syscall task can not yield or try to itself make a syscall (the
only valid exit point for this task being where it transfers control
back to the caller and awaits the next syscall to arrive; and it is
not valid for this task to try to syscall back into itself).
In My 66000, every <effective> SysCall goes deeper into the privilege hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV, Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
On 11/21/2023 4:12 PM, MitchAlsup wrote:
BGB wrote:
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
We call these things:: dispatchers.
Yeah.
As-is, I have several major interrupt handlers:
Fault: Something has gone wrong, current handling is to stall the CPU
until reset (and/or terminate the emulator). Could in principle do other things.
IRQ: Deals with timer, may potentially be used for preemptive task
scheduling (code is in place, but this is not currently enabled). Does
not currently perform any other "complex" actions (and the "practical"
use of IRQ's remains limited in my case, due in large part to the
limitations of interrupt handling).
TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
action if a "page fault" style event occurs (or something needs to be
paged in/paged out from the swapfile).
SYSCALL: Mostly initiates task switches and similar, and little else.
Unlike x86, the design of the interrupt mechanisms means it isn't
practical to hang the whole OS off of an interrupt handler. The closest option is mostly to use the interrupt handlers to trigger context
switches (which is, ironically, slightly less of an issue, as many of
the "hard" parts of a context switch are already performed for sake of dealing with the "rather minimalist" interrupt mechanism).
Basically, in this design, it isn't possible to enter a new interrupt
without first returning from the prior interrupt (at least not without
f*ing the CPU state). And, as-is, interrupts can only operate in
physically addressed mode.
They also need to manually save and restore all the registers, since
unlike either SuperH or RISC-V, BJX2 does not have any banked registers (apart from SP/SSP, which switch places when entering/leaving an ISR).
Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
8086 real-mode, it doesn't implicitly push anything to the stack (nor
have an "interrupt vector table").
So, the interrupt handling is basically a computed branch; which was basically about the cheapest mechanism I could come up with at the time.
Did create a little bit of a puzzle initially as to how to get the CPU
state saved off and restored with no free registers. Though, there are a
few CR's which capture the CPU state at the time the ISR happens (these registers getting overwritten every time a new interrupt occurs).
So, say:
Interrupt entry:
Copy low bits of SR into high bits of EXSR;
Copy PC into SPC.
Copy fault address into TEA;
Swap SP and SSP (*1);
Set CPU flags to Supervisor+ISR mode;
CPU Mode bits now copied from high bits of VBR.
Computed branch relative to VBR.
Offset depends on interrupt category.
Interrupt return (RTE):
Copy EXSR bits back into SR;
Unswap SP/SSP (*1);
Branch to SPC.
*1: At the time, couldn't figure a good way to shave more logic off the
mechanism. Though, now, the most obvious candidate would be to
eliminate the implicit SP/SSP swapping (this part is currently handled
in the instruction decoder).
So, instead, the ISR entry point would do something like:
MOV SP, SSP
MOV 0xDE00, SP //Designated ISR stack SRAM
MOV.Q R0, (SP, 0)
MOV.Q R1, (SP, 8)
... Now save off everything else ...
But, didn't really think of it at the time.
There is already the trick of requiring VBR to be aligned (currently 64B
in practice; formally 256B), mostly so as to allow the "address
computation" to be done via bit-slicing.
Not sure if many CPUs have a cheaper mechanism here...
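In sketch form (field widths beyond the stated 8-byte spacing and 256B
formal alignment are assumptions), the "address computation" is pure
wiring, with no adder:

  #include <stdint.h>

  /* entry = VBR with the interrupt category spliced into bits 7..3 */
  static uint64_t isr_entry(uint64_t vbr, unsigned category) {
      return (vbr & ~0xFFull) | (((uint64_t)category & 0x1Fu) << 3);
  }

  int main(void) {
      return isr_entry(0x10000, 3) == 0x10018 ? 0 : 1;
  }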
Note that in my case, generally the interrupt handlers are written in C,
with the compiler managing all the ISR prolog/epilog stuff (mostly saving/restoring pretty much the entire CPU state to the ISR stack).
Generally, the ISR's also need to deal with having a comparably small
stack (with 0.75K already used for the saved CPU state).
Where:
0000..7FFF: Boot ROM
8000..BFFF: (Optional) Extended Boot ROM
C000..DFFF: Boot/ISR SRAM
E000..FFFF: (Optional) Extended SRAM
Generally, much of the work of the context switch is pulled off using "memcpy" calls (with the compiler providing a special "__arch_regsave" variable giving the address of the location it has dumped the CPU
registers into; which in turn covers most of the core state that needs
to be saved/restored for a process context switch).
On 2023-11-21 5:12 p.m., MitchAlsup wrote:
In My 66000, every <effective> SysCall goes deeper into the privilege
hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
Does it follow the same way for hardware interrupts? I think RISCV goes
to the deepest level first, machine level, then redirects to lower
levels as needed. I was planning on Q+ operating the same way.
Why not just treat the RF as a cache with a known address in physical
memory? In My 66000 that is what I do, and then just push and pull 4
cache lines at a time.
Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
it's not also a pun on the TI 9900.
Stefan
BGB wrote:
On 11/21/2023 4:12 PM, MitchAlsup wrote:
BGB wrote:
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to
context switch back to the task that made the request, or to yield
to another task, ...).
We call these things:: dispatchers.
Yeah.
As-is, I have several major interrupt handlers:
Fault: Something has gone wrong, current handling is to stall the CPU
until reset (and/or terminate the emulator). Could in principle do other
things.
I call these checks:: a page fault is an unanticipated SysCall to the
Guest OS page fault handler; whereas a check is something that should
never happen but did (ECC repair fail): These trap to Real HV.
IRQ: Deals with timer, may potentially be used for preemptive task
scheduling (code is in place, but this is not currently enabled). Does
not currently perform any other "complex" actions (and the "practical"
use of IRQ's remains limited in my case, due in large part to the
limitations of interrupt handling).
Every My 66000 process has its own event table which combines
exceptions, interrupts, SysCalls, ... This means there is no table
surgery when switching between Guest OS and Guest Hypervisor and Real
Hypervisor.
TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
action if a "page fault" style event occurs (or something needs to be
paged in/paged out from the swapfile).
HW table walking.
SYSCALL: Mostly initiates task switches and similar, and little else.
Part of Event table.
Unlike x86, the design of the interrupt mechanisms means it isn't
practical to hang the whole OS off of an interrupt handler. The
closest option is mostly to use the interrupt handlers to trigger
context switches (which is, ironically, slightly less of an issue, as
many of the "hard" parts of a context switch are already performed for
sake of dealing with the "rather minimalist" interrupt mechanism).
My 66000 can perform a context switch (user->user) in a single instruction.
Old state goes to memory, new state comes from memory; by the time
state has arrived, you are fetching instructions in the new context
under the new context MMU tables and privileges and priorities.
Basically, in this design, it isn't possible to enter a new interrupt
without first returning from the prior interrupt (at least not without
f*ing the CPU state). And, as-is, interrupts can only operate in
physically addressed mode.
They also need to manually save and restore all the registers, since
unlike either SuperH or RISC-V, BJX2 does not have any banked
registers (apart from SP/SSP, which switch places when
entering/leaving an ISR).
Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
8086 real-mode, it doesn't implicitly push anything to the stack (nor
have an "interrupt vector table").
So, the interrupt handling is basically a computed branch; which was
basically about the cheapest mechanism I could come up with at the time.
Did create a little bit of a puzzle initially as to how to get the CPU
state saved off and restored with no free registers. Though, there are
a few CR's which capture the CPU state at the time the ISR happens
(these registers getting overwritten every time a new interrupt occurs).
Why not just treat the RF as a cache with a known address in physical
memory. In MY 66000 that is what I do and then just push and pull 4
cache lines at a time.
So, say:
Interrupt entry:
Copy low bits of SR into high bits of EXSR;
Copy PC into SPC.
Copy fault address into TEA;
Swap SP and SSP (*1);
Set CPU flags to Supervisor+ISR mode;
CPU Mode bits now copied from high bits of VBR.
Computed branch relative to VBR.
Offset depends on interrupt category.
Interrupt return (RTE):
Copy EXSR bits back into SR;
Unswap SP/SSP (*1);
Branch to SPC.
Interrupt Entry Point::
// by this point all the old registers have been saved where they
// are supposed to go, and the interrupt dispatcher registers are
// already loaded up and ready to go, and the CPU is running at
// whatever privilege level was specified.
HR R1<-WHY
LD IP,[IP,R1<<3,InterruptVectorTable] // Call through table
RTI
//
InterruptHandler0:
// do what is necessary
// note this can all be written in C
RET
InterruptHandler1::
*1: At the time, couldn't figure a good way to shave more logic off
the mechanism. Though, now, the most obvious candidate would be to
eliminate the implicit SP/SSP swapping (this part is currently handled
in the instruction decoder).
So, instead, the ISR entry point would do something like:
MOV SP, SSP
MOV 0xDE00, SP //Designated ISR stack SRAM
MOV.Q R0, (SP, 0)
MOV.Q R1, (SP, 8)
... Now save off everything else ...
But, didn't really think of it at the time.
There is already the trick of requiring VBR to be aligned (currently
64B in practice; formally 256B), mostly so as to allow the "address
computation" to be done via bit-slicing.
Not sure if many CPUs have a cheaper mechanism here...
Treat the CPU state and the register state as cache lines and have
HW shuffle them in and out. You can even start the 5 cache line reads
before you start the CPU state writes; saving latency (which you
cannot do using SW-only methods).
Note that in my case, generally the interrupt handlers are written in
C, with the compiler managing all the ISR prolog/epilog stuff (mostly
saving/restoring pretty much the entire CPU state to the ISR stack).
My 66000 compiler remains blissfully ignorant of ISR prologue and
epilogue and it still works.
Generally, the ISR's also need to deal with having a comparably small
stack (with 0.75K already used for the saved CPU state).
Where:
0000..7FFF: Boot ROM
8000..BFFF: (Optional) Extended Boot ROM
C000..DFFF: Boot/ISR SRAM
E000..FFFF: (Optional) Extended SRAM
Generally, much of the work of the context switch is pulled off using
"memcpy" calls (with the compiler providing a special "__arch_regsave"
variable giving the address of the location it has dumped the CPU
registers into; which in turn covers most of the core state that needs
to be saved/restored for a process context switch).
Why not just make the HW push and pull cache lines.
On 11/22/2023 12:38 PM, MitchAlsup wrote:
BGB wrote:
Yeah, but that is not exactly minimalist in terms of the hardware.
Granted, burning around 1 kilocycle of overhead per syscall isn't ideal either...
Eg:
Save registers to ISR stack;
Copy registers to User context;
Copy handler-task registers to ISR stack;
Reload registers from ISR stack;
Handle the syscall;
Save registers to ISR stack;
Copy registers to Syscall context;
Copy User registers to ISR stack;
Reload registers from ISR stack.
Does mean that one needs to be economical with syscalls (say, doing
"printf" a whole line at a time, rather than individual characters, ...).
And, did create incentive to allow getting the microsecond-clock value
and hardware RNG values from CPUID rather than needing a syscall (say,
don't want to burn 20us to check the microsecond counter, ...).
If the "memcpy's" could be eliminated, this could roughly halve the cost
of doing a syscall.
One other option would be to do like RISC-V's privileged spec and have multiple copies of the register file (and likely instructions for
accessing these alternate register files).
Worth the cost? Dunno.
Not too much different to modern Windows, where slow syscalls are still fairly common (and despite the slowness of the mechanism, it seems like
BJX2 syscalls still manage to be around an order of magnitude faster than Windows syscalls in terms of clock-cycle cost...).
Why not just treat the RF as a cache with a known address in physical
memory. In MY 66000 that is what I do and then just push and pull 4
cache lines at a time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
Though, it could make sense if a context switch had a way to dump the
whole register file to Block-RAM, and some sort of mechanism to access
this RAM via an MMIO interface.
Pros/cons, seems like each possibility would also come with drawbacks:
As-is: Slowness due to needing to save/reload everything;
RISC-V: Expensive regfile, only works for limited cases;
MMIO Backed + RV-like: Faster U<->S, but slower task switching.
RAM Backed: Cache coherence becomes a critical feature.
The RISC-V like approach makes sense if one assumes:
There is a user process;
There is a kernel running under it;
We want to call from the user process into the kernel.
Doesn't make so much sense, say, for:
User Process A calls a VTable entry which calls into User Process B;
Service A uses a VTable to call into the VFS;
...
Say, where one is making use of horizontal context switches for control
flow between logical tasks. Which would still remain fairly expensive
under a RISC-V like model.
One could have enough register banks for N logical tasks, but supporting
4 or 8 copies of the register file is going to cost more than 2 or 3.
Above, I was describing what the hardware was doing.
The software side is basically more like:
Branch from VBR-table to ISR entry point;
Get R0 and R1 saved onto the stack;
Get some of the CRs saved off (we need R0 and R1 free here);
Get the rest of the GPRs saved onto the stack;
Call into the main part of the ISR handler (using normal C ABI);
Restore most of the GPRs;
Restore most of the CRs;
Restore R0 and R1;
Do an RTE.
Robert Finch wrote:
On 2023-11-21 5:12 p.m., MitchAlsup wrote:
In My 66000, every <effective> SysCall goes deeper into the privilege
hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
Does it follow the same way for hardware interrupts? I think RISCV goes
to the deepest level first, machine level, then redirects to lower
levels as needed. I was planning on Q+ operating the same way.
It depends, there is the school of thought that just delivers control
to someone who can always deal with it (Machine level in RISC-V) and
there is the other school of thought that some table should encode
which level of the system control is delivered to. The former allows
SW to control every step of the process, the latter gets rid of all
the SW checking and simplifies the process of getting to and back from
interrupt handlers (and their associated soft IRQs).
mitchalsup@aol.com (MitchAlsup) writes:
Stefan Monnier wrote:
I have a Linux friendly version where context switch is a single instruction.
The Burroughs B3500 had a single such instruction, called
Branch Reinstate (BRE).
Stefan Monnier wrote:
I have a Linux friendly version where context switch is a single instruction.
The 4 privilege levels, each, have a pointer to those 5 cache
lines. By writing the control register (HR instruction) one
can change the control point for each level (of course you
have to have appropriate permission--but I decided that a
user should have the ability to context switch to another
user without needing OS intervention--thus pthreads do not
need an excursion through the Guest OS to switch threads
under the same memory map {but do when crossing processes}).
BGB wrote:
On 11/22/2023 12:38 PM, MitchAlsup wrote:
BGB wrote:
Yeah, but that is not exactly minimalist in terms of the hardware.
Granted, burning around 1 kilocycle of overhead per syscall isn't
ideal either...
Eg:
Save registers to ISR stack;
Copy registers to User context;
Copy handler-task registers to ISR stack;
Reload registers from ISR stack;
Handle the syscall;
Save registers to ISR stack;
Copy registers to Syscall context;
Copy User registers to ISR stack;
Reload registers from ISR stack.
Does mean that one needs to be economical with syscalls (say, doing
"printf" a whole line at a time, rather than individual characters, ...).
Not at all--I have reduced SysCalls to just a bit slower than an actual
CALL, say around 10 cycles. Use them as often as you like.
And, did create incentive to allow getting the microsecond-clock value
and hardware RNG values from CPUID rather than needing a syscall (say,
don't want to burn 20us to check the microsecond counter, ...).
If the "memcpy's" could be eliminated, this could roughly halve the
cost of doing a syscall.
I have MM (memory move) as a 3-operand instruction.
One other option would be to do like RISC-V's privileged spec and have
multiple copies of the register file (and likely instructions for
accessing these alternate register files).
There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache lines; There is a 5th cache line that contains all the other PSW stuff.
Worth the cost? Dunno.
In my opinion--Absolutely worth it.
Not too much different to modern Windows, where slow syscalls are
still fairly common (and despite the slowness of the mechanism, it
seems like BJX2 syscalls still manage to be around an order of
magnitude faster than Windows syscalls in terms of clock-cycle cost...).
Now, just get it down to a cache missing {L1, L2} instruction fetch.
Why not just treat the RF as a cache with a known address in physical
memory.
In MY 66000 that is what I do and then just push and pull 4 cache
lines at a
time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
Though, could make sense if one has a mechanism where a context switch
could have a mechanism to dump the whole register file to Block-RAM,
and some sort of mechanism to access this RAM via an MMIO interface.
Just put it in DRAM at SW controlled (via TLB) addresses.
Pros/cons, seems like each possibility would also come with drawbacks:
As-is: Slowness due to needing to save/reload everything;
RISC-V: Expensive regfile, only works for limited cases;
MMIO Backed + RV-like: Faster U<->S, but slower task switching.
RAM Backed: Cache coherence becomes a critical feature.
The RISC-V like approach makes sense if one assumes:
There is a user process;
There is a kernel running under it;
We want to call from the user process into the kernel.
So if you are running under a Real OS you don't need 2 sets of RFs in my model.
Doesn't make so much sense, say, for:
User Process A calls a VTable entry which calls into User Process B;
Service A uses a VTable to call into the VFS;
...
Say, where one is making use of horizontal context switches for
control flow between logical tasks. Which would still remain fairly
expensive under a RISC-V like model.
Yes, but PTHREADing can be done without privilege and in a single instruction.
One could have enough register banks for N logical tasks, but
supporting 4 or 8 copies of the register file is going to cost more
than 2 or 3.
Above, I was describing what the hardware was doing.
The software side is basically more like:
Branch from VBR-table to ISR entry point;
Get R0 and R1 saved onto the stack;
Where did you get the address of this stack ??
Get some of the CRs saved off (we need R0 and R1 free here);
Get the rest of the GPRs saved onto the stack;
Call into the main part of the ISR handler (using normal C ABI);
Restore most of the GPRs;
Restore most of the CRs;
Restore R0 and R1;
Do an RTE.
If HW does register file save/restore the above looks like::
The software side is basically more like:
Branch from VBR-table to ISR entry point;
Call into the main part of the ISR handler (using normal C ABI);
Do an RTE.
See what it saves ??
On 11/23/2023 10:53 AM, MitchAlsup wrote:
BGB wrote:
If the "memcpy's" could be eliminated, this could roughly halve the
cost of doing a syscall.
I have MM (memory move) as a 3-operand instruction.
None in my case...
But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
Still might be better to not do a memcpy in these cases.
Say, if the ISR handler could "merely" reassign the TBR register to
switch from one task to another to perform the context switch (still
ignoring all the loads/stores hidden in the prolog and epilog).
One other option would be to do like RISC-V's privileged spec and have
multiple copies of the register file (and likely instructions for
accessing these alternate register files).
There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache
lines; There is a 5th cache line that contains all the other PSW stuff.
No direct equivalent.
I was thinking sort of like the RISC-V Privileged spec: there are
User/Supervisor/Machine sets, with the mode affecting which of these is
visible.
Obvious drawback in my case is that this would effectively increase the number of internal GPRs from 64 to 192 (and, at that point, may as well
go to 4 copies and have 256).
If this were handled in the decoder, this would mean roughly a 9-bit
register selector field (vs the current 7 bits).
The increase in the number of CRs could be less, since only a few of
them actually need duplication.
But, don't want to go this way, and it would only be a partial solution
that also does not map up well to my current implementation.
Not sure how an OS on SH-4 would have managed all this, but I suspect
their interrupt model would have had similar limitations to mine.
Major differences:
SH-4 banked out R0..R7 when entering an interrupt;
The VBR relative entry-point offsets were a bit, ad-hoc.
There were some fairly arbitrary displacements based on the type of interrupt. Almost like they designed their interrupt mechanism around a particular chunk of ASM code or something. In my case, I kept a similar
idea, but just used a fixed 8-byte spacing, with the idea of these spots branching to the actual entry point.
Though, one other difference is in my case I ended up adding a dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
have gone to the FAULT handler instead.
It is in-theory possible to jump from Interrupt Mode to normal
Supervisor Mode without a full context switch,
but the specifics of
doing so would get a bit more hairy and arcane (which is sort of why I
just sorta ended up using a context switch).
Not sure what Linux on SH-4 had done, didn't really investigate this
part of the code all that much at the time.
In theory, the ISR handlers could be made to mimic the x86 TSS
mechanism, but this wouldn't gain much.
I think at one point, I had considered having tasks have both User and Supervisor state (with two stacks and two copies of all the registers),
but ended up not going this way (and instead giving the syscalls their
own designated task context; which also saves on per-task memory
overhead).
Worth the cost? Dunno.
In my opinion--Absolutely worth it.
Not too much different to modern Windows, where slow syscalls are
still fairly common (and despite the slowness of the mechanism, it
seems like BJX2 sycalls still manage to be around an order of
magnitude faster than Windows syscalls in terms of clock-cycle cost...).
Now, just get it down to a cache missing {L1, L2} instruction fetch.
Looked into it a little more, realized that "an order of magnitude" may
have actually been a little conservative; seems like Windows syscalls
may be more in the area of 50-100k cycles.
Why exactly? Dunno.
This is still ignoring some of the "slow cases" which may take millions
of clock cycles.
It also seems like fast-ish syscalls may be more of a Linux thing.
Why not just treat the RF as a cache with a known address in physical
memory.
In MY 66000 that is what I do and then just push and pull 4 cache
lines at a
time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
OK.
Having only 1 set of registers is good...
Issue is the mechanism for how to get all the contents in/out of the
register file, in a way that is both cost effective, and faster than
using a series of Load/Store instructions would have otherwise been.
Short of a pipeline redesign, it is unlikely to exceed a best case of
around 128 bits per clock cycle, with (in practice) there typically
being other penalties due to things like L1 misses and similar.
One bit of trickery would be, "what if" the Boot SRAM region were inside
the L1 cache rather than out on the ringbus?...
But, then one would have the cost of keeping 8K of SRAM close to the CPU
core that is mostly only ever used during interrupt handling (but,
probably still cheaper than making the register file 3x bigger, in any case...).
Though keeping it tied to a specific CPU core (and effectively processor local) would avoid the ugly "what if" scenario of two CPU cores trying
to service an interrupt at the same time and potentially stepping on
each other's stacks. The main tradeoff vs putting the stacks in DRAM is
mostly that DRAM may have (comparably more expensive) L2 misses.
Would add a potential "wonk" factor though, if this SRAM region were
only visible for D$ access, but inaccessible from the I$. But, I guess
one can argue, there isn't really a valid reason to try to run code from
the ISR stack or similar.
Though, could make sense if one has a mechanism where a context switch
could have a mechanism to dump the whole register file to Block-RAM,
and some sort of mechanism to access this RAM via an MMIO interface.
Just put it in DRAM at SW controlled (via TLB) addresses.
Possibly.
It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
stuff could be baked into hardware... But, I don't want to go this route (baking parts of it into the C ABI is at least "slightly" less evil).
Also possible could be to add another CR for "Dump context registers
here", this adds the costs of another CR though.
I guess I can probably safely rule out MMIO on the basis that context
switching via moving registers through MMIO would be slower than the
current mechanism (of using a series of Load/Store instructions)...
Yes, but PTHREADing can be done without privilege and in a single
instruction.
OK.
Luckily, a thread-switch only needs to go 1-way, reducing it to around
500 cycles as-is in my case.
Theoretical minimum would be around 150-200 cycles, with most of the
savings based on eliminating around 1.5kB worth of "memcpy()"...
This need not involve an ISA change, could in theory be done by making
the SYSCALL ISR mandate that TBR be valid (and the associated compiler changes, likely the main issue here).
Well, nevermind any cost of locating the next thread, but at the moment,
I am using a fairly simplistic round-robin scheduling strategy, so the scheduler mostly starts at a given PID, and looks for the next PID that
holds a valid/running task (wrapping back to PID 1 if it hits the end,
and stopping the search if it gets back to the original PID).
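In sketch form (names and table size here are illustrative, not
TestKern's actual structures), that search is just:

  #include <stdbool.h>

  #define MAX_PID 64
  static bool runnable[MAX_PID];   /* stand-in for valid/running tasks */

  /* scan forward from cur, wrapping back to PID 1 at the end, and
     keep the current task if the search comes back around */
  static int next_task(int cur) {
      int pid = cur;
      do {
          pid = (pid + 1 < MAX_PID) ? pid + 1 : 1;
          if (runnable[pid])
              return pid;
      } while (pid != cur);
      return cur;
  }

  int main(void) {
      runnable[5] = true;
      return next_task(2) == 5 ? 0 : 1;
  }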
The high-level threading model wasn't based on pthreads in my case, but rather C11 threads (and had implemented a lot of the "threads.h" stuff).
One could potentially mimic pthreads on top of C11 threads though.
At the moment, I forgot why I decided to go with C11 threads over
pthreads, but IIRC I think I had felt at the time like C11 threads were
a better fit.
One could have enough register banks for N logical tasks, but
supporting 4 or 8 copies of the register file is going to cost more
than 2 or 3.
Above, I was describing what the hardware was doing.
The software side is basically more like:
Branch from VBR-table to ISR entry point;
Get R0 and R1 saved onto the stack;
Where did you get the address of this stack ??
SP and SSP swap places on interrupt entry (currently by renumbering the registers in the instruction decoder).
SSP is initialized early on to the SRAM stack, so when an interrupt
happens, the 'SP' register automatically becomes the SRAM stack.
Essentially, both SP and SSP are SPRs, but:
SP is mapped into R15 in the GPR space;
SSP is mapped into the CR space.
So, when executing an ISR, it is effectively using SSP as its SP.
If I were to eliminate this implicit register-swap mechanism, then the ISR
entry would likely need to reload a constant address each time. Though,
this change would also break binary compatibility with my existing code.
But, in theory, eliminating the register swap could allow demoting SP to being a normal GPR.
Also, things like renumbering parts of the register space based on CPU
mode are expensive.
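As a sketch of what that involves (the SSP physical index here is an
assumption; BJX2's actual numbering is not reproduced): every specifier
passes through a mode-dependent map like this, which costs extra LUTs
on a timing-sensitive path:

  /* architectural R15 is SP; in ISR mode the decoder swaps it with
     the SSP slot (assumed physical index 64 here) */
  enum { PHYS_SSP = 64 };

  static unsigned map_reg(unsigned r, int isr_mode) {
      if (!isr_mode)     return r;
      if (r == 15)       return PHYS_SSP;   /* SP accesses hit SSP */
      if (r == PHYS_SSP) return 15;         /* and vice versa      */
      return r;
  }

  int main(void) {
      return map_reg(15, 1) == PHYS_SSP ? 0 : 1;
  }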
Though, some of my more recent design ideas would have gone over to an ordering slightly more like RISC-V, say:
R0: ZR or PC (ALU or MEM)
R1: LR or TBR (ALU or MEM)
R2: SP
R3: GP (GBR)
R4 -R15: Scratch
R16-R31: Callee Save
R32-R47: Scratch
R48-R63: Callee Save
Would likely not adopt RISC-V's C ABI though.
Though, if one assumes R4..R63 are GPRs, this would allow both this ISA
and RISC-V to still use the same register numbering.
This is already fairly close to the register numbering scheme used in
XG2RV, though the assumption was that XG2RV would have used RV's ABI,
but this was stalled out mostly due to compiler issues (getting BGBCC to
be able to follow RISC-V's C ABI rules would be a non-trivial level of effort; but is rendered moot if one still needs to use call thunking).
The interpretation for R0 and R1 would depend on how they are used:
ALU or similar: ZR and LR (Zero and Link Register)
Load/Store Base: PC and TBR.
Idea being that in userland, TBR effectively still exists as a Read-Only register (allowing userland to modify TBR would effectively also allow userland to wreck the OS).
Thing is mostly that needing to renumber registers in the decoder based
on CPU mode isn't entirely free in terms of LUT cost or timing latency
(even if it only applies to a subset of the register space).
Note that for RV decoding:
X0..X31 -> R0 ..R31 (more or less)
F0..F31 -> R32..R63
But, RV's FPU instructions don't match up exactly 1:1, and some cases
would have semantic differences.
Though, it seems like most RV code could likely tolerate some deviation
in some areas (will it care that the high 32 bits of a Binary32 register don't hold NaN? Will it care about the extra funkiness going on in LR? ...).
Get some of the CRs saved off (we need R0 and R1 free here);
Get the rest of the GPRs saved onto the stack;
Call into the main part of the ISR handler (using normal C ABI);
Restore most of the GPRs;
Restore most of the CRs;
Restore R0 and R1;
Do an RTE.
If HW does register file save/restore the above looks like::
The software side is basically more like:
Branch from VBR-table to ISR entry point;
Call into the main part of the ISR handler (using normal C ABI);
Do an RTE.
See what it saves ??
This is fewer instructions.
But, at a hardware cost.
As-is, I can't come up with much that is both:
Fairly cheap to implement in hardware;
Would save a lot of clock-cycles over software-based options.
As noted, the former is also why I had thus far mostly rejected the
RISC-V strategy (*).
*: Ironically, despite RISC-V having fewer GPRs, to implement the
Privileged spec, RISC-V would still end up needing a somewhat bigger
register file... Nevermind what exactly is going on with CSRs...
BGB wrote:
On 11/23/2023 10:53 AM, MitchAlsup wrote:
BGB wrote:
If the "memcpy's" could be eliminated, this could roughly halve the
cost of doing a syscall.
I have MM (memory move) as a 3-operand instruction.
None in my case...
But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
Still might be better to not do a memcpy in these cases.
Say, if the ISR handler could "merely" reassign the TBR register to
switch from one task to another to perform the context switch (still
ignoring all the loads/stores hidden in the prolog and epilog).
One other option would be to do like RISC-V's privileged spec and
have multiple copies of the register file (and likely instructions
for accessing these alternate register files).
There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache
lines;
There is a 5th cache line that contains all the other PSW stuff.
No direct equivalent.
I was thinking sort of like the RISC-V Privileged spec, there are
User/Supervisor/Machine sets, with the mode affecting which of these
is visible.
Obvious drawback in my case is that this would effectively increase
the number of internal GPRs from 64 to 192 (and, at that point, may as
well go to 4 copies and have 256).
If this were handled in the decoder, this would mean roughly a 9-bit
register selector field (vs the current 7 bits).
Decode is not the problem, sensing 1:256 is a big problem; in practice
even SRAMs only have 32 pairs of cells on a bit line using exotic timed
sense amps.
{{Decode is almost NEVER the logic delay problem:: ½ is situation
recognition, the other ½ is fan-out buffering--driving the lines into
the decoder is more gates of delay than determining if a given select
line should be asserted.}}
The increase in the number of CRs could be less, since only a few of
them actually need duplication.
But, don't want to go this way, and it would only be a partial
solution that also does not map up well to my current implementation.
Not sure how an OS on SH-4 would have managed all this, but I suspect
their interrupt model would have had similar limitations to mine.
Major differences:
SH-4 banked out R0..R7 when entering an interrupt;
The VBR relative entry-point offsets were a bit, ad-hoc.
There were some fairly arbitrary displacements based on the type of
interrupt. Almost like they designed their interrupt mechanism around
a particular chunk of ASM code or something. In my case, I kept a
similar idea, but just used a fixed 8-byte spacing, with the idea of
these spots branching to the actual entry point.
Though, one other difference is in my case I ended up adding a
dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
which would have gone to the FAULT handler instead.
It is in-theory possible to jump from Interrupt Mode to normal
Supervisor Mode without a full context switch,
but why ?? the probability that control returns from a given IST to its softIRQ is less than ½ in a loaded system.
but the specifics of
doing so would get a bit more hairy and arcane (which is sort of why I
just sorta ended up using a context switch).
Not sure what Linux on SH-4 had done, didn't really investigate this
part of the code all that much at the time.
In theory, the ISR handlers could be made to mimic the x86 TSS
mechanism, but this wouldn't gain much.
Stay away from anything you see in x86, except in using it as a moniker
of what to avoid.
I think at one point, I had considered having tasks have both User and
Supervisor state (with two stacks and two copies of all the
registers), but ended up not going this way (and instead giving the
syscalls their own designated task context, which also saves on
per-task memory overhead).
Worth the cost? Dunno.
In my opinion--Absolutely worth it.
Not too much different to modern Windows, where slow syscalls are
still fairly common (and despite the slowness of the mechanism, it
seems like BJX2 syscalls still manage to be around an order of
magnitude faster than Windows syscalls in terms of clock-cycle
cost...).
Now, just get it down to a cache missing {L1, L2} instruction fetch.
Looked into it a little more, realized that "an order of magnitude"
may have actually been a little conservative; seems like Windows
syscalls may be more in the area of 50-100k cycles.
Why exactly? Dunno.
This is still ignoring some of the "slow cases" which may take
millions of clock cycles.
It also seems like fast-ish syscalls may be more of a Linux thing.
Why not just treat the RF as a cache with a known address in
physical memory ?
In MY 66000 that is what I do, and then just push and pull 4 cache
lines at a time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
OK.
Having only 1 set of registers is good...
Issue is the mechanism for how to get all the contents in/out of the
register file, in a way that is both cost effective, and faster than
using a series of Load/Store instructions would have otherwise been.
6R6W RFs are as big as one can practically build. You can get as much
Read BW by duplication, but you only have "so much" Write BW (even when
you know each write is to a different register).
Short of a pipeline redesign, it is unlikely to exceed a best case of
around 128 bits per clock cycle, with (in practice) there typically
being other penalties due to things like L1 misses and similar.
6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.
One bit of trickery would be, "what if" the Boot SRAM region were
inside the L1 cache rather than out on the ringbus?...
2 things::
a) By giving threadstate an address you gain the ability to load the
initial RF image from ROM as the CPU comes out of reset--it comes out
with a complete RF, a complete thread.header, mapping tables, privilege
and priority.
b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
state (no underlying DRAM address available), so you have ~1MB to play
around with until you find DRAM, configure, initialize, and put it in
the free pool.
So, here, you HAVE "enough" storage to program BOOT activities in a HLL
(of your choice).
But, then one would have the cost of keeping 8K of SRAM close to the
CPU core that is mostly only ever used during interrupt handling (but,
probably still cheaper than making the register file 3x bigger, in any
case...).
Are the Icache and Dcache not close enough ?? If not, then add L2 !!
Though keeping it tied to a specific CPU core (and effectively
processor local) would avoid the ugly "what if" scenario of two CPU
cores trying to service an interrupt at the same time and potentially
stepping on each others' stacks. The main tradeoff vs putting the
stacks in DRAM is mostly that DRAM may have (comparably more
expensive) L2 misses.
The interrupt (re)mapping table takes care of this prior to the CPU
being bothered. A {CPU or device} sends an interrupt to the Interrupt
mapping table associated with the "Originating" thread (I/O-MMU). That
interrupt is logged into the table and, if enabled, its priority is used
to determine which set of CPUs should be bothered; the affinity mask of
the "Originating" thread is used to qualify which CPU from the priority
set, and one of these is selected. The selected CPU is tapped on the
shoulder, and sends a get-Interrupt request to the Interrupt table
logic, which sends back the priority and number of a pending interrupt.
If the CPU is still at lower priority than the returning interrupt, the
CPU <at this point> stops running code from the old thread and begins
running code on the new thread.
{{During the sending of the interrupt to the CPU and the receipt of the
claim-Interrupt message, that interrupt will not get handed to any other
CPU.}} So, the CPU continues to run instructions while the CPUs contend
for and claim unique interrupts. There are 512 unique interrupts at each
of 64 priority levels, and each process can have its own Interrupt
Table. These tables need no maintenance except when interrupts are
created and destroyed.
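As a rough C sketch of the table shape that description implies (the
layout and the placement of the affinity mask are my guesses, not
MY 66000 definitions):

    #include <stdint.h>

    typedef struct {
        uint64_t pending[512 / 64];   /* one bit per interrupt number */
    } prio_level;

    typedef struct {
        prio_level level[64];   /* 512 interrupts at each of 64       */
                                /* priority levels                    */
        uint64_t   affinity;    /* which CPUs may claim from here     */
    } interrupt_table;          /* one per process / Guest OS / HV    */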
HV, Guest HV, and Guest OS each have their own unique interrupt tables,
although it could be arranged such that all use the same table.
Would add a potential "wonk" factor though, if this SRAM region were
only visible for D$ access, but inaccessible from the I$. But, I guess
one can argue, there isn't really a valid reason to try to run code
from the ISR stack or similar.
Though, it could make sense if one had a mechanism for a context
switch to dump the whole register file to Block-RAM, and some way
to access this RAM via an MMIO interface.
Just put it in DRAM at SW controlled (via TLB) addresses.
Possibly.
It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
stuff could be baked into hardware... But, I don't want to go this
route (baking parts of it into the C ABI is at least "slightly" less
evil).
My mechanism is taking that struct task.....s (at least the part HW
needs to understand) and associating each one into a table that points
at DRAM. Now, when you want this thread to run, you load up the pointer,
set the e-bit (enabled), and write it into the current header at its
privilege level. Poof--all 5 cache lines of state from the currently
running thread go back to where their permanent home in DRAM is, and
the new thread fetches its 5 cache lines of state.
a) you can start the reads before you start the writes
b) you can start the writes anytime you have outbound access to "the bus"
c) the writes can be no later than the ½ cycle before the reads get written.
Which is a lot faster than you can do in SW with LDs and STs.
Also possible could be to add another CR for "Dump context registers
here", this adds the costs of another CR though.
I config-space mapped all my CRs, so you get an unlimited number of them.
I guess I can probably safely rule out MMIO, on the basis that
context switching by moving registers via MMIO would be slower than
the current mechanism (using a series of Load/Store instructions).
Yes, but PTHREADing can be done without privilege and in a single
instruction.
OK.
Luckily, a thread-switch only needs to go 1-way, reducing it to around
500 cycles as-is in my case.
In my case it is about MemoryLatency+5 cycles.
Yes, thread switch is a 1-way function--which is the reason you can
allow a user to preempt himself and allow a compatriot to run in his place.....
Theoretical minimum would be around 150-200 cycles, with most of the
savings based on eliminating around 1.5kB worth of "memcpy()"...
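A quick sanity check of those numbers, using the ~300 MB/s at 50 MHz
memcpy rate quoted earlier in the thread (the reduction of that rate to
bytes per cycle is mine):

    #include <stdio.h>

    int main(void)
    {
        double bytes_per_cycle = 300e6 / 50e6;  /* ~6 B/cycle from the */
                                                /* quoted memcpy rate  */
        /* moving ~1.5 kB of saved state: roughly 256 cycles, about   */
        /* the gap between the ~500 cycles measured and the floor     */
        printf("%.0f cycles\n", 1536.0 / bytes_per_cycle);
        return 0;
    }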
My Real Time version of MY 66000 does a 10-ish cycle context switch
(as seen at the CPU), but here a hunk of HW has gathered up those 5
cache lines and sent them to the targeted CPU, and all the CPU has to do
is push out the old state (5 cache lines). So the data was heading
towards the CPU before the CPU even knew it wanted that data !!
This need not involve an ISA change, could in theory be done by making
the SYSCALL ISR mandate that TBR be valid (and the associated compiler
changes, likely the main issue here).
Well, nevermind any cost of locating the next thread, but at the
moment, I am using a fairly simplistic round-robin scheduling
strategy, so the scheduler mostly starts at a given PID, and looks for
the next PID that holds a valid/running task (wrapping back to PID 1
if it hits the end, and stopping the search if it gets back to the
original PID).
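A minimal sketch of that search (MAX_PID and task_valid() are
hypothetical stand-ins for the real task table):

    #define MAX_PID 256                /* hypothetical table size     */

    extern int task_valid(int pid);    /* 1 if this PID holds a       */
                                       /* valid/running task          */

    int next_task(int cur_pid)
    {
        int pid = cur_pid;
        do {
            pid = (pid >= MAX_PID) ? 1 : pid + 1;  /* wrap to PID 1   */
            if (task_valid(pid))
                return pid;
        } while (pid != cur_pid);      /* stop when back at the start */
        return cur_pid;                /* nothing else runnable       */
    }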
The high-level threading model wasn't based on pthreads in my case,
but rather C11 threads (and had implemented a lot of the "threads.h"
stuff).
One could potentially mimic pthreads on top of C11 threads though.
At the moment, I forgot why I decided to go with C11 threads over
pthreads, but IIRC I think I had felt at the time like C11 threads
were a better fit.
One could have enough register banks for N logical tasks, but
supporting 4 or 8 copies of the register file is going to cost more
than 2 or 3.
Above, I was describing what the hardware was doing.
The software side is basically more like:
Branch from VBR-table to ISR entry point;
Get R0 and R1 saved onto the stack;
Where did you get the address of this stack ??
SP and SSP swap places on interrupt entry (currently by renumbering
the registers in the instruction decoder).
So, in effect, you actually have 33 registers with only 32 visible at
any instant. I am just so glad not to have gone down that rabbit hole
this time......
SSP is initialized early on to the SRAM stack, so when an interrupt
happens, the 'SP' register automatically becomes the SRAM stack.
Essentially, both SP and SSP are SPRs, but:
SP is mapped into R15 in the GPR space;
SSP is mapped into the CR space.
So, when executing an ISR, it is effectively using SSP as its SP.
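Sketched as the decoder-side renumbering this describes (the SSP
register number here is made up; in the original it lives in CR space):

    enum { R_SP = 15, R_SSP = 64 };   /* SSP number is hypothetical  */

    static int map_reg(int r, int in_isr)
    {
        /* ISR-mode code that names SP transparently gets SSP */
        return (r == R_SP && in_isr) ? R_SSP : r;
    }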
If I were to eliminate this implicit register-swap mechanism, then the
ISR entry would likely need to reload a constant address each time.
Though, this change would also break binary compatibility with my
existing code.
But, in theory, eliminating the register swap could allow demoting SP
to being a normal GPR.
Also, things like renumbering parts of the register space based on CPU
mode is expensive.
Though, some of my more recent design ideas would have gone over to an
ordering slightly more like RISC-V, say:
R0: ZR or PC (ALU or MEM)
R1: LR or TBR (ALU or MEM)
R2: SP
R3: GP (GBR)
R4 -R15: Scratch
R16-R31: Callee Save
R32-R47: Scratch
R48-R63: Callee Save
Would likely not adopt RISC-V's C ABI though.
R0:: GPR, Return Address, proxy for IP, proxy for 0
R1..R9 Arguments and results passed in registers
R10..R15 Temporary Registers (scratch)
R16..R29 Callee Save
R30 FP when in use, Callee Save
R31 SP
Though, if one assumes R4..R63 are GPRs, this would allow both this
ISA and RISC-V to still use the same register numbering.
This is already fairly close to the register numbering scheme used in
XG2RV, though the assumption was that XG2RV would have used RV's ABI,
but this was stalled out mostly due to compiler issues (getting BGBCC
to be able to follow RISC-V's C ABI rules would be a non-trivial level
of effort; but is rendered moot if one still needs to use call thunking).
The interpretation for R0 and R1 would depend on how they are used:
ALU or similar: ZR and LR (Zero and Link Register)
Load/Store Base: PC and TBR.
Idea being that in userland, TBR effectively still exists as a
Read-Only register (allowing userland to modify TBR would effectively
also allow userland to wreck the OS).
Thing is mostly that needing to renumber registers in the decoder
based on CPU mode isn't entirely free in terms of LUT cost or timing
latency (even if it only applies to a subset of the register space).
Note that for RV decoding:
X0..X31 -> R0 ..R31 (more or less)
F0..F31 -> R32..R63
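That mapping as a one-line helper (the function name is mine):

    /* n = architectural register number 0..31 */
    static int rv_to_internal(int is_fpr, int n)
    {
        return (is_fpr ? 32 : 0) + n;   /* X0..X31 -> R0..R31,  */
                                        /* F0..F31 -> R32..R63  */
    }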
But, RV's FPU instructions don't match up exactly 1:1, and some cases
would have semantic differences.
Though, it seems like most RV code could likely tolerate some
deviation in some areas (will it care that the high 32 bits of a
Binary32 register don't hold NaN? Will it care about the extra
funkiness going on in LR? ...).
Get some of the CRs saved off (we need R0 and R1 free here);
Get the rest of the GPRs saved onto the stack;
Call into the main part of the ISR handler (using normal C ABI);
Restore most of the GPRs;
Restore most of the CRs;
Restore R0 and R1;
Do an RTE.
If HW does register file save/restore the above looks like::
The software side is basically more like:
Branch from VBR-table to ISR entry point;
Call into the main part of the ISR handler (using normal C ABI);
Do an RTE.
See what it saves ??
This is fewer instructions.
But, hardware cost,
the HW cost has already been purchased by the state machine that writes
out 5-cache lines and waits for 5-cache lines to arrive.
and clock-cycle savings?...
The reads can arrive before you start the writes; you can go so far as
to organize your pipeline so the read data being written pushes
out the write data that needs to return to memory, making the timing
brain-dead easy to achieve.
As-is, I can't come up with much that is both:
Fairly cheap to implement in hardware;
Would save a lot of clock-cycles over software-based options.
As noted, the former is also why I had thus far mostly rejected the
RISC-V strategy (*).
Yet, you seem to be buying insurance as if you might need to head in that direction.
*: Ironically, despite RISC-V having fewer GPRs, to implement the
Privileged spec, RISC-V would still end up needing a somewhat bigger
register file... Nevermind what exactly is going on with CSRs...
Whereas that special State is only a dozen registers <with state>
in My 66000--the rest being either memory resident or memory mapped.
On 2023-11-23 6:30 p.m., MitchAlsup wrote:
BGB wrote:
Whereas that special State is only a dozen registers <with state>
in My 66000--the rest being either memory resident or memory mapped.
My 68000 CPU core had a couple of task-switching instructions added to
it. I made a dedicated task-switch RAM wide enough to load or store all
the 68k registers in a single clock. Total task-switch time was about
four clocks IIRC. The interrupt vector table was set up to be able to
automatically task switch on interrupt. The RAM had storage for up to
512 tasks, but it was dedicated inside the CPU core rather than storing
task information in the memory system.
Q+ has a 64-entry register file, so it would take eight or nine cache
lines to store the context. The Q+ register file is 4W18R ATM. Getting
from the register file to or from a cache line is a challenge. To access
groups of eight registers at once would mean adding or using eight
register-file ports. The register file has only four write ports, so
only ½ of a cache line could be written to the file in a clock cycle.
It is appealing to handle multiple registers per clock. Read/write ports
are dedicated to specific function units, so making use of them for task
switching may involve additional logic. I called the CSR that stores the
task-state address the TS CSR.
As I understand it, RISC-V normally does not use multiple register
files; it has only a single file. There may be implementations out there
that do make use of multiple files, but I think the standard is set up
to get by with a single file.
Robert Finch wrote:
On 2023-11-23 6:30 p.m., MitchAlsup wrote:
BGB wrote:
Whereas that special State is only a dozen registers <with state>
in My 66000--the rest being either memory resident or memory mapped.
My 68000 CPU core had a couple of task switching instructions added to
it. I made a dedicated task switch RAM wide enough to load or store
all the 68k registers in a single clock. Total task switch time was
about four clocks IIRC. The interrupt vector table was set up to be
able to automatically task switch on interrupt. The RAM had storage
for up to 512 tasks, but it was dedicated inside the CPU core rather
than storing task information in the memory system.
This is headed in the right direction. Make context switching something
easy to pull off.
Q+ has a 64-entry register file, so it would take eight or nine cache lines
to store the context. Q+ register file is 4w18r ATM. Getting from the
register file to or from a cache line is a challenge. To access groups
of eight registers at once would mean adding or using eight register
file ports. The register file has only four write ports so only ½ of a
cache line could be written to the file in a clock cycle. It is
appealing to handle multiple registers per clock. Read/write ports are
dedicated to specific function units, so making use of them for task
switching may involve additional logic. I called the CSR to store the
task state address the TS CSR.
4W generally ends up with 4R, and replications lead to 8R, 12R, 16R, and 20R.
Yet you chose 18. Why ?
This is above and beyond the "typical" operand consumption of a RISC ISA.
Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
12R, allowing 1 FU to consume 3 registers and 1 FU having only 1 operand
(or forwarding)). What are you using the other 5 operands for ??
As I understand it, RISC-V normally does not use multiple register files,
RISC-V has a 32 entry GPR and a 32 entry FPR.
it has only a single file. There may be implementations out there that
do make use of multiple files, but I think the standard is setup to
get by with a single file.
On 11/23/23 6:30 PM, MitchAlsup wrote:
[snip]
Stay away from anything you see in x86, except in using it as a moniker
of what to avoid.
Even a stopped (12-hour) clock is right twice a day.
I hope you are not going to remove from My 66000 variable-length
instruction encoding, hardware handling of context saving and restoring
(which XSAVE/XRSTOR covers some of on x86), or even Memory Move.
One could go even further and claim avoiding anything seen in x86
means not having registers (a storage region with simple, compact
addressing that an implementation will optimize as the common case
for operands — the Mill's Belt counts as "registers" in this sense
and even something like a transport-trigger architecture would
likely have storage for values with temporal locality coarser than
immediate use but frequent enough to justify simpler and more
compact addressing).
Yes, x86 messes up even these aspects. VLE does not have to be
byte granular or use multiple prefixes in variable order. Hardware
context save/restore does not have to be limited to extended
state. A memory move instruction does not *need* to have a variant
for each possible/likely chunk size or be implemented as
substantially less performant than a software implementation, even
with compile-time known size and alignment. Registers do not have
to be limited to 8 or be accessed in sub-units.
(Sub-unit access has some attraction to me for more efficiently
using a limited storage space while still trying to keep access
simple by limiting variability of shifting and complexity of
partial write ordering, but less efficient storage use can easily
be better than complexity of accessing the fastest and most
commonly accessed storage. More recent ISAs have implemented
partial register accesses. IBM ZArch, an S/360 descendant, split GPRs
into high and low halves to increase the number of values
available in the nominally 16 GPRs. AArch64 has 32-bit compute
operations motivated, I think, by power saving, which do not
increase the number of values and so avoids the shift and partial-
write problems.)
I suspect you could write a multi-volume treatise on x86 about hardware-software interface design and management (including the
social and economic considerations of project/product management).
Ignoring human factors, including those outside the organization
owning the interface, seems attractive to a certain engineering
mindset but human factors are significant design considerations.
[Yet once more stating what is obvious, especially to one skilled
in the art.]
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
Stefan Monnier <monnier@iro.umontreal.ca> writes:
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
Would require priority decoders to differentiate rather
than simple gates, probably.
Although I wonder at the missing firmware privilege level, a la SMM or EL3.
ARM added support for nested hypervisors without adding a
new exception level. Although interesting, there isn't much
evidence of it being used in production. Yet anyway.
Paul A. Clayton wrote:
On 11/23/23 6:30 PM, MitchAlsup wrote:
[snip]
Stay away from anything you see in x86, except in using it as a moniker
of what to avoid.
Even a stopped (12-hour) clock is right twice a day.
I hope you are not going to remove from My 66000 variable-length
instruction encoding, hardware handling of context saving and restoring
(which XSAVE/XRSTOR covers some of on x86), or even Memory Move.
It is these things which allow my architecture to only need 70%
of the instructions RISC-V needs.
One could go even further and claim avoiding anything seen in x86
means not having registers (a storage region with simple, compact
addressing that an implementation will optimize as the common case
for operands — the Mill's Belt counts as "registers" in this sense
and even something like a transport-trigger architecture would
likely have storage for values with temporal locality coarser than
immediate use but frequent enough to justify simpler and more
compact addressing).
Having 1 set of flat registers (any register can hold any result or
operand) is a My 66000 requirement. The only things I took from x86-64
are the [base+index<<scale+displacement] memory addressing model and
the 2-level MMU, and even here I used the I/O MMU version rather than
the processor version.
Yes, x86 messes up even these aspects. VLE does not have to be byte
granular or use multiple prefixes in variable order.
VLE does not need prefixes of any kind.
Hardware context save/restore does not have to be limited to extended
state.
HW S/R is most useful when it deals with ALL the state.
A memory move instruction does not *need* to have a variant
for each possible/likely chunk size or be implemented as
substantially less performant than a software implementation,
even with compile-time known size and alignment. Registers do not have
to be limited to 8 or be accessed in sub-units.
One can synthesize SIMD and Vector, saving 90% of the OpCode space.
(Sub-unit access has some attraction to me for more efficiently
using a limited storage space while still trying to keep access
simple by limiting variability of shifting and complexity of
partial write ordering, but less efficient storage use can easily
be better than complexity of accessing the fastest and most
commonly accessed storage. More recent ISAs have implemented
partial register accesses. IBM ZArch, an S/360 descendant, split GPRs
into high and low halves to increase the number of values
available in the nominally 16 GPRs. AArch64 has 32-bit compute
operations motivated, I think, by power saving, which do not
increase the number of values and so avoids the shift and partial-
write problems.)
I suspect this came out of already having to implement HW for the IC
(Insert Character) instruction back in System/360 times.
I suspect you could write a multi-volume treatise on x86 about
hardware-software interface design and management (including the
social and economic considerations of project/product management).
Ignoring human factors, including those outside the organization
owning the interface, seems attractive to a certain engineering
mindset but human factors are significant design considerations.
It would be more beneficial to the world just to build an architecture
without any of those flaws--just to show them how it's done.
[Yet once more stating what is obvious, especially to one skilled
in the art.]
Captain Obvious would be proud.
On 11/24/2023 6:01 PM, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
Would require priority decoders to differeniate rather
than simple gates, probably.
Although I wonder at the missing firmware privilege level, a la SMM or EL3.
ARM added support for nested hypervisors without adding a
new exception level. Although interesting, there isn't much
evidence of it being used in production. Yet anyway.
It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.
With pretty much anything that isn't "bare metal" being put in User Mode
(potentially using emulation traps as needed).
Something like a Soft-TLB or Inverted-Page-Table does not need any
special hardware to support nested translation (whereas hardware
page-walking would require dedicated support).
Not entirely sure how multi-level virtualization works with page-tables,
but works "somehow".
BGB <cr88192@gmail.com> writes:
On 11/24/2023 6:01 PM, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
Would require priority decoders to differeniate rather
than simple gates, probably.
Although I wonder at the missing firmware privilege level, a la SMM or EL3.
ARM added support for nested hypervisors without adding a
new exception level. Although interesting, there isn't much
evidence of it being used in production. Yet anyway.
It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.
With pretty much anything that isn't "bare metal" being put in User Mode
(potentially using emulation traps as needed).
Something like a Soft-TLB or Inverted-Page-Table does not need any
special hardware to support nested translation (whereas hardware
page-walking would require dedicated support).
It's been tried. And performance sucked big-time. The reason
that AMD added back support for the DS limit register in AMD64
was to support Xen (and VMware) before Pacifica (the AMD project
that became Secure Virtual Machine (SVM), now known as AMD-V).
Both intel and AMD use a block of memory to record guest state
and have instructions to enter and leave VM mode (e.g. vmenter);
ARM stores guest state in system registers - less overhead
when switching from guest to host or guest to guest.
Not entirely sure how multi-level virtualization works with page-tables,
but works "somehow".
But not well, nor performant.
On 11/25/2023 10:55 AM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 11/24/2023 6:01 PM, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
Would require priority decoders to differeniate rather
than simple gates, probably.
Although I wonder at the missing firmware privilege level, a la SMM or EL3.
ARM added support for nested hypervisors without adding a
new exception level. Although interesting, there isn't much
evidence of it being used in production. Yet anyway.
It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.
With pretty much anything that isn't "bare metal" being put in User Mode
(potentially using emulation traps as needed).
Something like a Soft-TLB or Inverted-Page-Table does not need any
special hardware to support nested translation (whereas hardware
page-walking would require dedicated support).
It's been tried. And performance sucked big-time. The reason
that AMD added back support for the DS limit register in AMD64
was to support xen (and vmware) before Pacifica (the AMD project
that became Secure Virtual Machine (SVM) known now as AMD-V).
OK.
I wouldn't expect nested inverted-page-table translation to be *that*
much slower than normal inverted page tables. Though, it would add a bit
of multi-level translation wonk in the top-level miss handler (and is
likely still better than multi-level soft-TLB, where a miss in the outer
TLB level means needing to propagate the interrupt inwards and then
emulate it the whole way up).
Granted, there is still the annoyance that the OS's tend to deal with page-tables, and one needs to translate to inverted page tables, which typically have a finite associativity (such as 4 or 8 way).
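A minimal sketch of such a finite-associativity inverted-table lookup
(the sizes, hash, and tag packing are illustrative, not from BJX2):

    #include <stdint.h>

    #define SETS 1024
    #define WAYS 4

    typedef struct { uint64_t tag; uint64_t ppn_flags; } ipte;
    static ipte ipt[SETS][WAYS];

    static ipte *ipt_lookup(uint64_t vpn, uint16_t asid)
    {
        uint64_t tag = (vpn << 16) | asid;            /* toy packing */
        uint32_t set = (uint32_t)(vpn & (SETS - 1));  /* trivial hash */

        for (int w = 0; w < WAYS; w++)
            if (ipt[set][w].tag == tag)
                return &ipt[set][w];
        return 0;   /* all ways miss: fall back to the slower path */
    }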
Would mean that multi-level interrupt handling would still be needed
whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).
Granted, a full soft-TLB isn't ideal for performance either (in
general); my workaround was mostly making the TLB big enough that the
average-case miss rate is kept fairly low (well, and for now, putting
the whole OS in one big address space).
But, multiple address spaces are sort of the whole point of VMs, so...
Seems like one might need a mechanism to remap the VM from real CR's to
a partially emulated set of CR's (VCR's ?...).
Both intel and AMD use a block of memory to record guest state
and have instructions to enter and leave VM mode (e.g. vmenter);
ARM stores guest state in system registers - less overhead
when switching from guest to host or guest to guest.
OK.
Not entirely sure how multi-level virtualization works with page-tables, >>> but works "somehow".
But not well, nor performant.
As far as I know, the whole "nested page tables" thing was the core of
how virtualization worked on x86...
On 11/25/2023 1:28 PM, Scott Lurndal wrote:
If you're taking an interrupt to resolve guest TLB misses,
performance is clearly not high priority.
If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
CPU), then the cost of the TLB miss handling is on par with other things
like handling the timer interrupt, etc...
But, what one does need, is a way to perform context switches without
also triggering a huge wave of TLB misses in the process.
Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
full TLB flush on context-switch would suck pretty bad).
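To put numbers on that claim (the ~1000-cycle per-miss handler cost is
my assumption, purely for scale):

    #include <stdio.h>

    int main(void)
    {
        double hz           = 50e6;   /* 50 MHz core                  */
        double misses_sec   = 500;    /* upper end of the range above */
        double cyc_per_miss = 1000;   /* assumed handler cost         */

        printf("overhead: %.2f%% of cycles\n",
               100.0 * misses_sec * cyc_per_miss / hz);  /* ~1.00%    */
        return 0;
    }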
BGB <cr88192@gmail.com> writes:
On 11/25/2023 10:55 AM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 11/24/2023 6:01 PM, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
Would require priority decoders to differeniate rather
than simple gates, probably.
Although I wonder at the missing firmware privilege level, a la SMM or EL3.
ARM added support for nested hypervisors without adding a
new exception level. Although interesting, there isn't much
evidence of it being used in production. Yet anyway.
It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.
With pretty much anything that isn't "bare metal" being put in User Mode
(potentially using emulation traps as needed).
Something like a Soft-TLB or Inverted-Page-Table does not need any
special hardware to support nested translation (whereas hardware
page-walking would require dedicated support).
It's been tried. And performance sucked big-time. The reason
that AMD added back support for the DS limit register in AMD64
was to support xen (and vmware) before Pacifica (the AMD project
that became Secure Virtual Machine (SVM) known now as AMD-V).
OK.
I wouldn't expect nested inverted-page-table translation to be *that*
much slower than normal inverted page tables. Though, would add a bit of
multi-level translation wonk in the top-level miss handler (and likely
still better than multi-level soft-TLB, where a miss in the outer TLB
level means needing to propagate the interrupt inwards and then
emulating it the whole way up).
Let's look at a hardware example of a nested page table walk,
using the AMD nested page table feature as a guide. The AMD
version uses the same PTE format as the non-virtualized page
tables (which reduces the amount of kernel code required to
manage the page tables) unlike Intel's EPT.
Assuming 4k-byte pages in both the primary and nested page tables,
a page table walk must make 22 memory accesses to satisfy a
VA to PA translation, versus only four in a non-virtualized
table walk. This can be reduced to 11 if you have the luxury
of using 1GB mappings in the nested page table.
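The arithmetic behind counts like this, under one common way of
counting (the formula is mine, not from the post; the quoted 22
presumably credits the walker with caching a couple of top-level
entries):

    #include <stdio.h>

    /* g guest levels, h nested levels: each guest-table access needs
     * an h-level nested walk plus the access itself, and the final
     * guest PA needs one more nested walk. */
    static int walk_accesses(int g, int h)
    {
        return g * (h + 1) + h;
    }

    int main(void)
    {
        printf("%d\n", walk_accesses(4, 4));  /* 24, before caching */
        return 0;
    }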
Performing all those accesses in a kernel fault handler would
consume a great deal more time than a hardware table walker will (particularly
if the hardware table walkers can cache the intermediate results
of the higher-level blocks in the walk in the walk hardware).
The downsides of IPT pretty much preclude its use in most
modern operating systems where shared memory between processes
is common (explicitly -or- implicitly (such as VDSO on linux));
some of the goals listed as benefits for IPT (e.g. easier whole-
process swapping) are made irrelevant by modern operating
systems that don't do that. There's a rather incoherent
description of IPT at geeksforgeeks - I'd not recommend it
as a useful resource.
Would mean that multi-level interrupt handling would still be needed
whenever the page isn't in the guest's TLB or VIPT (short of breaking
abstraction and faking the use of hardware page walking for the guest OS's).
If you're taking an interrupt to resolve guest TLB misses,
performance is clearly not high priority.
Seems like one might need a mechanism to remap the VM from real CR's to
a partially emulated set of CR's (VCR's ?...).
ARM does this by adding a layer above the OS ring that can trap
accesses to certain control registers used by the OS to the
hypervisor for resolution. But for the most part, the guest just
uses the same control registers as if it were running bare metal with
no trapping - they're just loaded by the hypervisor before the guest
is dispatched and saved by the hypervisor when scheduling a new
guest. That's an advantage of the exception-level scheme, where
each level has its own set of control registers.
However, a shortcoming of the initial implementation was that if the
hypervisor was type II, the hypervisor needed to have a special
privileged guest to run standard user-mode code[*]. So they
added (in V8.1) the virtual host extensions (VHE), which allowed
the hypervisor exception level (EL2) to directly dispatch
user-mode code to EL0 (with the normal traps from usermode
to the OS directed to the hypervisor instead of a guest OS). This
lets the hypervisor (e.g. KVM) act both as a hypervisor and
a guest OS without the context switches required to support a
privileged guest.
[*] And also to provide VFIO support for non-SRIOV hardware devices.
But not well, nor performant.
As far as I know, the whole "nested page tables" was the core of how
virtualization worked on x86...
Before AMD added NPT (Nested Page Tables), the hypervisor needed to
be able to recognize and trap any accesses from the guest OS to
its own page tables and update the real page tables accordingly.
To do that, they had several options:
1) Paravirtualization (i.e. all guest page table ops call the
hypervisor rather than changing the page tables directly);
Xen did this. (A sketch of this option follows the list.)
2) Write-protecting the page tables and trapping any writes in
the hypervisor. Difficult to do since the page tables in
common OS are not allocated contiguously and they are updated
using normal loads and stores (the HV does know them, however,
as it can trap writes to CR3 and from there can write-protect
the entire table in the real page tables).
3) Binary patch the guest operating system. This was the approach used
by VMware before AMD introduced NPT.
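As a sketch of option 1 above, loosely modeled on Xen's mmu_update hypercall (the names here are hypothetical, not the real Xen interface): the guest never stores to a PTE directly; it asks the hypervisor, which validates the update and mirrors it into the real tables.

typedef unsigned long pte_t;

/* Hypothetical hypercall stub; a real guest would trap into the HV
   here, and the HV would check that 'new_val' maps only memory the
   guest owns before applying it to the real page tables. */
static long hv_mmu_update(pte_t *ptep, pte_t new_val)
{
    (void)ptep; (void)new_val;
    return 0;                /* pretend the HV accepted the update */
}

/* The guest kernel's PTE-write path becomes a call instead of a store. */
static int set_pte(pte_t *ptep, pte_t val)
{
    return hv_mmu_update(ptep, val) == 0;
}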
BGB <cr88192@gmail.com> writes:
On 11/25/2023 10:55 AM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 11/24/2023 6:01 PM, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?
Would require priority decoders to differentiate rather
than simple gates, probably.
Although I wonder at the missing firmware privilege level, a la SMM or EL3.
ARM added support for nested hypervisors without adding a
new exception level. Although interesting, there isn't much
evidence of it being used in production. Yet anyway.
It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.
With pretty much anything that isn't "bare metal" being put in User Mode
(potentially using emulation traps as needed).
Something like a Soft-TLB or Inverted-Page-Table does not need any
special hardware to support nested translation (whereas hardware
page-walking would require dedicated support).
It's been tried. And performance sucked big-time. The reason
that AMD added back support for the DS limit register in AMD64
was to support xen (and vmware) before Pacifica (the AMD project
that became Secure Virtual Machine (SVM) known now as AMD-V).
OK.
Scott Lurndal wrote:
Seems like one might need a mechanism to remap the VM from real CR's to
a partially emulated set of CR's (VCR's ?...).
ARM does this by adding a layer above the OS ring that can trap
accesses to certain control registers used by the OS to the
hypervisor for resolution. But for the most part, the guest just
uses the same control registers as if it were running bare metal with
no trapping - they're just loaded by the hypervisor before the guest
is dispatched and saved by the hypervisor when scheduling a new
guest. Thats an advantage of the exception level scheme, where
each level has its own set of control registers.
My 66000 memory maps control registers {CPU, LLC, NorthBridge,
device, ...} into MMI/O space. A CPU, with access permission,
can read or write another CPU's control registers--used sparingly
to get out of trouble. Mainly this is used to allow a CPU to
read or write device control registers.
Nested Page Tables are the best solution (Fewest SW instructions of
overhead and total cycles of latency) we currently know of.
mitchalsup@aol.com (MitchAlsup) writes:
My 66000 memory maps control registers {CPU, LLC, NorthBridge,
device, ...} into MMI/O space. A CPU, with access permission,
can read or write another CPU's control registers--used sparingly
to get out of trouble. Mainly this is used to allow a CPU to
read or write device control registers.
ARM supports access to CPU system registers via MMIO;
primarily for debug purposes. System Registers may be accessed
either via MMIO accesses from a running core, subject to
permission controls, or via JTAG interface(s).
The preferred way to access a cores own system registers is
via the MSR/MRS instructions.
<snip>
Nested Page Tables are the best solution (Fewest SW instructions of
overhead and total cycles of latency) we currently know of.
Indeed.
BGB <cr88192@gmail.com> writes:
On 11/25/2023 1:28 PM, Scott Lurndal wrote:
If you're taking an interrupt, to resolve guest TLB misses,
performance is clearly not high priority.
If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
CPU), then the cost of the TLB miss handling is on par with other things
like handling the timer interrupt, etc...
Any cycle used by the miss handler is a cycle that could
have been used for useful work. Timer interrupt handling
is often very short (increment a memory location, a comparison
and a return if no timer has expired). And we're long
past the days of using regular timer interrupts for scheduling
(see tickless kernels, for example).
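That "very short" handler might look something like this (a minimal sketch with invented names, just to show how little work sits in the common path):

static volatile unsigned long jiffies;   /* the incremented location */
static unsigned long next_deadline;      /* earliest pending timer */

static void run_expired_timers(void) { /* slow path, elided */ }

void timer_isr(void)
{
    jiffies++;                     /* increment a memory location */
    if (jiffies < next_deadline)   /* comparison */
        return;                    /* return if no timer has expired */
    run_expired_timers();
}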
But, what one does need, is a way to perform context switches without
also triggering a huge wave of TLB misses in the process.
Why?
Note that depending on the number of entries in your TLB
and the scheduler behavior, it's unlikely that any prior
TLB entries will be useful to a newly scheduled thread
(in a different address space).
Having multiple banks of TLBs that you can switch between
might be able to provide you with the capability to
reduce the TLB miss rate on scheduling a new thread of
execution - but CAMs aren't cheap.
For the most part, industry has settled on a large number
of tagged TLB entries as a good compromise. Some architectures have
a global bit in the entry that can be set via the page
table that indicates that ASID and/or VMID qualifications
aren't necessary for a hit.
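In other words, the per-entry hit check reduces to something like this (a sketch; the field names are invented, not any particular architecture's layout):

struct tlbe {
    unsigned long  vpn;      /* virtual page number tag */
    unsigned short asid;     /* address-space tag */
    unsigned char  global;   /* G bit set via the page table */
    unsigned long  ppn;      /* the translation itself */
};

static int tlb_hit(const struct tlbe *e, unsigned long vpn,
                   unsigned short cur_asid)
{
    /* a global entry skips the ASID qualification entirely */
    return e->vpn == vpn && (e->global || e->asid == cur_asid);
}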
Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
full TLB flush on context-switch would suck pretty bad).
That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
of the virtual address space is shared by all processes - there's no reason
that those entries need to be flushed on context-switch.
Scott Lurndal wrote:
ARM supports access to CPU system registers via MMIO;
primarily for debug purposes. System Registers may be accessed
either via MMIO accesses from a running core, subject to
permission controls, or via JTAG interface(s).
Nice to know someone already blazed the trail.
The preferred way to access a cores own system registers is
via the MSR/MRS instructions.
My 66000 has a HR (Header Register) instruction to access one
register at a time, but a MM (memory to memory move) instruction
can be used to swap the entire core-stack {HV-level context switch.}
MM to a MMI/O space is guaranteed to be ATOMIC across the entire
transfer.
But it is not just system registers, but all storage within a
CPU/core, the L2 control status registers, the HostBridge
control and status registers,...EVEN the register Registers
are available--remotely.
<snip>
Nested Page Tables are the best solution (Fewest SW instructions of
overhead and total cycles of latency) we currently know of.
Indeed.
On 11/25/2023 4:10 PM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 11/25/2023 1:28 PM, Scott Lurndal wrote:
If you're taking an interrupt, to resolve guest TLB misses,
performance is clearly not high priority.
If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
CPU), then the cost of the TLB miss handling is on par with other things
like handling the timer interrupt, etc...
Any cycle used by the miss handler is a cycle that could
have been used for useful work. Timer interrupt handling
is often very short (increment a memory location, a comparison
and a return if no timer has expired). And we're long
past the days of using regular timer interrupts for scheduling
(see tickless kernels, for example).
It takes roughly as much time to service a timer interrupt as to service
a TLB miss...
Much of the time spent in the latter is saving/restoring the
relevant registers, with the actual page table walk and 'LDTLB'
instruction typically a fairly minor part in comparison...
That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
of the virtual address space is shared by all processes - there's no reason
that those entries need to be flushed on context-switch.
AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
the defined behavior?... Well, at least ignoring the support for global
pages.
mitchalsup@aol.com (MitchAlsup) writes:
Scott Lurndal wrote:
My 66000 memory maps control registers {CPU, LLC, NorthBridge,
device, ...} into MMI/O space. A CPU, with access permission,
can read or write another CPU's control registers--used sparingly
to get out of trouble. Mainly this is used to allow a CPU to
read or write device control registers.
ARM supports access to CPU system registers via MMIO;
primarily for debug purposes. System Registers may be accessed
either via MMIO accesses from a running core, subject to
permission controls, or via JTAG interface(s).
Nice to know someone already blazed the trail.
Note that a handful of system registers, when accessed
using the MRS/MSR instructions are self-synchronizing
with-respect to other state. This, architecturally,
does _not_ hold when accessed via MMIO.
mitchalsup@aol.com (MitchAlsup) writes:
Scott Lurndal wrote:
But it is not just system registers, but all storage within a
CPU/core, the L2 control status registers, the HostBridge
control and status registers,...EVEN the register Registers
are available--remotely.
Yes, we do that (useful on chips that can also be a PCIe endpoint).
Even AMD does that with the memory controllers, SMI, I2C/I3C
etc. appearing as PCI endpoints.
BGB <cr88192@gmail.com> writes:
On 11/25/2023 4:10 PM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 11/25/2023 1:28 PM, Scott Lurndal wrote:
If you're taking an interrupt, to resolve guest TLB misses,
performance is clearly not high priority.
If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
CPU), then the cost of the TLB miss handling is on par with other things
like handling the timer interrupt, etc...
Any cycle used by the miss handler is a cycle that could
have been used for useful work. Timer interrupt handling
is often very short (increment a memory location, a comparison
and a return if no timer has expired). And we're long
past the days of using regular timer interrupts for scheduling
(see tickless kernels, for example).
It takes roughly as much time to service a timer interrupt as to service
a TLB miss...
You'll need to provide more than an assertion for that.
Much of the time spent in the latter is saving/restoring the
relevant registers, with the actual page table walk and 'LDTLB'
instruction typically a fairly minor part in comparison...
Then you've a poorly written handler. Note that a hardware table
walker doesn't need to save any registers.
<snip>
That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
of the virtual address space is shared by all processes - there's no reason
that those entries need to be flushed on context-switch.
AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
the defined behavior?... Well, at least ignoring the support for global
pages.
On 11/25/2023 4:10 PM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 11/25/2023 1:28 PM, Scott Lurndal wrote:
If you're taking an interrupt, to resolve guest TLB misses,
performance is clearly not high priority.
If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
CPU), then the cost of the TLB miss handling is on par with other things
like handling the timer interrupt, etc...
Any cycle used by the miss handler is a cycle that could
have been used for useful work. Timer interrupt handling
is often very short (increment a memory location, a comparison
and a return if no timer has expired). And we're long
past the days of using regular timer interrupts for scheduling
(see tickless kernels, for example).
It takes roughly as much time to service a timer interrupt as to service
a TLB miss...
Much of the time spent in the latter is saving/restoring the relevant
registers, with the actual page table walk and 'LDTLB'
instruction typically a fairly minor part in comparison...
At least, excluding something like using B-Tree based page tables...
It could be made faster, but would likely require doing the TLB miss
handler in ASM and only saving/restoring the minimum number of registers (well, at least until we detect that there will be a page-fault, which
would still require falling back to a "more comprehensive" handler).
Any L1 miss penalties from the page-walk itself would likely also apply
to a hardware page-walker.
But, what one does need, is a way to perform context switches without
also triggering a huge wave of TLB misses in the process.
Why?
Note that depending on the number of entries in your TLB
and the scheduler behavior, it's unlikely that any prior
TLB entries will be useful to a newly scheduled thread
(in a different address space).
I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
With a 16K page size, this is basically enough to keep roughly something
the size of the working set of Doom entirely in the TLB.
In my past experiments, 16K seemed to be the local optimum for the
programs tested:
4K and 8K resulted in higher miss rates;
32K and 64K resulted in more internal fragmentation without much
reduction in miss rate.
Having multiple banks of TLBs that you can switch between
might be able to provide you with the capability to
reduce the TLB miss rate on scheduling a new thread of
execution - but CAMs aren't cheap.
This is why my TLB is 4-way set-associative.
An 8-way TLB would be a lot more expensive, and a fully-associative TLB
(of nearly any non-trivial size) would be effectively implausible.
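For reference, the set-associative arrangement is what keeps the comparator count down: the low VPN bits select a set, and only the 4 ways of that set are compared (a sketch using the numbers above):

#define PAGE_SHIFT 14          /* 16K pages */
#define SET_COUNT  256         /* 256 sets x 4 ways = 1024 TLBEs */

static unsigned tlb_set_index(unsigned long va)
{
    /* 4 tag comparisons per lookup, vs 1024 for fully-associative */
    return (va >> PAGE_SHIFT) & (SET_COUNT - 1);
}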
For the most part, industry has settled on a large number
of tagged TLB entries as a good compromise. Some architectures have
a global bit in the entry that can be set via the page
table that indicates that ASID and/or VMID qualifications
aren't necessary for a hit.
Yeah.
I guess a factor here is mostly defining rules to both allow for and
control the scope of global pages.
In my case:
The TTB register defines an ASID in the high order bits;
The TLBE also has an ASID;
The ASID is split into two parts (6 and 10 bits).
In the ASID, 0 designates global pages
But they are broken into "groups"
So typically a global page is only shared within a given group.
I am thinking the 6.10 split may have given too many bits to the group,
and 4.12 or 2.14 might have been better.
As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
would not (but would see global pages in ASID 0400).
So, say, in the current scheme:
ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
global address space.
Where, say, if during a TLB Miss a page is marked global, it can be
put into one of these ASIDs rather than the main ASID of the current
process (if not in an ASID range which disallows global pages).
The size of the group will have an effect on miss rate in cases where
there are a lot of active PIDs though.
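A sketch of the match rule this scheme implies (using the 6.10 split as described; the split is parameterized, so 4.12 or 2.14 is a one-line change, and the ranges that disallow global pages are ignored for brevity):

#define GROUP_BITS 6
#define LOCAL_BITS (16 - GROUP_BITS)          /* 10, per the 6.10 split */
#define LOCAL_MASK ((1u << LOCAL_BITS) - 1)

static int asid_match(unsigned short tlbe_asid, unsigned short cur_asid)
{
    if (tlbe_asid == cur_asid)
        return 1;                              /* ordinary per-ASID hit */
    /* group-global entry: local part is 0 and the group part matches */
    return (tlbe_asid & LOCAL_MASK) == 0 &&
           (tlbe_asid >> LOCAL_BITS) == (cur_asid >> LOCAL_BITS);
}
/* e.g. asid_match(0x0000, 0x03DE) hits, asid_match(0x0000, 0x045F)
   misses, and asid_match(0x0400, 0x045F) hits, as in the example above. */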
Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
full TLB flush on context-switch would suck pretty bad).
That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
of the virtual address space is shared by all processes - there's no reason
that those entries need to be flushed on context-switch.
AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
the defined behavior?... Well, at least ignoring the support for global pages.
On 2023-11-26 4:17 p.m., BGB wrote:
On 11/25/2023 4:10 PM, Scott Lurndal wrote:
<snip>
Any L1 miss penalties from the page-walk itself would likely also
apply to a hardware page-walker.
A hardware table walker strikes me as not being a large component.
Although untested yet, the Q+ table walker is only about 1,200 LUTs or
1% of the FPGA. Given the small size I think it is worth it to have the
table walker in hardware. It is hard to beat hardware timing wise when
it does not need to save / restore registers.
BGB <cr88192@gmail.com> writes:
<snip>
AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
the defined behavior?... Well, at least ignoring the support for global
pages.
Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
last decade.
On 11/26/2023 4:46 PM, Scott Lurndal wrote:
<snip>
It takes roughly as much time to service a timer interrupt as to service
a TLB miss...
You'll need to provide more than an assertion for that.
If
cycles, then the time needed to do a few memory loads, some bit
twiddling, and an LDTLB, mostly disappears in the noise...
Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
last decade.
It seems to have added "something" to support global pages, but doesn't
appear to use an ASID.
I had tried, with all sorts of ingenious compromises of register spaces
and the like, to fit all the capabilities I wanted into the opcode space
of a single version of the instruction set, eliminating the need for
blocks which contained instructions belonging to alternate versions of
the instruction set.
But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.
At first, when I mulled over this, I came up with multiple ideas to
address it, each one crazier than the last.
Seeing, therefore, that this was a difficult nut to crack, and not
wanting to go down in another wrong direction... instead, I found a way
to go that seemed to me to be reasonably sensible.
Go back to uncompromised 32-bit instructions, even though that means
there are no 16-bit instructions.
Then, bring back short instructions - effectively 17 bits long - so as
to have room for full register specifications. This means an alternative block format where 16, 32, 48, 64... bit instructions are all possible.
*But* because of the room 17-bit short instructions take up in the
header, the 32-bit instructions are the same regular format as in the
other case. Not some kind of 33-bit or 35-bit instruction with a new set
of instruction formats.
Having a look at the Concertina II ISA. I like the idea of
pseudo-immediates. All the immediates could be moved to one end of the
block and then skipped over during instruction fetch.
On Thu, 30 Nov 2023 11:22:55 -0500, Robert Finch wrote:
Having a look at the Concertina II ISA. I like the idea of
pseudo-immediates. All the immediates could be moved to one end of the
block and then skipped over during instruction fetch.
That is the general idea, with one minor correction.
The benefit of pseudo-immediates, like that of ordinary immediates,
is that they're already available, because they were brought into the
CPU by instruction fetch.
They get skipped over by the _next_ step, instruction decode.
Why a block structure? The goal is to have a situation where
instruction decode is largely done in parallel for the whole
block.
The first step is - is there a header? If not, decode all eight
32-bit instructions in the block in parallel.
If so, process the header, and that will directly and immediately
reveal where every instruction in the block begins, so again the
next step has all the instructions being decoded in parallel.
The header allows the length that immediates would add to instructions
to be in the pseudo-immediate instead, avoiding another potential
complication to instruction decoding.
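One way to picture the header's job (a sketch; this field layout is invented for illustration, not the actual Concertina II encoding): if the header carries a start bit per 16-bit parcel, every instruction boundary falls out in a single step, and pseudo-immediate parcels simply carry no start bit.

#include <stdint.h>

/* One 256-bit block = sixteen 16-bit parcels. Bit p of start_mask set
   means an instruction begins at parcel p; pseudo-immediates are 0. */
static int instruction_starts(uint16_t start_mask, int starts[16])
{
    int n = 0;
    for (int p = 0; p < 16; p++)
        if (start_mask & (1u << p))
            starts[n++] = p;    /* parcel index of each instruction */
    return n;   /* all boundaries known without serial length-decoding */
}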
In addition, having headers means that the instruction set can be
expanded or made flexible without it being possible to change the
mode of the CPU to cause it to read existing instruction code the
wrong way. Any modifications to how instructions are to be interpreted
are right there in the block header, so malware that can't alter
code can't work around that by changing how it is to be read.
Among the features the headers allow to be added are VLIW features,
such as instruction predication and explicitly indicating which
instructions can execute in parallel.
Why would you want this ?? HW does not seem to have much trouble
doing this already.
This allows high-performance but lightweight (non-OoO) implementations
if desired.
Have any GBnOoO machines been successful ?
John Savard
I have now modified the 17-bit shift instructions in the diagram, so
that they can also apply to all 32 integer registers, and I have
corrected the opcodes on the page
http://www.quadibloc.com/arch/cw0101.htm
Why not decode assuming there is a block header and also decode as if
there were not a block header. Then you can multiplex (choose) later
which one prevails. This puts the choice at at least 4 gates of delay
into the decode cycle.
If so, process the header, and that will directly and immediately
reveal where every instruction in the block begins, so again the next
step has all the instructions being decoded in parallel.
You then have to route the instructions to the decoders. Are your
decoders expensive enough in a wide implementation that this matters?
The alternative is to have a no-header decoder running in parallel with
a header decoder and choose which to use.
You MAY be able to alter the headers later in the architecture's life,
but ultimately you sacrifice forward compatibility.
BGB wrote:
It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.
I have been thinking about this for a while::
It seems to me that if one wants a robust system, the HyperVisor must
support various serviced-HyperVisors. This second (less privileged HV)
is, in essence, a HV that can crash without allowing the whole system
to crash {just like virtual machines can crash and take their applications
with them.}
Secondly:: Running an ISR at HV level is a privilege inversion issue,
the HV has to look at data structures maintained by a (not necessarily
trustable) Guest OS--possibly corrupting the HV itself.
So while a 3 level system gives you most of what you want in a modern
system, it still has its own problems--that can be solved with a 4th.
mitchalsup@aol.com (MitchAlsup) writes:
BGB wrote:
It seems to me, one should be able to get away with 3 modes:
Machine / ISR;
Supervisor;
User.
I have been thinking about this for a while::
It seems to me that if one wants a robust system, the HyperVisor must
support various serviced-HyperVisors. This second (less privileged HV)
is, in essence, a HV that can crash without allowing the whole system
to crash {just like virtual machines can crash and take their applications
with them.}
Generally there must be a privilege level more privileged than
hypervisor, which controls the hardware - particularly if one
intends to 'schedule' multiple independent (not nested) hypervisors.
Then there is a requirement in the cloud for a nested hypervisor; this
can be done with a paravirtualized hypervisor, at some performance
cost, or with a true hardware supported nesting capability.
Secondly:: Running an ISR at HV level is a privilege inversion issue,
the HV has to look at data structures maintained by a (not necessarily
trustable) Guest OS--possibly corrupting the HV itself.
Modern interrupt virtualization mechanisms (e.g. ARMv8 GICv4.1)
handle guest interrupts completely in the hardware, with no
hypervisor intervention involved in the most common cases
(e.g. software generated interprocessor interrupts, virtual
timer interrupts, message signaled interrupts, et alia).
So while a 3 level system gives you most of what you want in a modern
system, it still has its own problems--that can be solved with a 4th.
Chris M. Thomasson wrote:
On 12/1/2023 6:15 PM, MitchAlsup wrote:
Cores send IPIs by using the little machine.....
Fwiw, how would your system handle this function from Microsoft:
https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
Or, would that be kernel?
Core could send multiple IPIs in a loop or core could send a single IPI
to a kernel function that performs the loop.
Since performing 1 IPI requires 2 STs and does not require waiting on a response, it is probably easier if the core does the loop.
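So a FlushProcessWriteBuffers-style broadcast reduces to a loop like this (a sketch; send_ipi() and the vector number are hypothetical stand-ins for the two stores described above):

#define IPI_MEMBAR_VECTOR 1   /* assumed vector; its handler just fences */

/* Stand-in for the real primitive (per the above, roughly 2 STs). */
static void send_ipi(int core, int vector) { (void)core; (void)vector; }

void flush_process_write_buffers(int self, int ncores)
{
    /* fire-and-forget: no response needs to be awaited */
    for (int c = 0; c < ncores; c++)
        if (c != self)
            send_ipi(c, IPI_MEMBAR_VECTOR);
}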
On 11/24/23 9:43 AM, Robert Finch wrote:
[snip]
There is a lot of value in having a unique architecture.
A uniquely difficult architecture like x86 increases the barrier
to competition both from patents and organizational knowledge and
tools. While MIPS managed to suppress clones with its patent on
unaligned loads (please correct any historical inaccuracy), Intel
was better positioned to discourage software-compatible
competition — and not just financially.
I suspect that the bad reputation of x86 among computer architects
— especially with the biases from Computer Architecture: A
Quantitative Approach which substantially informs computer
architecture education — might also make finding talent more
difficult. However, the prominence of the x86 vendors (working on
something that actually gets produced and used by millions of
people is gratifying) and the challenge of working on a difficult architecture would also attract talent (and perhaps more qualified
talent).
The x86
has had a lot of things bolted on to it. It has adapted over time.
Being able to see how things have changed is valuable.
x86 provides more than one lesson on change/project management.
The binary lock-in advantage of x86 makes architectural changes
more challenging. While something like the 8080 to 8086 "assembly
compatible" transition might have been practical and long-term
beneficial from an engineering perspective, from a business
perspective such would validate binary translation, reducing the
competitive barriers.
(Itanium showed that mediocre hardware translation between x86 and
a rather incompatible architecture (and microarchitecture) would
have been problematic even if native Itanium code had competitive
performance. This seems reminiscent of the Pentium Pro's "issue"
with 16-bit code; both seem to have been at least partially
marketing failures. On the other hand, ARM designed a 64-bit
architecture that is only moderately compatible with the 32-bit
architecture — flags being one example of compatibility — and 32-
bit support is now being mostly left behind for 64-bit
implementations.)
MIPS (even with its delayed branches, lack of variable length
encoding, etc.) would probably be a better architecture in 2023
than x86 was around 2010. The delayed branches might have been
deprecated, VLE might have been added in an additional mode, and
eventually complex-but-useful instructions would probably have
been added. (MIPS would almost certainly have caught SIMD widening
disease and had other temporarily useful extension additions, but
the tradeoffs in 1985 were closer to those of 2023.)
On Fri, 01 Dec 2023 18:37:17 +0000, MitchAlsup wrote:
Have any GBnOoO machines been successful ?
Ah, you don't mean out-of-order 68000 machines. Of which there was only
one, the 68050. You mean "great big not out of order" machines. Of which there were none, the design being, no doubt, so outrageous as to not
even deserve the chance to fail, since it would have no chance to
succeed.
That's a very valid point, but any ISA for a "great big" machine can
have a subset which no longer requires a "great big" machine.
On Wed, 29 Nov 2023 17:15:00 +0000, Quadibloc wrote:
I have now modified the 17-bit shift instructions in the diagram, so
that they can also apply to all 32 integer registers, and I have
corrected the opcodes on the page
http://www.quadibloc.com/arch/cw0101.htm
And now I have completed the process of getting back to where I was
before,
by adding in the page
http://www.quadibloc.com/arch/cw0102.htm
which describes the instructions longer than 32 bits.
On Fri, 01 Dec 2023 22:10:39 +0000, Quadibloc wrote:
That's a very valid point, but any ISA for a "great big" machine can
have a subset which no longer requires a "great big" machine.
Also, as you are well aware, Intel has included both "performance" and
"efficiency" cores in its latest generations of CPUs, similar to the
BIG.little architecture used for some ARM processors.
Well, another way to address the efficiency/little cores being a
waste of space would be to reduce the waste by making them smaller.
If their purpose is to save power consumption when nobody's using the
computer, to just keep the OS alive while it waits for the keyboard
or the mouse to ask it to do something... then they should be made
really little.
But in-order efficiency cores that are there when the demands are very
low?
On Sun, 03 Dec 2023 14:36:37 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup) writes:
Great Big in-order machines (why write non-OoO?):
Of course, no doubt he was thinking of the Itanium, which was one
of the most resounding failures in recent years.
If one goes far enough back, of course, there's the IBM System/360
Model 85. Unlike the Model 91, it was in-order, yet it offered more performance! This was because it had one thing the Model 91 didn't,
a cache.
The Model 85 was actually a failure for IBM in sales terms, but as
that was because of an economic slump at the time it came out, IBM
was not deterred from re-using the design, with a few additions and
tweaks, in the IBM System/370 Model 165 and 168 a few years later.
And those systems were quite successful.
I've already noted that an in-order version of a great big architecture
might make for nice lightweight efficiency cores in a BIG.little type
design. But making those cores in-order has another nice benefit.
No Spectre. No Meltdown. So, when the computer is actually active,
these cores, instead of being a total waste of space, could be put
to use as a ready-made sandbox for executing code sourced from the
Internet.
John Savard
When you deconstruct a GBOoO machine into a LBIO machine you invariably
lose issue width,
which takes the pressure off {TLBs, Caches, Bus, ...}
the pipeline shrinks in stages,
taking even more pressure off those.
Model 85 and 91 were combined into 195 but this still failed compared to
CDC 7600.
On 12/3/2023 12:18 PM, MitchAlsup wrote:
<snip>
Is my assessment (interspersed below) of the effects of this correct?
When you deconstruct a GBOoO machine into a LBIO machine you invariably
lose issue width,
Which reduces performance
which takes the pressure off {TLBs, Caches, Bus, ...}
Which allows savings in ports, etc., thus further reducing gate count,
thus chip size, thus cost.
the pipeline shrinks in stages,
Which reduces the cost of mis-predicted branches, thus counterbalancing "some" of the performance loss from eliminating OoO. Also, further
reduces gate count.
taking even more pressure off those.
Overall, while the direction of the area/cost reduction, and performance
loss are clear, the magnitude of these is more difficult to predict
before actually doing it.
On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
Model 85 and 91 were combined into 195 but this still failed compared to
CDC 7600.
I definitely remembered the Model 195.
Even if the CDC 7600 outsold it, though, in one way the Model 195 was
an enormous success. Its microarchitecture ended up being, in general
terms, copied by the Pentium Pro and the Pentium II.
So, today, all computers are made this way - OoO pipeline plus cache.
John Savard
My last year at AMD ('06), I was working on a 1-wide x86-64; eXcel
simulation indicated ½ the performance at 1/12 the area and likely 1/10
the power.
On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:
My last year at AMD ('06), I was working on a 1-wide x86-64; eXcel
simulation indicated ½ the performance at 1/12 the area and likely 1/10
the power.
Given that OoO is a wildly inefficient way to improve
the single-thread performance of CPUs, which we use
because we don't have anything better, I'm not surprised
you've expressed the wish that more research be done on
using multiple CPUs in parallel.
Myself, I don't believe the parallel programming problem
is solvable; there will always be too many problems that
have critical serial parts that are too big. But that
doesn't mean that I think we're doomed to require big
hot CPUs that hog electricity.
Because the problem of writing small, bloat-free programs
_is_ solvable. Back in the days when all we had was Windows
3.1 running on 386 and 486 processors, that was enough to
do nearly everything we do with computers today.
We could still run word processors, do spreadsheets, even
run _Mathematica_. All most computer users would miss would
be a bit of graphical pizazz.
You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
following one simple rule:: No microarchitectural changes until the
causing instruction retires. AND you can do this without losing
performance.
On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
following one simple rule:: No microarchitectural changes until the
causing instruction retires. AND you can do this without losing
performance.
I thought that the mitigations that _were_ costly in performance
were mostly attempts to approach following just that rule.
Since out-of-order is so expensive in power and transistors,
though, if mitigations do exact a performance cost, then
going to a simple CPU that is not out-of-order might be a
way to accept a loss of performance, but gain big savings in
power and die size, whereas mitigations make those worse.
John Savard
On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:
My last year at AMD ('06), I was working on a 1-wide x86-64; eXcel
simulation indicated ½ the performance at 1/12 the area and likely 1/10
the power.
<snip>
Now, it isn't in the interest of CPU makers and others in
the computer industry for users not to be strongly motivated
to run out and buy newer and faster processors every year
or two. The death of Dennard Scaling, and the tapering off
of Moore's Law, however, are taking the wind out of the sails
of that. Eventually, the improvements will be so minor that
the CPU makers won't have enough *money* to fund fabs that
probe the ultimate limits of feature size any longer.
For some users, CPUs made of some exotic material beyond
silicon that was 10x as fast... but, because of yield
issues, could only be used to make small in-order CPUs,
so the CPUs are only 5x as fast... would be worth almost
any price. Because the parallel programming problem hasn't
been solved, whether or not it can be.
Gallium Arsenide.
And I don't begrudge them such a development, as it would
be a step towards making better performance available to
everyone, as demand drives research into bringing costs
down.
What the rest of us really need is lighter-weight software
that isn't driven by the interests of computer makers instead
of computer users.
John Savard
On 12/3/2023 3:25 PM, Quadibloc wrote:
We could still run word processors, do spreadsheets, even
run _Mathematica_. All most computer users would miss would
be a bit of graphical pizazz.
While I absolutely agree that there are too many resources spent on
"graphical pizzazz", and while you could run many/most of the same
programs, that doesn't mean there is no user benefit from faster CPUs.
For example, you probably could run some simulations, fluid dynamics,
finite element analysis, etc. but you were severely limited in the size
of the program you could run in an acceptable amount of elapsed time.
And applications like servers of various flavors certainly benefit from faster CPUs, as you need fewer of them. Not to mention driving graphics
at more realistic resolutions, etc.
So, no, it wasn't simply the greed of CPU makers that drove us to
higher-performance systems.
Quadibloc wrote:
On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:
We could still run word processors, do spreadsheets, even
run _Mathematica_. All most computer users would miss would
be a bit of graphical pizazz.
Question (to everyone):: Has your word processor or spreadsheet
added anything USEFUL TO YOU since 2000 ??
If not, why are they still adding unused bloat to them ??
My desktops tend to last 7-9 years before blowing out a power
supply transistor. My laptops when the battery dies.
On 12/3/2023 5:25 PM, Quadibloc wrote:
On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:
In my own efforts, I can note that a 50MHz CPU, with programs having
memory footprints measured in MB (or less), is "not entirely useless".
But, looking backwards, I am left to realize that, it seems, I am
nowhere near close to the levels of performance or efficiency of a lot
of these early systems.
Like, seemingly, often it is not so much that the CPU is too weak or
slow, but that my code is still slow. Often, taking for granted
coding practices that were formed in the "relative abundance" of CPU
power in the early 2000s.
In nearly every other area of engineering, the design constraints were relatively constant; but in software, nearly everyone had the mistaken
belief that the exponential increases in computing speed and power would continue indefinitely.
Now it has been steadily falling off, but there has been a sort of
collective denial about it.
Now, it isn't in the interest of CPU makers and others in
the computer industry for users not to be strongly motivated
to run out and buy newer and faster processors every year
or two. The death of Dennard Scaling, and the tapering off
of Moore's Law, however, are taking the wind out of the sails
of that. Eventually, the improvements will be so minor that
the CPU makers won't have enough *money* to fund fabs that
probe the ultimate limits of feature size any longer.
I kinda suspect that when Moore's Law is good and dead, there may
actually be a bit of a back-slide in these areas, as the "best" fabs
will likely be more expensive to run and maintain than the "good but not
the best" fabs, and this will create a back-pressure towards whatever is
"the most bang for the buck" in terms of fab technology.
I also suspect that the transition from the past/current state, to this
state of things, is a world where x86-64 is unlikely to fare well.
Say, in this scenario, x86-64 would be left with an ultimately
unwinnable battle against the likes of ARM and RISC-V.
The exact form things will take will likely depend on a tradeoff:
Whether it is better to have a smaller number of cores getting the best possible single-thread performance;
Or, a larger number of cores each giving comparably worse single-thread performance, but there can be more of them for cheaper and less power.
Say, if you could have cores that only got 1/3 as much performance per
clock, but could afford to have 8x as many cores in total for the same
amount of silicon.
Or, say, people can find ways to make multi-threaded programming not
suck as much (say, if using an async join/promise and/or channels model rather than traditional multi-threading with shared memory and synchronization primitives).
Namely, with such models, it may be possible to make better use of a
many core system, with less pain and overhead than that associated with trying to spawn off large numbers of conventional threads and have them
all sitting around trying to lock mutexes for shared resources.
Though, this does not necessarily map onto "ye olde C" very well,
so effectively one may end up with processes communicating via
something resembling COM objects or similar,
with the side effect that (given the structure of the internal dispatch loops), these "objects" can be self-synchronizing and thus don't need an explicit mutex (but, may potentially need a way for the task scheduler
to queue up in-flight requests, which are then handled asynchronously; possibly with a mechanism in place to indicate whether the request will
block the caller until it will be handled, or whether the caller will
resume immediately, potentially even though the called object has not
yet seen the request).
Things like async/promises could scale a little easier to "well, do this thing, potentially using as many cores as available". Though, async's
don't make as much sense on a primarily or exclusively single-threaded
system, and have an annoying level of overhead if emulated on top of
conventional multithreading (it effectively needs a mutex-protected
work-queue which can itself become a bottleneck).
Ideally, one would need a mechanism to distribute and balance tasks
across the available cores that does not depend on needing to lock a
mutex. Say, for example, maybe using an inter-processor interrupt or
similar to "push" tasks or messages to the other cores, with some shared visible state for "how busy each core is" but not needing to lock
anything to look at this state.
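A minimal sketch of that last idea in C11 (the queue and IPI mechanics are elided and the names invented): each core publishes an approximate load count that anyone may read without taking a lock, and new work is pushed to the least-busy core.

#include <stdatomic.h>

struct core_state {
    _Atomic unsigned load;    /* approximate count of queued tasks */
};

/* Lock-free scan of the advertised loads; a stale read is harmless
   because the value is only a placement hint, never a correctness
   condition. */
static int pick_target(struct core_state *cores, int ncores)
{
    int best = 0;
    unsigned best_load =
        atomic_load_explicit(&cores[0].load, memory_order_relaxed);
    for (int i = 1; i < ncores; i++) {
        unsigned l =
            atomic_load_explicit(&cores[i].load, memory_order_relaxed);
        if (l < best_load) { best_load = l; best = i; }
    }
    return best;   /* caller enqueues the task and IPIs this core */
}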
BGB wrote:
Or, say, people can find ways to make multi-threaded programming not
suck as much (say, if using an async join/promise and/or channels model
rather than traditional multi-threading with shared memory and
synchronization primitives).
If you want multi-threaded programs to succeed you need to start writing
them in a language that is not inherently serial !! It is brain dead
easy to write embarrassingly parallel applications in a language like
Verilog. The programmer does not have to specify when or where a gate
is evaluated--that is the job of the environment (Verilog).....
I kinda suspect that when Moore's Law is good and dead, there may
actually be a bit of a back-slide in these areas, as the "best" fabs
will likely be more expensive to run and maintain than the "good but
not the best" fabs, and this will create a back-pressure towards
whatever is "the most bang for the buck" in terms of fab technology.
I also suspect that the transition from the past/current state, to
this state of things, is a world where x86-64 is unlikely to
fare well.
Things like async/promises could scale a little easier to "well, do
this thing, potentially using as many cores as available".
Gallium Arsenide.
On Mon, 04 Dec 2023 19:54:10 +0000, MitchAlsup wrote:
Gallium Arsenide.
I thought that while Gallium Arsenide was _once_ thought
of as something faster than silicon, Intel had, by using
it as a template for "stretched silicon", managed to
improve silicon enough to make it just as good as Gallium
Arsenide... or, at least, this seemed to be what they
were claiming.
John Savard
mitchalsup@aol.com (MitchAlsup) writes:
BGB wrote:
Or, say, people can find ways to make multi-threaded programming not
suck as much (say, if using an async join/promise and/or channels model
rather than traditional multi-threading with shared memory and
synchronization primitives).
If you want multi-threaded programs to succeed you need to start writing
them in a language that is not inherently serial !! It is brain dead
easy to write embarrassingly parallel applications in a language like
Verilog. The programmer does not have to specify when or where a gate
is evaluated--that is the job of the environment (Verilog).....
That is true, but only really usable when the resulting design
is realized on silicon. Verilog simulations won't win any
speed races, even with verilator.
The pressure against x86-64 is that one needs comparably expensive CPU
cores to get decent performance, whereas ARM and RISC-V can perform acceptably on cheaper cores.
The pressure would be in the direction of best perf/$, which will in
turn be best perf per die area, which is not really a battle that x86 is
likely to win in the longer term.
If ARM or RISC-V catch up and end up being able to deliver more cores
that are faster and cheaper than what is realistically possible for x86
to offer, then x86's days will be numbered.
Overall, while the direction of the area/cost reduction, and performance
loss are clear, the magnitude of these is more difficult to predict
before actually doing it.
Given that OoO is a wildly inefficient way to improve
the single-thread performance of CPUs
Its microarchitecture ended up being, in general
terms, copied by the Pentium Pro and the Pentium II.
On 12/4/2023 4:21 PM, Stefan Monnier wrote:
I kinda suspect that when Moore's Law is good and dead, there may
actually be a bit of a back-slide in these areas, as the "best" fabs
will likely be more expensive to run and maintain than the "good but
not the best" fabs, and this will create a back-pressure towards
whatever is "the most bang for the buck" in terms of fab technology.
AFAIK this future arrived a few years ago: the lowest cost
per-transistor is not on the densest/smallest nodes any more, which is
why many SoCs don't bother to use those densest/smallest nodes.
OK.
I also suspect that the transition from the past/current state, to
this state of things, is a world where x86-64 is unlikely to
fare well.
I suspect that the ISA makes sufficiently little difference at this
point that it doesn't matter too much.
The pressure against x86-64 is that one needs comparably expensive CPU
cores to get decent performance, whereas ARM and RISC-V can perform
acceptably on cheaper cores.
On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by
following one simple rule:: No microarchitectural changes until the
causing instruction retires. AND you can do this without losing
performance.
I thought that the mitigations that _were_ costly in performance
were mostly attempts to approach following just that rule.
Since out-of-order is so expensive in power and transistors,
though, if mitigations do exact a performance cost, then
going to a simple CPU that is not out-of-order might be a
way to accept a loss of performance, but gain big savings in
power and die size, whereas mitigations make those worse.
On Mon, 04 Dec 2023 20:48:55 -0600, BGB wrote:
The pressure against x86-64 is that one needs comparably expensive CPU
cores to get decent performance, whereas ARM and RISC-V can perform
acceptably on cheaper cores.
The pressure would be in the direction of best perf/$, which will in
turn be best perf per die area, which is not really a battle that x86 is
likely to win in the longer term.
If ARM or RISC-V catch up and end up being able to deliver more cores
that are faster and cheaper than what is realistically possible for x86
to offer, then x86's days will be numbered.
I think this reasoning makes a lot of sense.
The trouble is that:
a) x86 has an enormous pool of software, and
b) it is possible to build x86 processors, with current processes,
that anyone can afford, and which have adequate performance, and
c) much of the cost of a computer system is in the box housing
the CPU, not just the CPU itself.
However, in my opinion, x86-64 threw away the biggest advantage of x86, because it repeated the mistake of the 80286. It wasn't designed to
make it easy and trivial for 16-bit Windows programs to run on 64-bit editions of Windows, without resorting to any fancy techniques like virtualization.
Instead, they should just run, without Microsoft having to make much
effort (of course, they would still have to thunk the OS calls).
Then Windows' huge advantage, which carries over to the x86 architecture
as well, the huge pool of software written for it, would be there in
full.
So Windows today seems to be in the situation that all that which is
not bloatware is lost. That makes it easier for a competing architecture
to win, it just has to not make the same mistake. Then lightweight
programs _plus_ less complicated instruction decoding will compound
the performance advantage of an alternate ISA.
John Savard
Scott Lurndal wrote:
That is true, but only really usable when the resulting design
is realized on silicon. Verilog simulations won't win any
speed races, even with verilator.
Because it treats each bit as if it had (at least) 4 states.
Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
but would allow one to use all 1M CPUs in a system; instantly, and
without rewriting anything !
BGB <cr88192@gmail.com> writes:
The pressure against x86-64 is that one needs comparably expensive CPU
cores to get decent performance, whereas ARM and RISC-V can perform
acceptably on cheaper cores.
Are they cheaper?
One of my professors back in the late 70's was researching
data flow architectures. Perhaps it's time to reconsider the
unit of compute using single instructions, instead providing a
set of hardware 'functions' that can be used in a data flow environment.
QEMU does better emulation, but lacks any real way of sharing files with
the host OS.
On 11/24/23 9:49 PM, BGB wrote:
[snip]
Though, apparently, someone posted something recently showing RV64
and ARM64 to be much closer than expected, which is curious. The
main instructions that seem to have "the most bang for the buck"
are ones that ARM64 has equivalents of.
"An Empirical Comparison of the RISC-V and AArch64 Instruction
Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
benchmarks, four scientific and STREAM. Just the fact that STREAM
did not use FMADD for the TRIAD portion slightly penalized AArch64
(though RISC-V will presumably add FMADD if it has not already).
SIMD was excluded based on the reasonable point that RISC-V has
not yet standardized its SIMD extension and "comparing the
different vector instruction sets across AArch64 and RISC-V is
beyond the scope of this initial comparison".
I rather suspect these benchmarks do not provide a good basis for
ISA design targeting minimum path length (much less performance).
The path lengths also varied considerably based on the compiler
version — a more recent version usually helping RISC-V more as
would be expected for a more recent ISA — though the results do
seem to point to general consistency of path length across
versions (one benchmark had negligible change for both ISAs, one
improved AArch64 only, two helped RISC-V only, and one helped both
ISAs but RISC-V more than AArch64).
I am somewhat surprised that indexed memory accesses did not
benefit AArch64 more (for such "scientific" benchmarks). AArch64's
need for a distinct comparison instruction for branches presumably
hurt, especially since loops were not unrolled. (AArch64 does, I
think, include a branch on equal/not-equal zero, so reverse
counted loops would have removed that disadvantage in some cases.)
Both RISC-V and AArch64 are RISC-oriented,
so one would expect the
most common operations to be present as instructions in both. The
differences would be mainly in special instructions (AArch64 has
many), memory addressing (AArch64 has more complex addressing
modes), branches (RISC-V has comparisons on integer values in the
branch instruction, AArch64 can sometimes set condition codes
without an additional instruction), and immediate sizes (AArch64
has larger base immediates — 16-bit vs. 12-bit and ways of
generating some larger immediates).
The special instructions seem unlikely to affect path length much
on such benchmarks and I suspect most of the constants are either
small integers or floating point values. This leaves branches and
memory accesses to affect path length.
A compiler or a web browser would have more interesting
instruction use, I suspect.
The benchmarks used were:
• STREAM [11]
A benchmark for measuring sustained memory bandwidth widely used
in industry, this consists of 4 simple kernels applied to elements
of arrays of size 10,000,000.
• CloverLeaf Serial [10]
A high energy physics simulation solving the compressible Euler
equations on a 2D Cartesian grid. This is broken down into a
series of kernels each of which loops over the entire grid. This
is run with default parameters.
• MiniBUDE [12, 15]
A mini app approximating the behaviour of a molecular docking
simulation used for drug discovery. Run with the bm1 input at 64
poses for one iteration (-n 64 -i 1 --deck /bm1).
• Lattice Boltzmann (LBM)
A d2q9-bgk Lattice Boltzmann algorithm, developed within the HPC
Research Group at the University of Bristol, optimised for serial
execution. Run with a grid size of 128x128 for 100 iterations.
• Minisweep [13]
A radiation transportation mini app reproducing the Denovo Sn
radiation transport behaviour used for nuclear reactor neutronics
modeling. Run with options --ncell_x 8 --ncell_y 16 --ncell_z 32
--ne 1 --na 32
The paper was not really very good.
While some would argue that
excluding cache misses and branch mispredictions from
consideration even for maximum ILP is silly — I do not have a
problem with such in a limit study — the lack of loop unrolling
(or value inference/prediction for incremented values) makes the
results less accurately reflect a true maximum. The benchmarks are
also such that parallelism is much higher than usual.
Comparing performance of ISAs in such a limit study (same
frequency) seems to mostly be comparing the dataflow traits of the
programs rather than the tradeoffs presented by the ISAs, though
there were notable differences when instruction latencies were
allowed to be more realistic.
Fair ISA comparisons are hard. ISA interacts with multiple aspects
of microarchitecture. One could present an optimized
implementation space (with the dimensions of energy, time-to-
completion, and area/yield/cost — one might have to model an
optimum financial binning!), but that would seem to involve an
enormous amount of work even with rough approximations.
More testing may be needed.
Choosing benchmarks (and what to measure) tends to be iterative.
On 12/5/2023 8:29 PM, MitchAlsup wrote:
Quadibloc wrote:
But, the decoder still worked as-is for 32-bit x86, and the CPU isn't
going to be running 16-bit and 64-bit code at the same time, ...
Granted, IIRC an issue was that when Long-Mode-Enable is set, the mode
bit-patterns for 16-bit mode were reused for 64-bit mode (and VM86 mode
went poof as well).
But, otherwise they might have needed to "get creative" and find a way
to encode more CPU modes.
Either way, I would have been happier if MS had included a built-in
emulator for 16-bit stuff.
mitchalsup@aol.com (MitchAlsup) writes:
Scott Lurndal wrote:
That is true, but only really usable when the resulting design
is realized on silicon. Verilog simulations won't win any
speed races, even with verilator.
Because it treats each bit as if it had (at least) 4 states.
Actually, as I learned in HOPL-IV, Verilog won the speed race that
counts, the one against VHDL, because it has been designed around
these 4 states, and implementing them efficiently, whereas VHDL allows
more states.
Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
but would allow one to use all 1M CPUs in a system; instantly, and
without rewriting anything !
Given that simulation efficiency is the reason that Verilog won, your
1-bit Verilog should be a winner. But what do you do about the
high-impedance state of MOS?
- anton
On 12/6/2023 3:04 AM, Andreas Eder wrote:
On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:
QEMU does better emulation, but lacks any real way of sharing files with
the host OS.
Look at what is described here:
https://en.wikibooks.org/wiki/QEMU/FreeDOS
You can simply mount the image (when qemu isn't running) and copy files
to and fro.
Windows can't mount filesystem images...
WSL1 can't do it either, and WSL2 doesn't work on my PC.
On 12/6/2023 11:07 AM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 12/6/2023 3:04 AM, Andreas Eder wrote:
On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:
QEMU does better emulation, but lacks any real way of sharing files with
the host OS.
Look at what is described here:
https://en.wikibooks.org/wiki/QEMU/FreeDOS
You can simply mount the image (when qemu isn't running) and copy files
to and fro.
Windows can't mount filesystem images...
WSL1 can't do it either, and WSL2 doesn't work on my PC.
Maybe it's time to switch to linux? Or at least a dual-boot
setup?
:^) Fwiw, I remember using a lot of those handy harddrive caddies way
back 22'ish years ago. I remember one time I had a lot of them: Solaris,
Linux, WinNT4, WinME, MSDOS, etc...
Anton Ertl wrote:
But what do you do about the
high-impendance state of MOS?
This requires the 4-state model (to mimic anything similar to real
circuitry). In any event, technologies smaller than 30nm no longer allow
this form of logic.
mitchalsup@aol.com (MitchAlsup) writes:
Anton Ertl wrote:
But what do you do about the
high-impendance state of MOS?
This requires the 4-state model (to mimic anything similar to real
circuitry). In any event, technologies smaller than 30nm no longer allow
this form of logic.
But is it prevented by static checking? If not, you still need to
represent it in simulation, and report it as a bug there.
- anton
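For reference, 4-state simulators typically represent this with a
two-plane encoding, which also shows where the roughly 3-4× cost over
2-state bitwise ops comes from. A minimal sketch using the common
aval/bval convention (not any particular simulator's code):

    #include <cstdint>

    // Per bit, (aval, bval): (0,0)=0, (1,0)=1, (0,1)=Z, (1,1)=X.
    // A 4-state AND costs several bitwise ops per 64 bits, versus
    // one op for a 2-state simulator.
    struct Logic64 {
        std::uint64_t aval;  // value plane
        std::uint64_t bval;  // unknown/high-Z plane
    };

    Logic64 and4(Logic64 x, Logic64 y) {
        // 0 AND anything is 0; X and Z otherwise propagate as X.
        std::uint64_t a = x.aval & y.aval;
        std::uint64_t b = (x.bval | y.bval)
                        & (x.aval | x.bval)    // x is not definite 0
                        & (y.aval | y.bval);   // y is not definite 0
        return { a | b, b };
    }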
Paul A. Clayton wrote:
On 11/24/23 9:49 PM, BGB wrote:
[snip]
Though, apparently, someone posted something recently showing RV64
and ARM64 to be much closer than expected, which is curious. The main
instructions that seem to have "the most bang for the buck" are ones
that ARM64 has equivalents of.
"An Empirical Comparison of the RISC-V and AArch64 Instruction
Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
benchmarks, four scientific and STREAM. Just the fact that STREAM
did not use FMADD for the TRIAD portion slightly penalized AArch64
(though RISC-V will presumably add FMADD if it has not already).
SIMD was excluded based on the reasonable point that RISC-V has
not yet standardized its SIMD extension and "comparing the
different vector instruction sets across AArch64 and RISC-V is
beyond the scope of this initial comparison".
I rather suspect these benchmarks do not provide a good basis for
ISA design targeting minimum path length (much less performance).
Both ARM and RISC-V require close to 40% more instructions than My 66000.
So much for minimum path lengths.
AND, no ISA with more than about 200 instructions should be considered
RISC.
The path lengths also varied considerably based on the compiler
version — a more recent version usually helping RISC-V more as
would be expected for a more recent ISA — though the results do
seem to point to general consistency of path length across
versions (one benchmark had negligible change for both ISAs, one
improved AArch64 only, two helped RISC-V only, and one helped both
ISAs but RISC-V more than AArch64).
I am somewhat surprised that indexed memory accesses did not
benefit AArch64 more (for such "scientific" benchmarks). AArch64's
need for a distinct comparison instruction for branches presumably
hurt, especially since loops were not unrolled. (AArch64 does, I
think, include a branch on equal/not-equal zero, so reverse
counted loops would have removed that disadvantage in some cases.)
My data indicates the indexed advantage is in the 2%-3% range.
Both RISC-V and AArch64 are RISC-oriented,
Under a perverted view of what the R in RISC stands for.
[snip]
On 12/7/2023 2:14 PM, Marcus wrote:
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
Yeah.
Load/Store, and doesn't use a "variable number of bytes" encoding scheme (like x86/Z80/6502 variants).
Or, the 'R' could refer more to keeping instruction logic simple, rather
than minimizing the number of instructions that can exist in the
instruction listing.
Well, and probably that it is viable to implement a CPU core for the
entire ISA without needing a microcode ROM or similar.
Admittedly, I feel unease with instructions which violate the Load/Store model, which goes for both my experimental LDOP extension and the RISC-V
'A' extension (where essentially LDOP and 'A' represent the same basic
CPU functionality).
Though, it is "sort of passable" in that it is possible to implement
these by shoving a minimal ALU into the L1 cache, rather than needing to restructure the whole pipeline (as would be needed for a more general x86-like model).
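As a concrete instance of what LDOP/'A' buys at the language level (a
sketch, not BJX2 or RISC-V code per se; on RV64 with 'A', compilers do
emit a single AMO for this):

    #include <atomic>

    // With the 'A' extension this can compile to one amoadd.w;
    // a pure load/store ISA needs a retry loop or a lock instead.
    int bump(std::atomic<int>& counter) {
        return counter.fetch_add(16, std::memory_order_relaxed);
    }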
Then again, not many people are going and being like "The A extension
makes RISC-V no longer RISC".
But, then again, there are people who go on about how "[Rm+Ro*Sc]"
addressing is "Not RISC", nevermind that nearly every other RISC
(besides RISC-V) had included it (whether or not they also included a
way to explicitly encode the scale, or if the scale was baked into the instruction, *).
On 2023-12-06, MitchAlsup wrote:
Paul A. Clayton wrote:
On 11/24/23 9:49 PM, BGB wrote:
[snip]
Though, apparently, someone posted something recently showing RV64
and ARM64 to be much closer than expected, which is curious. The main
instructions that seem to have "the most bang for the buck" are ones
that ARM64 has equivalents of.
"An Empirical Comparison of the RISC-V and AArch64 Instruction
Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
benchmarks, four scientific and STREAM. Just the fact that STREAM
did not use FMADD for the TRIAD portion slightly penalized AArch64
(though RISC-V will presumably add FMADD if it has not already).
SIMD was excluded based on the reasonable point that RISC-V has
not yet standardized its SIMD extension and "comparing the
different vector instruction sets across AArch64 and RISC-V is
beyond the scope of this initial comparison".
I rather suspect these benchmarks do not provide a good basis for
ISA design targeting minimum path length (much less performance).
Both ARM and RISC-V require close to 40% more instructions than My 66000.
So much for minimum path lengths.
AND, no ISA with more than about 200 instructions should be considered
RISC,
I wonder if 200 is a fundamental constant for RISC vs CISC ;-)
I still struggle to find a good definition of "1 instruction".
For me
the definition is loosely "one distinct operation", and so there can be
many variants of one instruction (e.g. variants with register operands
or register + immediate operands all count as a single instruction),
that all carry out the same operation, but with different kinds of
operands or operand sizes.
In my ISA I refer to these as "major instructions", and each
instruction typically has several variants (currently up to 18
variants per major instruction, where different permutations of scalar
and vector register operands count as different variants of a single
instruction, for instance).
If I count this way, I currently have 106 instructions, which by your definition safely puts MRISC32 in the "RISC" camp. However, if I count
every variant as a separate instruction, I blow the budget.
Also worth mentioning is that my current instruction encoding scheme
allows for 1535 major instructions, so there is still plenty of room for extensions (even though I already have pretty complete integer, floating-point and vector support).
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
On 12/7/2023 4:34 PM, MitchAlsup wrote:
BGB wrote:
On 12/7/2023 2:14 PM, Marcus wrote:
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
Yeah.
Load/Store, and doesn't use a "variable number of bytes" encoding
scheme (like x86/Z80/6502 variants).
Does a variable number of words fit this criterion?
Variable number of words is probably OK, otherwise Thumb2 and RVC would
no longer be RISC...
Or, the 'R' could refer more to keeping instruction logic simple,
rather than minimizing the number of instructions that can exist in
the instruction listing.
In the end it is how you fit K instructions through your pipeline in
fewer cycles than someone else can fit 1.4×K instructions through their
pipeline.
I could probably save a number of instructions if BJX2 was not
Load/Store, but worth it?...
Say, without LDOP:
  MOV    16, R6        // load the constant 16 into a scratch register
  MOV.L  (R4, 0), R5   // load the word at [R4+0]
  ADD    R5, R6, R5    // R5 = R5 + R6
  MOV.L  R5, (R4, 0)   // store the result back
Vs, with LDOP:
  ADDS.L 16, (R4, 0)   //* one read-modify-write memory op
Or, maybe go further, and add, say:
  INC.L  (R4)          // increment the word at [R4]
  DEC.L  (R4)          // decrement the word at [R4]
...
Well, and probably that it is viable to implement a CPU core for the
entire ISA without needing a microcode ROM or similar.
There is no microcode in My 66000 1-wide or 6-wide implementations.
But there is no reason one could not build a My 66000 using microcode
should that be the best choice for some implementation.
It is probably not viable to build a {bug for bug} compatible x86
without microcode.
OK.
Admittedly, I feel unease with instructions which violate the
Load/Store model, which goes for both my experimental LDOP extension
and the RISC-V 'A' extension (where essentially LDOP and 'A' represent
the same basic CPU functionality).
Though, it is "sort of passable" in that it is possible to implement
these by shoving a minimal ALU into the L1 cache, rather than needing
to restructure the whole pipeline (as would be needed for a more
general x86-like model).
It is SO EASY to track this dependency based on register forwarding
that creating a LdOp was done for some other reason.
?...
Then again, not many people are going and being like "The A extension
makes RISC-V no longer RISC".
BECAUSE RISC-V is already not RISC (less than 200 instructions)...
Fair enough.
Ironically, if I want to support 'A', this means needing to have the
'LDOP' extension enabled, even if I am not really a fan of the cost or implications of this mechanism...
But, 'A' is needed for 'RV64G', which is annoyingly, what would need to
be supported to be able to have any hope of compatibility with the Linux
on RISC-V ecosystem.
The common superset of BJX2 and RV64G (at least for the userland side of
things) is a bit more complex than I would prefer, though.
Well, along with the annoyance of the CPU core having functionality that
may exist in one ISA but not the other (and I don't want to port over
everything from RISC-V, as this would pollute my own ISA with things
that don't really fit my own vision).
BGB wrote:
Or, say, people can find ways to make multi-threaded programming not
suck as much (say, if using an async join/promise and/or channels
model rather than traditional multi-threading with shared memory and
synchronization primitives).
If you want multi-threaded programs to succeed you need to start writing
them in a language that is not inherently serial !! It is brain dead
easy to write embarrassingly parallel applications in a language like Verilog. The programmer does not have to specify when or where a gate
is evaluated--that is the job of the environment (Verilog).....
Marcus wrote:
On 2023-12-06, MitchAlsup wrote:
Paul A. Clayton wrote:
On 11/24/23 9:49 PM, BGB wrote:
[snip]
Though, apparently, someone posted something recently showing RV64
and ARM64 to be much closer than expected, which is curious. The
main instructions that seem to have "the most bang for the buck"
are ones that ARM64 has equivalents of.
"An Empirical Comparison of the RISC-V and AArch64 Instruction
Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
benchmarks, four scientific and STREAM. Just the fact that STREAM
did not use FMADD for the TRIAD portion slightly penalized AArch64
(though RISC-V will presumably add FMADD if it has not already).
SIMD was excluded based on the reasonable point that RISC-V has
not yet standardized its SIMD extension and "comparing the
different vector instruction sets across AArch64 and RISC-V is
beyond the scope of this initial comparison".
I rather suspect these benchmarks do not provide a good basis for
ISA design targeting minimum path length (much less performance).
Both ARM and RISC-V require close to 40% more instructions than My
66000.
So much for minimum path lengths.
AND, no ISA with more than about 200 instructions should be
considered RISC,
I wonder if 200 is a fundamental constant for RISC vs CISC ;-)
I choose 200 as the upper bound since 100 is obviously too small
{even though I get by with 61} and any vectorized or SIMDed ISA
is way more than 200.
I still struggle to find a good definition of "1 instruction".
1 Instruction is 1 Spelling the assembly language programmer has to
remember.
For me
the definition is loosely "one distinct operation", and so there can be
many variants of one instruction (e.g. variants with register operands
or register + immediate operands all count as a single instruction),
that all carry out the same operation, but with different kinds of
operands or operand sizes.
I hold this same view.
In my ISA I refer to these as "major instructions", and each
instruction typically has several variants (currently up to 18
variants per major instruction, where different permutations of scalar
and vector register operands count as different variants of a single
instruction, for instance).
VVM makes this distinction unnecessary.
If I count this way, I currently have 106 instructions, which by your
definition safely puts MRISC32 in the "RISC" camp. However, if I count
every variant as a separate instruction, I blow the budget.
My 66000 has 61 instructions under this framework. This includes {flow
control, Integer, Logical, Shift, Floating point, Transcendentals,
conversions, privileged, vectorization, and SIMD}.
Also worth mentioning is that my current instruction encoding scheme
allows for 1535 major instructions, so there is still plenty of room for
extensions (even though I already have pretty complete integer,
floating-point and vector support).
My 66000 encoding scheme supports 2048 1-operand instructions at the consumption of 1 Major OpCode. Only the 3-operand subGroup is stressed
for Minor OpCodes.
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
My point was that it should not be redefined into meaninglessness.
On 2023-12-06, MitchAlsup wrote:
Paul A. Clayton wrote:
On 11/24/23 9:49 PM, BGB wrote:
Both ARM and RISC-V require close to 40% more instructions than My 66000.
So much for minimum path lengths.
AND, no ISA with more than about 200 instructions should be considered
RISC,
I wonder if 200 is a fundamental constant for RISC vs CISC ;-)
Both RISC-V and AArch64 are RISC-oriented,
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
[snip]
On 07/12/2023 21:14, Marcus wrote:
On 2023-12-06, MitchAlsup wrote:
Paul A. Clayton wrote:
On 11/24/23 9:49 PM, BGB wrote:
Both ARM and RISC-V require close to 40% more instructions than My 66000.
So much for minimum path lengths.
AND, no ISA with more than about 200 instructions should be considered
RISC,
I wonder if 200 is a fundamental constant for RISC vs CISC ;-)
Both RISC-V and AArch64 are RISC-oriented,
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
[snip]
I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
it meant small and simple instructions, rather than a small number of
different instructions. The idea was that instructions should, on the
whole, be single-cycle and implemented directly in the hardware, rather
than multi-cycle using sequencers or microcode. You could have as many
as you want, and they could be as complicated to describe as you want,
as long as they were simple to implement. (I've worked with a few
PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
/lot/ of instructions!)
On 07/12/2023 21:14, Marcus wrote:
I wonder if 200 is a fundamental constant for RISC vs CISC ;-)
I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
it meant small and simple instructions, rather than a small number of
different instructions.
The idea was that instructions should, on the
whole, be single-cycle and implemented directly in the hardware, rather
than multi-cycle using sequencers or microcode.
You could have as many
as you want, and they could be as complicated to describe as you want,
as long as they were simple to implement. (I've worked with a few
PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
/lot/ of instructions!)
In practice, though I think RISC vs CISC is more often used to
distinguish between positions in a range of tradeoffs common in ISA
design, such as :
* fixed-size, fixed-format instruction codes vs variable encodings
* many orthogonal registers vs fewer specialised registers
* load/store vs advanced addressing modes
* "one thing at a time" vs combing common tasks in one instruction
But there's no clear boundaries.
The original 68k architecture was
always classified as "CISC". Then the later ColdFire versions were
called "Variable instruction length RISC", though there was a 90%
overlap in the ISA.
On 07/12/2023 21:14, Marcus wrote:
I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the
whole, be single-cycle and implemented directly in the hardware, rather
than multi-cycle using sequencers or microcode.
You could have as many
as you want, and they could be as complicated to describe as you want,
as long as they were simple to implement. (I've worked with a few
PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
/lot/ of instructions!)
In practice, though I think RISC vs CISC is more often used to
distinguish between positions in a range of tradeoffs common in ISA
design, such as :
* fixed-size, fixed-format instruction codes vs variable encodings
* many orthogonal registers vs fewer specialised registers
* load/store vs advanced addressing modes
* "one thing at a time" vs combing common tasks in one instruction
David Brown <david.brown@hesbynett.no> writes:
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
[snip]
I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
it meant small and simple instructions, rather than a small number of
different instructions. The idea was that instructions should, on the
whole, be single-cycle and implemented directly in the hardware, rather
than multi-cycle using sequencers or microcode. You could have as many
as you want, and they could be as complicated to describe as you want,
as long as they were simple to implement. (I've worked with a few
PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
/lot/ of instructions!)
Surely then, the PDP-8 can be counted as a RISC processor. There are
only 8 instructions defined by a 3-bit opcode, and due to the
instruction encoding, a single operate instruction can perform multiple (sequential) operations.
000 - AND - AND the memory operand with AC.
001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
011 - DCA - Deposit AC into the memory operand and Clear AC.
100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
101 - JMP - JuMP.
110 - IOT - Input/Output Transfer (see below).
111 - OPR - microcoded OPeRations (see below).
https://en.wikipedia.org/wiki/PDP-8#Instruction_set
David Brown <david.brown@hesbynett.no> writes:
On 07/12/2023 21:14, Marcus wrote:
On 2023-12-06, MitchAlsup wrote:
Paul A. Clayton wrote:
On 11/24/23 9:49 PM, BGB wrote:
Both ARM and RISC-V require close to 40% more instructions than My 66000.
So much for minimum path lengths.
AND, no ISA with more than about 200 instructions should be considered
RISC,
I wonder if 200 is a fundamental constant for RISC vs CISC ;-)
Both RISC-V and AArch64 are RISC-oriented,
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
[snip]
I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
it meant small and simple instructions, rather than a small number of
different instructions. The idea was that instructions should, on the
whole, be single-cycle and implemented directly in the hardware, rather
than multi-cycle using sequencers or microcode. You could have as many
as you want, and they could be as complicated to describe as you want,
as long as they were simple to implement. (I've worked with a few
PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
/lot/ of instructions!)
Surely then, the PDP-8 can be counted as a RISC processor. There are
only 8 instructions defined by a 3-bit opcode, and due to the
instruction encoding, a single operate instruction can perform multiple (sequential) operations.
000 - AND - AND the memory operand with AC.
001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
011 - DCA - Deposit AC into the memory operand and Clear AC.
100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
101 - JMP - JuMP.
110 - IOT - Input/Output Transfer (see below).
111 - OPR - microcoded OPeRations (see below).
https://en.wikipedia.org/wiki/PDP-8#Instruction_set
David Brown <david.brown@hesbynett.no> writes:
Is Coldfire a load/store architecture? If not, it's not a RISC.
On 08/12/2023 16:38, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
(I'm snipping because I pretty much agree with the rest of what you wrote.)
Is Coldfire a load/store architecture? If not, it's not a RISC.
I agree that there's a fairly clear boundary between a "load/store architecture" and a "non-load/store architecture". And I agree that it
is usually a more important distinction than the number of instructions,
or the complexity of the instructions, or any other distinctions.
But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?
Things have changed a lot since the term "RISC" was first coined, and
maybe architectural and ISA features are so mixed that the terms "RISC"
and "CISC" have lost any real meaning. If that's the case, then we
should simply talk about LSA and NLSA architectures, and stop using
"RISC" and "CISC". I don't think trying to redefine "RISC" to mean
something different from its original purpose helps.
On 08/12/2023 16:19, Scott Lurndal wrote:
David Brown <david.brown@hesbynett.no> writes:
On 07/12/2023 21:14, Marcus wrote:
On 2023-12-06, MitchAlsup wrote:
Paul A. Clayton wrote:
On 11/24/23 9:49 PM, BGB wrote:
Both ARM and RISC-V require close to 40% more instructions than My 66000.
So much for minimum path lengths.
AND, no ISA with more than about 200 instructions should be considered
RISC,
I wonder if 200 is a fundamental constant for RISC vs CISC ;-)
Both RISC-V and AArch64 are RISC-oriented,
Under a perverted view of what the R in RISC stands for.
I think that "RISC" much more commonly refers to a "load/store
architecture where all instruction fit well in a pipelined
design". That is, the "R" is very misleading.
[snip]
I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
it meant small and simple instructions, rather than a small number of
different instructions. The idea was that instructions should, on the
whole, be single-cycle and implemented directly in the hardware, rather
than multi-cycle using sequencers or microcode. You could have as many
as you want, and they could be as complicated to describe as you want,
as long as they were simple to implement. (I've worked with a few
PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
/lot/ of instructions!)
Surely then, the PDP-8 can be counted as a RISC processor. There are
only 8 instructions defined by a 3-bit opcode, and due to the
instruction encoding, a single operate instruction can perform multiple
(sequential) operations.
000 - AND - AND the memory operand with AC.
001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
011 - DCA - Deposit AC into the memory operand and Clear AC.
100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
101 - JMP - JuMP.
110 - IOT - Input/Output Transfer (see below).
111 - OPR - microcoded OPeRations (see below).
https://en.wikipedia.org/wiki/PDP-8#Instruction_set
By my logic (such as it is - I don't claim it is in any sense
"correct"), the PDP-8 would definitely be /CISC/. It only has a few >instructions, but that is irrelevant (that was my point) - the
instructions are complex, and therefore it is CISC.
David Brown wrote:
On 08/12/2023 16:38, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
(I'm snipping because I pretty much agree with the rest of what you
wrote.)
Is Coldfire a load/store architecture? If not, it's not a RISC.
I agree that there's a fairly clear boundary between a "load/store
architecture" and a "non-load/store architecture". And I agree that
it is usually a more important distinction than the number of
instructions, or the complexity of the instructions, or any other
distinctions.
Would CDC 6600 be considered to have a LD/ST architecture ??
But does that mean LSA vs. NLSA should be used to /define/ RISC vs
CISC? Things have changed a lot since the term "RISC" was first
coined, and
It HAS been 43 years since being coined.
maybe architectural and ISA features are so mixed that the terms
"RISC" and "CISC" have lost any real meaning. If that's the case,
then we should simply talk about LSA and NLSA architectures, and stop
using "RISC" and "CISC". I don't think trying to redefine "RISC" to
mean something different from its original purpose helps.
The lack of general purpose registers doesn't disqualify it
from the RISC label in my opinion.
Likewise, the complexity that RISC was attempting to address
were instructions like the Vax POLY, MOVC3/MOCV5 and the
queuing instructions (insert & remove).
The entire RISC vs CISC argument seems somewhat contrived
in these modern times.
On 08/12/2023 20:39, MitchAlsup wrote:
David Brown wrote:
On 08/12/2023 16:38, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
(I'm snipping because I pretty much agree with the rest of what you
wrote.)
Is Coldfire a load/store architecture? If not, it's not a RISC.
I agree that there's a fairly clear boundary between a "load/store
architecture" and a "non-load/store architecture". And I agree that
it is usually a more important distinction than the number of
instructions, or the complexity of the instructions, or any other
distinctions.
Would CDC 6600 be considered to have a LD/ST architecture ??
I don't know - that was /long/ before my time!
But does that mean LSA vs. NLSA should be used to /define/ RISC vs
CISC? Things have changed a lot since the term "RISC" was first
coined, and
It HAS been 43 years since being coined.
maybe architectural and ISA features are so mixed that the terms
"RISC" and "CISC" have lost any real meaning. If that's the case,
then we should simply talk about LSA and NLSA architectures, and stop
using "RISC" and "CISC". I don't think trying to redefine "RISC" to
mean something different from its original purpose helps.
Surely then, the PDP-8 can be counted as a RISC processor.
Would CDC 6600 be considered to have a LD/ST architecture ??
For Itanium, binary translation provided better performance on
the same hardware, so it was more evident that the compatibility
had a mediocre performance target.
If you want multi-threaded programs to succeed you need to start writing
them in a language that is not inherently serial !! It is brain dead
easy to write embarrassingly parallel applications in a language like Verilog. The programmer does not have to specify when or where a gate
is evaluated--that is the job of the environment (Verilog).....
Question (to everyone):: Has your word processor or spreadsheet
added anything USEFUL TO YOU since 2000 ??
Scott Lurndal wrote:
The lack of general purpose registers doesn't disqualify it
from the RISC label in my opinion.
Then RISC is a meaningless term.
PDP-8 certainly is simple, and it does not have many instructions, but
it certainly is NOT RISC.
Did not have a large GPR register file
Was Not pipelined
Was Not single cycle execution
Did not overlap instruction fetch with execution
Did not rely on compiler for good code performance
mitchalsup@aol.com (MitchAlsup) writes:
Scott Lurndal wrote:
The lack of general purpose registers doesn't disqualify it
from the RISC label in my opinion.
Then RISC is a meaningless term.
PDP-8 certainly is simple, and it does not have many instructions, but
it certainly is NOT RISC.
Did not have a large GPR register file
Was Not pipelined
Was Not single cycle execution
Did not overlap instruction fetch with execution
Did not rely on compiler for good code performance
Of course the PDP-8 is a RISC. These properties may have been
common among some RISC processors, but they don't define what
RISC is. RISC is a design philosophy, not any particular set
of architectural features.
Tim Rentsch wrote:
mitchalsup@aol.com (MitchAlsup) writes:
Scott Lurndal wrote:
The lack of general purpose registers doesn't disqualify it
from the RISC label in my opinion.
Then RISC is a meaningless term.
PDP-8 certainly is simple, and it does not have many instructions, but
it certainly is NOT RISC.
Did not have a large GPR register file
Was Not pipelined
Was Not single cycle execution
Did not overlap instruction fetch with execution
Did not rely on compiler for good code performance
Of course the PDP-8 is a RISC. These properties may have been
common among some RISC processors, but they don't define what
RISC is. RISC is a design philosophy, not any particular set
of architectural features.
So what we can take from this is that RISC as a term has become meaningless.
Of course the PDP-8 is a RISC. These properties may have been common
among some RISC processors, but they don't define what RISC is. RISC is
a design philosophy, not any particular set of architectural features.
MitchAlsup <mitchalsup@aol.com> schrieb:
If you want multi-threaded programs to succeed you need to start writing
them in a language that is not inherently serial !! It is brain dead
easy to write embarrassingly parallel applications in a language like
Verilog. The programmer does not have to specify when or where a gate
is evaluated--that is the job of the environment (Verilog).....
But it is the job of the programmer to keep everything that can be
parallel in mind... Would you write a compiler, or a word processor, in
Verilog?
How much harder would that be, compared to a serial language?
My personal favorites for parallel programming are PGAS languages
like (yes, you guessed it) Fortran, where the central data
structure is the coarray.
You have to make sure that you synchronize before accessing data
that has been modified on another image.
On 12/7/23 9:36 PM, MitchAlsup wrote:
[snip]
My point on VLEness is that all the position and length
information is
found in the first container of the instruction and not determined by
a serial walk along the containers. IBM 360 is a lot less CISC
than x86.
Serial decode is definitely not RISC.
Small field determines length, pointers, and sizes; remains
RISCable if it does not violate other RISC tenets.
I would have guessed that encoding a length to next length
indicator would also be somewhat simple for decode when the
additional chunk contains only opcode extension that does not
affect routing (which includes some hints) or immediate data. In
terms of parsing the instruction stream into instructions, such is
not different than having a nop with the length specified in *its*
first container. [see ENDNOTE]
My 66000's instruction modifiers seem to add some decoding
complexity in that bits of the container are distributed to the
following instructions (which may themselves be variable length);
clearly, this is considered acceptable complexity. I think a
DOUBLE prefix was also proposed (architected? it was not in the 28
Jan 2020 version that I have) that encoded additional operands
into the prefix, forming a kind of explicit instruction fusion.
(I have a suspicion that a large-chunk instruction encoding with
borrow-lend across chunks could facilitate code density while
providing some of the advantages of fixed-length encoding. I have
not thought about this deeply, but I sense there may be problems
with allowing arbitrary bits to be borrowed. Limiting such to
immediates might reduce the exposure to danger. However, I
suspect that more emphasis should be on targeting an OoO
implementation than on code density.)
=== ENDNOTE ===
An encoding with multiple length specifiers might theoretically
reduce the overhead of encoding the length for the more common
short cases — perhaps by an entire bit!!!!☺ — but in addition to increasing the size overhead for longer instructions it would
split large fields by inserting the length extension specifier.
The extra size overhead also means that 32-bit and 64-bit
immediates could further bloat the instruction by requiring an
additional parcel for only a few bits. One bit of length
information effectively becomes a marker bit per parcel, which is
a technique that was used for some x86 pre-decoded caches and for
one ISA that encoded immediates.
If there were 16 instruction lengths, perhaps a split length
specifier *might* make sense, but My 66000's five instruction
lengths obviously does not take that much space. I believe the
lengths are not fully orthogonal, so it does not take 2.32 bits.
(I think only a store can be longer than 3 parcels, though perhaps
some 3-input compute operations might be theoretically able to use
two constant inputs.)
Yet if the extra bits are not on a critical path (such as register specifiers) such a clunkier mechanism might not be so horrible.
Maybe only 0.2 x86s (the unit of measurement for ISA horror☺).
Gallium Arsenide is 5×: hideously expensive, dangerous to the workers in
the FAB, a chemical-disposal problem, low yield,.....
My VLE encoding (4-bits) deals with constants (±5-bits, 32-bits,
64-bits)
and operand sign control {rs1,rs2..rs1,-rs2..-rs1,rs2..-rs1,-rs2}
The trick is finding where you can position these bits such
that the same bits are used in {1-operand, 2-operand, 3-operand,
and memory references}. This means you can decode them prior to
determining the instruction subGroup. And you cannot move a 5-bit
register specifier.....
the trick to keeping it simple
would be to have _two_ sets of prefix bits, because there are only a few lengths for instructions, and a few lengths for constants used as
immediates - it's only the _combinations_ that get out of control.
On Wed, 13 Dec 2023 05:09:52 +0000, Quadibloc wrote:
the trick to keeping it simple would be to have _two_ sets of prefix
bits, because there are only a few lengths for instructions, and a few
lengths for constants used as immediates - it's only the _combinations_
that get out of control.
Actually, there is one other thing. So that the instructions which use
immediates can have one consistent format, the prefixes for constant
length all need to be the same number of bits, unlike the prefixes for
instruction length, which can have different numbers of bits.
MitchAlsup <mitchalsup@aol.com> schrieb:
Question (to everyone):: Has your word processor or spreadsheet
added anything USEFUL TO YOU since 2000 ??
In my case: Yes.
Besides making many things worse, the new formula editor (since
2010?) in Word is reasonable to work with, especially since it is
possible to use LaTeX notation now (and thus it is now possible
to paste from Maple).
Previously, I actually wrote some reports in LaTeX, going to some
trouble to make them appear visually like the Word template du jour
(but the formulas gave it away, they looked too nice for Word).
Formula _numbering_ - now that, Microsoft managed to make worse
(which simply comes naturally in LaTeX).
And, come to think of it, since Office 365 (I think) they now
allow direct use of svg files as graphics, allowing two
non-braindead ways of including pdf graphics in Word - either
via Inkscape (read as pdf, write as svg) or through command-line
tools (usually via Cygwin).
On 12/7/23 9:36 PM, MitchAlsup wrote:
[snip]
My point on VLEness is that all the position and length information is
found in the first container of the instruction and not determined by
a serial walk along the containers. IBM 360 is a lot less CISC than x86.
Serial decode is definitely not RISC.
Small field determines length, pointers, and sizes; remains RISCable
if it does not violate other RISC tenets.
I would have guessed that encoding a length to next length
indicator would also be somewhat simple for decode when the
additional chunk contains only opcode extension that does not
affect routing (which includes some hints) or immediate data. In
terms of parsing the instruction stream into instructions, such is
not different than having a nop with the length specified in *its*
first container. [see ENDNOTE]
=== ENDNOTE ===
An encoding with multiple length specifiers might theoretically
reduce the overhead of encoding the length for the more common
short cases — perhaps by an entire bit!!!!☺ — but in addition to increasing the size overhead for longer instructions it would
split large fields by inserting the length extension specifier.
The extra size overhead also means that 32-bit and 64-bit
immediates could further bloat the instruction by requiring an
additional parcel for only a few bits. One bit of length
information effectively becomes a marker bit per parcel, which is
a technique that was used for some x86 pre-decoded caches and for
one ISA that encoded immediates.
If there were 16 instruction lengths, perhaps a split length
specifier *might* make sense, but My 66000's five instruction
lengths obviously does not take that much space. I believe the
lengths are not fully orthogonal, so it does not take 2.32 bits.
(I think only a store can be longer than 3 parcels, though perhaps
some 3-input compute operations might be theoretically able to use
two constant inputs.)
Yet if the extra bits are not on a critical path (such as register specifiers) such a clunkier mechanism might not be so horrible.
Maybe only 0.2 x86s (the unit of measurement for ISA horror☺).
On 10/12/2023 11:39, Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Question (to everyone):: Has your word processor or spreadsheet
added anything USEFUL TO YOU since 2000 ??
In my case: Yes.
Besides making many things worse, the new formula editor (since
2010?) in Word is reasonable to work with, especially since it is
possible to use LaTeX notation now (and thus it is now possible
to paste from Maple).
If I want to write something serious with formula, I use LaTeX.
On 10/12/2023 11:39, Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Question (to everyone):: Has your word processor or spreadsheet
added anything USEFUL TO YOU since 2000 ??
In my case: Yes.
Besides making many things worse, the new formula editor (since
2010?) in Word is reasonable to work with, especially since it is
possible to use LaTeX notation now (and thus it is now possible
to paste from Maple).
If I want to write something serious with formula, I use LaTeX.
Previously, I actually wrote some reports in LaTeX, going to some
trouble to make them appear visually like the Word template du jour
(but the formulas gave it away, they looked too nice for Word)
What a strange thing to do - that sounds completely backwards to me!
I was happy when I had made a template for LibreOffice (it might have
been one of the forks of OpenOffice, pre-LibreOffice) that looked
similar to what I have for LaTeX. Then I could make reasonable-looking documents for customers that insisted on having docx format instead of pdf.
I don't think there has been much exciting or important (to me) added to
word processors for decades. Direct pdf generation was one, which
probably existed in Star Office (the ancestor of OpenOffice /
LibreOffice).
Apart from that, the only benefits I see of newer LibreOffice over older
ones is better handling of the insane chaos that MS Office uses for its
file formats. LibreOffice is /much/ better at this than MS Office is, especially if the file has been modified by a number of different MS
Office versions.
On Wed, 13 Dec 2023 12:31:50 +0000, Quadibloc wrote:
I went back and checked. When I proposed what a variable-length coding
for the Concertina instruction set would look like, at first glance it
seemed as though I didn't follow that rule:
Something like:
0 - 16 bits
1 - 32 bits, except
111011001 32 bits + 16 bits
111011010 32 bits + 32 bits
111011011 32 bits + 64 bits
1110111000 32 bits + 48 bits
1110111001 32 bits + 32 bits
1110111010 32 bits + 64 bits
1110111011 32 bits + 128 bits
11110 - 48 bits
11111 - 64 bits
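A hypothetical decoder for that table, to show that the length parse
stays a longest-prefix match on the leading bits of the first 16-bit
parcel (names and packing are mine; only the prefix-to-length mapping
comes from the table above):

    #include <cstdint>

    // Returns the total instruction length in bits, longest prefixes
    // matched first; plain "1" falls through to 32 bits.
    int instr_length_bits(std::uint16_t first_parcel) {
        auto top = [&](int n) { return first_parcel >> (16 - n); };
        if (top(1) == 0b0) return 16;
        switch (top(10)) {               // 10-bit prefixes
            case 0b1110111000: return 32 + 48;
            case 0b1110111001: return 32 + 32;
            case 0b1110111010: return 32 + 64;
            case 0b1110111011: return 32 + 128;
        }
        switch (top(9)) {                // 9-bit prefixes
            case 0b111011001: return 32 + 16;
            case 0b111011010: return 32 + 32;
            case 0b111011011: return 32 + 64;
        }
        if (top(5) == 0b11110) return 48;
        if (top(5) == 0b11111) return 64;
        return 32;
    }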
Paul A. Clayton wrote:
On 12/7/23 9:36 PM, MitchAlsup wrote:
[snip]
My point on VLEness is that all the position and length information is
found in the first container of the instruction and not determined by
a serial walk along the containers. IBM 360 is a lot less CISC than x86.
Serial decode is definitely not RISC.
Small field determines length, pointers, and sizes; remains RISCable
if it does not violate other RISC tenets.
I would have guessed that encoding a length to next length
indicator would also be somewhat simple for decode when the
additional chunk contains only opcode extension that does not
affect routing (which includes some hints) or immediate data. In
terms of parsing the instruction stream into instructions, such is
not different than having a nop with the length specified in *its*
first container. [see ENDNOTE]
If one designs the ISA on the assumption that there will be separate
stages for Fetch and Decode, and I think that's a good idea,
then there are two parses taking place, the external inter-instruction
parse performed by Fetch, and internal instruction field parse by Decode.
The Fetch length parse needs to be simple *except* that Fetch needs to
be able to pick off all conditional and unconditional branch, call, ret,
and consult the branch predictors for which-path information.
Additionally for BRcc/CALL Fetch needs access to the branch offset,
which is an internal parse, to add to its future RIP
in case the predictor says to follow the alternate path.
And, as in my case, there could be multiple branch offset sizes
which interacts with the length parse, sign extension delay,
and the final RIP add result delay.
Everything else is an internal parse by Decode where it is mostly a matter
of chopping things up.
For instructions with immediates, the RISC-V designers seemed very
concerned about sign/zero extension delay and the location of the sign
bit, but I'm not sure why - to me it looks like a single mux delay at
the end. And if all immediates are parsed by Fetch, because it needs
the BR/CALL offset, then these might arrive in the Decode input buffer
already parsed and sign extended.
=== ENDNOTE ===
An encoding with multiple length specifiers might theoretically
reduce the overhead of encoding the length for the more common
short cases — perhaps by an entire bit!!!!☺ — but in addition to
increasing the size overhead for longer instructions it would
split large fields by inserting the length extension specifier.
The extra size overhead also means that 32-bit and 64-bit
immediates could further bloat the instruction by requiring an
additional parcel for only a few bits.
One bit of length
information effectively becomes a marker bit per parcel, which is
a technique that was used for some x86 pre-decoded caches and for
one ISA that encoded immediates.
On 12/13/2023 8:47 AM, EricP wrote:
Paul A. Clayton wrote:
On 12/7/23 9:36 PM, MitchAlsup wrote:
[snip]
Luckily, if one can classify each instruction word into one of:
16-bit op;
32-bit scalar op;
32-bit bundle;
32-bit jumbo prefix.
Then looking at 1 or 2 instruction words for a 1-3 word instruction
isn't too much of an ask.
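As a concrete illustration of that classification, a minimal C sketch;
the bit patterns tested are invented placeholders and do not match
BGB's actual encodings:

    #include <stdint.h>

    /* Hypothetical classifier: each 32-bit fetch word falls into one
       of the four classes named above. */
    enum iw_class { IW_OP16, IW_OP32, IW_BUNDLE32, IW_JUMBO32 };

    static enum iw_class classify(uint32_t w)
    {
        if ((w >> 28) == 0xF) return IW_JUMBO32;   /* jumbo prefix     */
        if ((w >> 28) == 0xE) return IW_BUNDLE32;  /* wide-exec bundle */
        if ((w >> 30) == 0x0) return IW_OP16;      /* 16-bit op(s)     */
        return IW_OP32;                            /* 32-bit scalar op */
    }

With something of this shape, a fetcher finds the extent of a 1-3 word
instruction by classifying at most the first two words.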
Additionally for BRcc/CALL Fetch needs access to the branch offset,
which is an internal parse, to add to its future RIP
in case the predictor says to follow the alternate path.
And, as in my case, there could be multiple branch offset sizes
which interacts with the length parse, sign extension delay,
and the final RIP add result delay.
More or less how it is done in my case, except it works by computing PC
+ one of several different branch-sizes (8s, 11s, and 20s), and if the
corresponding branch hits (matches the pattern and is selected as
"taken") it then uses this output as the destination (via MUX'ing).
David Brown wrote:
On 10/12/2023 11:39, Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Question (to everyone):: Has your word processor or spreadsheet
added anything USEFUL TO YOU since 2000 ??
In my case: Yes.
Besides making many things worse, the new formula editor (since
2010?) in Word is reasonable to work with, especially since it is
possible to use LaTeX notation now (and thus it is now possible
to paste from Maple).
If I want to write something serious with formula, I use LaTeX.
When I want to write an unmisunderstandable formula I use CorelDraw
and then export as *.jpg. {Everything, except NGs like this, can take
*.jpgs.} And a Draw program can create symbols that are not in
character maps.
David Brown wrote:
On 10/12/2023 11:39, Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Question (to everyone):: Has your word processor or spreadsheet
added anything USEFUL TO YOU since 2000 ??
In my case: Yes.
Besides making many things worse, the new formula editor (since
2010?) in Word is reasonable to work with, especially since it is
possible to use LaTeX notation now (and thus it is now possible
to paste from Maple).
If I want to write something serious with formula, I use LaTeX.
Previously, I actually wrote some reports in LaTeX, going to some
trouble to make them appear visually like the Word template du jour
(but the formulas gave it away, they looked too nice for Word).
What a strange thing to do - that sounds completely backwards to me!
I was happy when I had made a template for LibreOffice (it might have
been one of the forks of OpenOffice, pre-LibreOffice) that looked
similar to what I have for LaTeX. Then I could make
reasonable-looking documents for customers that insisted on having
docx format instead of pdf.
I don't think there has been much exciting or important (to me) added
to word processors for decades. Direct pdf generation was one, which
probably existed in Star Office (the ancestor of OpenOffice /
LibreOffice).
*.pdf arrives in Word ~2000 (maybe before).
<snip>
Apart from that, the only benefits I see of newer LibreOffice over
older ones is better handling of the insane chaos that MS Office uses
for its file formats. LibreOffice is /much/ better at this than MS
Office is, especially if the file has been modified by a number of
different MS Office versions.
I still require people sending me *.docx to convert it back to
WORD2003 format *.doc and retransmitting it. It is surprising how many
people don't know how to do that.
On 13/12/2023 23:40, MitchAlsup wrote:
I was happy when I had made a template for LibreOffice (it might have
been one of the forks of OpenOffice, pre-LibreOffice) that looked
similar to what I have for LaTeX. Then I could make
reasonable-looking documents for customers that insisted on having
docx format instead of pdf.
I don't think there has been much exciting or important (to me) added
to word processors for decades. Direct pdf generation was one, which
probably existed in Star Office (the ancestor of OpenOffice /
LibreOffice).
*.pdf arrives in Word ~2000 (maybe before).
Word 2007, according to Wikipedia, google, and the never-wrong internet community.
On 12/13/2023 8:47 AM, EricP wrote:
And, as in my case, there could be multiple branch offset sizes
which interacts with the length parse, sign extension delay,
and the final RIP add result delay.
So, have 1 size for conditional and 1 size for unconditional--
AND THEN, you build a sign extending adder that adds 64+16 and
produces the correct sign extended 64-bit result. Guys it is not
1975 anymore !! Why do you think this is a serial process ??
Also Note:: The address produced by the branch target adder only
needs to produce enough bits to index the cache {you can sort out
all the harder stuff later} in the DECODE stage of the pipeline.
A check for page crossing (because you don't necessarily need to access
the TLB) finishes the problem.
MitchAlsup wrote:
On 12/13/2023 8:47 AM, EricP wrote:
And, as in my case, there could be multiple branch offset sizes
which interacts with the length parse, sign extension delay,
and the final RIP add result delay.
So, have 1 size for conditional and 1 size for unconditional--
I have 2 branch formats, one for small 16b size offset with 16b opspec,
and one for medium 32b and large 64b offsets with 32b opspec.
Later I added a 3rd format for compare and branch when there are
two variable size immediates, one for offset and one for compare value.
The offset is the first immediate so it starts in a known buffer location.
AND THEN, you build a sign extending adder that adds 64+16 and
produces the correct sign extended 64-bit result. Guys it is not
1975 anymore !! Why do you think this is a serial process ??
I didn't say serial.
I was thinking of starting all 3 offsets sizes 16b, 32b, 64b,
adds immediately before knowing the instruction type or size,
then use the actual type and size to select the correct result.
The 64b adds could be further subdivided as 4 * 16b adders then
combine the size select with 16b carry select to assemble a 64b result.
Which is why I said I thought this was just a mux delay at the end.
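A minimal C model of that idea, purely illustrative: all three
sign-extended adds proceed in parallel and the length parse only
steers the final select (in hardware these are three adders and a
3:1 mux, not sequential code):

    #include <stdint.h>

    typedef enum { OFF16, OFF32, OFF64 } off_width;

    /* Compute PC-relative targets for all three offset widths, then
       pick one once the instruction's size/type is known. */
    static uint64_t branch_target(uint64_t rip, uint64_t raw, off_width w)
    {
        uint64_t t16 = rip + (uint64_t)(int16_t)raw;  /* sign-extend 16b */
        uint64_t t32 = rip + (uint64_t)(int32_t)raw;  /* sign-extend 32b */
        uint64_t t64 = rip + raw;                     /* full 64b        */
        switch (w) {                                  /* the "late mux"  */
        case OFF16: return t16;
        case OFF32: return t32;
        default:    return t64;
        }
    }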
Also Note:: The address produced by the branch target adder only
needs to produce enough bits to index the cache {you can sort out
all the harder stuff later} in the DECODE stage of the pipeline.
A check for page crossing (because you don't necessarily need to access
the TLB) finishes the problem.
In the hypothetical design I have in mind the instruction bytes
get parsed from fetch buffers, whose job is to hide the pipeline
latency to I$L1, and also allow prefetch for possible alternate path.
It also allows local looping and replay out of the fetch buffers.
In that design the full 64b parse RIP is needed as a tag for
selecting from the multiple fetch buffers.
On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:
My personal favorites for parallel programming are PGAS languages
like (yes, you guessed it) Fortran, where the central data
structure is the coarray.
Mileage varies considerably and I don't intend to start a language
war: a lot has to do with the history of parallel applications a
person has developed. I can respect your point of view even though I
don't agree with it.
My favorite model is CSP (ala Hoare) with no shared memory. Which is
not to say I don't use threads, but I try to design programs such that threads (mostly) are not sharing writable data structures.
You have to make sure that you synchronize before accessing data
that has been modified on another image.
And that's where most languages fall down: they provide either
primitives which are too low level and too hard to use correctly, or
they provide high level mechanisms that don't scale well and are too
limiting in actual use.
David Brown <david.brown@hesbynett.no> schrieb:
On 10/12/2023 11:39, Thomas Koenig wrote:
Previously, I actually wrote some reports in LaTeX, going to some
trouble to make them appear visually like the Word template du jour
(but the formulas gave it away, they looked too nice for Word).
What a strange thing to do - that sounds completely backwards to me!
When you work at a company that prescribes (sort of) a certain
format, that is one possibility. I did the cover sheet in Word,
though, and pasted it together as PDF.
On 12/14/2023 6:41 AM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup) writes:
<snip>
*.pdf arrives in Word ~2000 (maybe before).
Are you sure about that? IIRC it was a decade later before
adobe wasn't required.
<snip>
I still require people sending me *.docx to convert it back to
WORD2003 format *.doc and retransmitting it. It is surprising how
many people don't know how to do that.
I ask for PDF's. I have no ability to read windows office formats
of any type without using star/open/libre office, and I detest WYSIWYG
word processors of all stripes.
Try to stay far away from windows office docs, they can be filled with interesting macros, well back in the day! I do remember a lot of print
to PDF programs. Mock up a printer device, print, produce a file.
George Neuner <gneuner2@comcast.net> schrieb:
On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:
And that's where most languages fall down: they provide either
primitives which are too low level and too hard to use correctly, or
they provide high level mechanisms that don't scale well and are too
limiting in actual use.
I think Fortran has gotten many things right here, at least for the
domain of scientific computing - the complexity is manageable.
For those who are interested, I've written a short tutorial about
Fortran coarrays, which can be found at
https://github.com/tkoenig1/coarray-tutorial/blob/main/tutorial.md
On 12/6/23 2:54 AM, Anton Ertl wrote:
scott@slp53.sl.home (Scott Lurndal) writes:
[snip]
One of my professors back in the late 70's was researching
data flow architectures. Perhaps it's time to reconsider the
unit of compute using single instructions, instead providing a
set of hardware 'functions' than can be used in a data flow environment.
We already have data-flow microarchitectures since the mid-1990s, with
the success of OoO execution. And the "von Neumann" ISAs have proven
to be a good and long-term stable interface between software and these
data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
and their microarchitectures turned out to be uncompetitive.
I suspect that a superior interface could be designed which
exploits diverse locality (i.e., data might naturally be closer to
some computational resources than to others) and communication
(and storage) costs and budgets (budgets being related to urgency
and importance).
I think the original dataflow architectures
attempted to be very general with significant overhead for
readiness determination and communication. They also (as I
understand) lacked value prediction whereas OoO effectively uses
value prediction for branches.
I changed where, in the opcode space, the supplementary memory-reference instructions were located. This allowed me to have a few more bits
available for them.
Quadibloc <quadibloc@servername.invalid> schrieb:
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
This breaks with the central tenet of the /360, the PDP-11,
the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.
This would make your ISA very un-S/360-like.
On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:
I changed where, in the opcode space, the supplementary
memory-reference instructions were located. This allowed me to have a
few more bits available for them.
I've moved them again, making even more space available... because in my
last change, I made the mistake of using the opcode space that I was
already using for block headers. I couldn't reduce the amount of
information in a block header by two bits, by using a combination of ten
bits instead of eight to indicate a block header, so I had to do my rearranging in this place instead.
Quadibloc <quadibloc@servername.invalid> writes:
[IBM Model 195]
Its microarchitecture ended up being, in general terms, copied by the
Pentium Pro and the Pentium II.
Not really. The Models 91 and 195 only have OoO for FP, not for
integers.
On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
[IBM Model 195]
Its microarchitecture ended up being, in general terms, copied by the
Pentium Pro and the Pentium II.
Not really. The Models 91 and 195 only have OoO for FP, not for
integers.
As do the Pentium Pro and the Pentium II.
(The Motorola 68050 did it
the other way around, only having OoO for integers, and not for FP,
figuring, I guess, that integers are used the most, so this would
create better performance numbers.)
Quadibloc <quadibloc@servername.invalid> writes:
On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
[IBM Model 195]
Its microarchitecture ended up being, in general terms, copied by
the Pentium Pro and the Pentium II.
Not really. The Models 91 and 195 only have OoO for FP, not for
integers.
As do the Pentium Pro and the Pentium II.
This is the first time I have seen that claimed. What makes you think
so?
Everything I have read about the Pentium Pro indicates that it has
complete OoO with speculation and precise exceptions (and neither
speculation nor precise exceptions would work with FP-only OoO, as demonstrated by the Model 91 which has neither and is infamous for its imprecise exceptions).
(The Motorola 68050 did it
the other way around, only having OoO for integers, and not for FP,
figuring, I guess, that integers are used the most, so this would
create better performance numbers.)
According to <https://en.wikipedia.org/wiki/Motorola_68000_series#68050_and_68070>,
there was no 68050. According to
<https://en.wikipedia.org/wiki/68060>.
| The 68060 shares most architectural features with the P5 Pentium.
| Both have a very similar superscalar in-order dual instruction
| pipeline configuration
I.e., no OoO.
- anton
On 12/19/2023 11:34 AM, Quadibloc wrote:
Well, I felt that _some_ compromises had to be made; otherwise, there
was no way instructions with base-index addressing _and_ 16-bit
displacements would fit into 32 bits.
So this isn't a decision I can reverse. Yes, it has its problems,
but it's an unavoidable result of my goal of combining aspects of RISC
and CISC in a single ISA.
As I see it, there are two major situations:
Stack frames and structs, where a 16-bit displacement is likely
overkill;
Global variables, where for the general case it is almost entirely insufficient.
On Tue, 19 Dec 2023 17:19:24 -0600, BGB wrote:
On 12/19/2023 11:34 AM, Quadibloc wrote:
Well, I felt that _some_ compromises had to be made; otherwise, there
was no way instructions with base-index addressing _and_ 16-bit
displacements would fit into 32 bits.
So this isn't a decision I can reverse. Yes, it has its problems,
but it's an unavoidable result of my goal of combining aspects of RISC
and CISC in a single ISA.
As I see it, there are two major situations:
Stack frames and structs, where a 16-bit displacement is likely
overkill;
Global variables, where for the general case it is almost entirely
insufficient.
Yes, but does that mean that 16-bit displacements are a bad idea?
The Motorola 68000, the 8086, the PowerPC, and lots of other
architectures all had them.
So: what are 16-bit displacements _for_? *Local* variables, of course.
Allocate one base register to the start of the data area for a program,
and another base register to the start of the program area for a
program, and you're done.
The architecture also provides _one_ base register that works with
15-bit displacements. This allows instructions to have a smaller format
if that base register is used.
And then there's another seven registers allocated as base registers
that work with 12-bit displacements. If you want to save the base
registers with 16-bit displacements, then you can use those.
On 12/19/2023 6:17 PM, MitchAlsup wrote:
BGB wrote:
[snip]
EMBench is filled with stack frames illustrating RISC-V's 12-bit
immediates are not big enough.
As mentioned before, if you scale the displacement here, it is like it
is 3 bits bigger.
RISC-V is a weak case here because:
The displacements are unscaled;
The displacements are signed.
For stack frames, this effectively loses 4 bits, so RISC-V's 12-bit displacement is effectively more equivalent to 8 bits with my scheme...
Well, combined with the issue that exceeding the +/- 2K limit in RISC-V
sucks (there is no low-cost fallback strategy).
But, generally, not many stack frames seem to have much issue with the current 4K limit.
If it were more of an issue, could potentially add a few ops to extend
the limit to around 32K in XG2 mode. Say:
MOV.Q (SP, Disp12u*8), Rn
MOV.Q Rn, (SP, Disp12u*8)
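A quick arithmetic check of the reach of the two encodings (the
widths come from the discussion above; the comparison itself is just
an illustration):

    #include <stdio.h>

    int main(void)
    {
        /* signed, unscaled 12-bit displacement (RISC-V style) */
        int  rv_lo = -(1 << 11), rv_hi = (1 << 11) - 1;  /* -2048..+2047 */
        /* unsigned 12-bit displacement scaled by 8-byte elements */
        long sc_hi = ((1L << 12) - 1) * 8;               /* 0..32760     */
        printf("signed unscaled 12-bit : %d..%d bytes\n", rv_lo, rv_hi);
        printf("unsigned 12-bit * 8    : 0..%ld bytes\n", sc_hi);
        return 0;
    }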
On 12/19/2023 7:12 PM, Quadibloc wrote:
Yes, but does that mean that 16-bit displacements are a bad idea?
It is, if they end up hurting the encoding in some other way, like
making the register fields smaller or eating too much of the opcode
space.
BGB wrote:
RISC-V is a weak case here because:
The displacements are unscaled;
The displacements are signed.
And yet, it has sucked all the oxygen out of the room..........
Granted, I just ran into a watched benchmark that makes RISC-V look
less optimal than those sucking the oxygen out of the room.
On 12/16/2023 7:28 AM, David Brown wrote:
On 15/12/2023 03:59, Chris M. Thomasson wrote:
On 12/14/2023 6:41 AM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup) writes:
I ask for PDF's. I have no ability to read windows office formats
of any type without using star/open/libre office, and I detest WYSIWYG >>>> word processors of all stripes.
Try to stay far away from windows office docs, they can be filled
with interesting macros, well back in the day! I do remember a lot of
print to PDF programs. Mock up a printer device, print, produce a file.
They are only a problem if you use MS Office. LibreOffice, and its
predecessors, disable the macros by default.
PDF also supports dangerous links and Javascript.
Indeed!
It's not a problem if you use a decent pdf viewer, but if you use
Adobe Acrobat on Windows, you can definitely be at risk.
Well, just make sure the PDF reader has javascript turned off all
around. Trust in it.
On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
[IBM Model 195]
Its microarchitecture ended up being, in general terms, copied by the
Pentium Pro and the Pentium II.
Not really. The Models 91 and 195 only have OoO for FP, not for
integers.
As do the Pentium Pro and the Pentium II. (The Motorola 68050 did it
the other way around, only having OoO for integers, and not for FP,
figuring, I guess, that integers are used the most, so this would
create better performance numbers.)
John Savard
On 21/12/2023 04:00, Chris M. Thomasson wrote:
On 12/16/2023 7:28 AM, David Brown wrote:
On 15/12/2023 03:59, Chris M. Thomasson wrote:
On 12/14/2023 6:41 AM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup) writes:
I ask for PDF's. I have no ability to read windows office formats
of any type without using star/open/libre office, and I detest WYSIWYG
word processors of all stripes.
Try to stay far away from windows office docs, they can be filled
with interesting macros, well back in the day! I do remember a lot of
print to PDF programs. Mock up a printer device, print, produce a file.
They are only a problem if you use MS Office. LibreOffice, and its
predecessors, disable the macros by default.
PDF also supports dangerous links and Javascript.
Indeed!
It's not a problem if you use a decent pdf viewer, but if you use
Adobe Acrobat on Windows, you can definitely be at risk.
Well, just make sure the PDF reader has javascript turned off all
around. Trust in it.
"Trust in it" ?
Some readers /are/ trustworthy. Adobe's are not - Acrobat reader has
endless lists of security holes. I haven't had it installed on a PC for
many years, so things may have changed, but in comparison to any other
reader it was huge, slow, and required continuous upgrading to deal
with vulnerabilities, requiring a reboot of Windows each time. Horrible
software.
On Linux, common readers like evince don't support javascript - you can
trust them!
Huh???
I'm sure Andy Glew would disagree re the PPro!
On Sun, 10 Dec 2023 10:51:03 -0800, Tim Rentsch wrote:
Of course the PDP-8 is a RISC. These properties may have been common
among some RISC processors, but they don't define what RISC is. RISC is
a design philosophy, not any particular set of architectural features.
I can't agree.
Your final sentence may be true enough, but I think that the architectural feature of being a load-store architecture is very much indicative of
whether the RISC design philosophy was being followed. Of course, it isn't absolutely _decisive_, as Concertina II demonstrates.
The PDP-8 is just a very small computer, with a very small instruction
set, designed before the RISC design philosophy was even conceived of.
Tim Rentsch wrote:
mitchalsup@aol.com (MitchAlsup) writes:
Scott Lurndal wrote:
The lack of general purpose registers doesn't disqualify it
from the RISC label in my opinion.
Then RISC is a meaningless term.
The PDP-8 certainly is simple, and it does not have many instructions, but it
certainly is NOT RISC.
Did not have a large GPR register file
Was Not pipelined
Was Not single cycle execution
Did not overlap instruction fetch with execution
Did not rely on compiler for good code performance
Of course the PDP-8 is a RISC. These properties may have been
common among some RISC processors, but they don't define what
RISC is. RISC is a design philosophy, not any particular set
of architectural features.
So what we can take from this is that RISC as a term has become meaningless.
Quadibloc <quadibloc@servername.invalid> writes:
The PDP-8 is just a very small computer, with a very small instruction
set, designed before the RISC design philosophy was even conceived of.
That it was designed before is irrelevant. All that matters is that the
end result is consistent with that philosophy.
[This is long and much less organized than I wished, but I feel
rushed to get this written while I have a day off.]
On 11/24/23 9:49 PM, BGB wrote:
On 11/24/2023 12:24 PM, MitchAlsup wrote:
[snip]
Paul A. Clayton wrote:
I suspect you could write a multi-volume treatise on x86 about
hardware-software interface design and management (including the
social and economic considerations of project/product management).
Ignoring human factors, including those outside the organization
owning the interface, seems attractive to a certain engineering
mindset but human factors are significant design considerations.
It would be more beneficial to the world just to build an
architecture
without any of those flaws--just to show them how it's done.
I thought My 66000 was very close to being completed (with
refinements coming slowly and generally being compatible at the
software level). Yes, there is lots of work getting the proof-of-
concept more publicly recognized and lots of work exploring
details of various implementations.
(Sadly, even if open source high-quality HDL implementations for a
variety of interesting design points were published, My 66000
seems unlikely to get much more adoption than Open RISC. Prophets
have to speak, but people seem at least as likely to kill a
prophet as to accept the prophet's message.)
I wish there were world enough and time for everyone (especially
experts) to publish their experience and wisdom and everyone to
interact with that wisdom, but I can intellectually (if not
emotionally) recognize that recording history is often not as
critical as making history.
People can probably debate what is ideal.
Certainly. Yet there are different degrees of expertise. I believe
I am more qualified to critique an ISA than even most computer
programmers. Mitch Alsup (who has designed hardware for at least
four ISAs — SPARC, x86, Motorola 88k, and an unspecified GPU
architecture as well as done compiler and other related work) is
more qualified to critique an ISA than most professional computer
architects.
There can also be different goals. Critiquing an ISA independent
of its goals is unjust (except as a warning about goal
constraints), but changing the goals to blunt criticism is itself suspect.
An ISA designed for teaching and research (the initial purpose of
RISC-V) is unlikely to be excellent for general-purpose designs.
Features which are elegant tend to be difficult to appreciate for
new students; elegance involves complexity and synergy, which is
more information even though it compresses nicely when the entire
context is known.
There seem to be people around who see RISC-V as the model of
perfection.
I would _like_ to think that all such people are noobs (or people
who use "perfection" rather loosely). I doubt even Mitch Alsup
considers My 66000 the model of perfection in ISA design, "merely"
a model of unusual excellence superior to all other published ISAs
for general purpose computing.
I tend to agree with Mitch though I am still skeptical about VVM
and slightly skeptical about ESM. I trust a hardware designer to
know that VVM is implementable with equivalent Power-Performance-
Area even when I cannot see how, but I am not certain it addresses
all the use cases of SIMD, specifically in-register blocking and isolated/limited SIMD use.
For ESM, I am not confident that idiom
recognition will be cheap enough to avoid the need for special
atomics (again I do basically trust a hardware designer's
expertise) and I disagree mildly about the capacity guarantees. (I
also disagree about the importance of reserving opcodes common for
data as perpetually undefined to add a barrier to executing data.
Since those opcodes could be reclaimed later if somehow opcode
space became scarce, this is a rather trivial objection.)
There might be a few areas where I think AArch64 may benefit from
being less abstracted from the implementation. Load register pair
seems a nice feature; My 66000 could provide such (and more) with
idiom recognition (two or more loads or stores using the same base
address register and slightly different offsets could avoid
multiple accesses in many cases).
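A hedged sketch of what such load-pair idiom recognition might look
like at decode; the struct and field names are invented for
illustration and do not describe My 66000's actual machinery:

    #include <stdbool.h>
    #include <stdint.h>

    struct uop { int op, rd, rbase; int64_t off; };  /* simplified decode */

    /* Two decoded ops can be fused into one double-width access when
       they hit adjacent 8-byte slots off the same base register. */
    static bool fusible_pair(const struct uop *a, const struct uop *b)
    {
        return a->op == b->op            /* both loads (or both stores) */
            && a->rbase == b->rbase      /* same base register          */
            && b->off == a->off + 8      /* adjacent 8-byte slots       */
            && a->rd != a->rbase         /* 1st load must not clobber   */
            && a->rd != b->rd;           /* the base or 2nd destination */
    }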
I do not have a good sense of when idiom recognition should be
preferred over "explicit" encoding. Both introduce complexity in
hardware and compilers. For idiom recognition, an optimizing
compiler adds another consideration for scheduling code and
sometimes choosing whether to do more conceptual work that is
faster (and sometimes less actual work by the hardware) with
uncertainty about the performance impact for different
implementations; high-performance hardware becomes more complex in
having to recognize the idiom and convert it to the
microarchitecture's functional support. Idiom recognition also
has a code density cost. However, simple but complete
implementations are simpler (and not subsetted) than for explicit instructions, some idioms appear without explicit compiler
intention often enough to justify special handling, and handling
such in microarchitecture reduces the interface complexity.
For explicit instructions, a compiler need not use them (in which
case they are useless frills wasting opcode, decoder, and backend
resources) or even know they exist,
but an optimizing compiler
would have to try to recognize idioms to convert to special
instructions. Complete hardware (compatibility issues) must pay
the costs to implement the special instructions. Minor variations
of special instructions that are discovered to be common or useful
require hardware idiom-recognition to convert _and_ compilers are
unlikely to have made any effort to facilitate idiom recognition
and it is more difficult to justify the hardware effort for less
common cases.
Some of the AArch64 conditional instructions seem clever in
exploiting the number of variable operands. My 66000's PRED
provides **MUCH** more flexibility, though at the cost of hardware
complexity and code density.
I agree with Mitch Alsup that having to paste constants together
in software (or load them as if variable data) is suboptimal
generally. (There may be some cases where the importance of static
text size [or working set] justifies the extra effort of a level
of indirection, but such would generally seem to be a performance
loser.)
I disagree, where some things seem to be corner cutting in areas
where doing so is a foot gun, and other areas being needlessly
expensive (and some things in the reaches of "extensions land"
being just kinda absurd).>
In some ways, it is (as I see it) better to define some things and
leave them as optional, rather than define little, and leave
everyone else to make an incoherent mess of things.
One of the benefits of such is being able to approach elegance;
nonce extensions have difficulty appropriating synergy.
I do not really understand the hostility to subsetting.
Then again, likely there is disagreements as to what sorts of
features seem meaningful, wasteful, or needless extravagance.
This is as it should be. Special purpose or experimental features
should be viewed as "wasteful" when the target of those features
is not shared. The contention also concerns the limited space for standardized extensions within a single encoding space.
Standardized extensions can avoid redundant effort and some
incompatibility, but without modes to break-up the encoding space
the more extensions means less free encoding space.
This also introduces the argument about extensions, coprocessors,
and accelerators. Accelerators are obviously least tied to the ISA
interface, but changing an accelerator can be effectively as
incompatible as an ISA change. (Of course, microarchitecture
changes can break software performance.)
RISC-V's early encoding choices were probably quite suitable for a
teaching and research RISC ISA. Research would emphasize easy
extensibility for isolated efforts (VLE and lots of unassigned
opcode space facilitates such). Compatibility is something of an anti-consideration; researchers should be free to add any
functionality they wish without consideration to an "ecosystem".
The commercial interest in open source implementations and even
just license-free ISA use changed the goals. This interest
expanded such that people were considering the possibility of
competing with ARM not just in the microcontroller area but more
generally.
Expanded interest also exposed weakness in organization.
Commercial interests wanted closed-door meetings, open systems
people wanted public information. The "prestige" of a _standard_
extension motivates standardizing more localized extensions, the
limited extension space motivates rushing to stake claims, the
increased value of the opcode space encourages conflict.
Some
think idiom recognition is so cheap that the bar for new
instructions should be high, some think the flexibility of RISC-V
encoding should make the bar low. Some think only "simple"
instructions should be provided, some think complex instructions
can easily be justified. The founders seem to have been,
understandably, unprepared to handle the volume of conflict
resolution involved.
Granted, it does seem like x86 probably needs to be retired at
some point...
Nah. Intel has already proposed expanding the register count to 32
and possibly simplifying some of the architecture (mostly system-
level aspects, I think).
Adding yet another encoding that retained the architectural
features is another possibility, but I doubt Intel/AMD would move
to such an encoding. The value of x86 is primarily legacy
software. Providing a cleaner encoding hints that legacy software
support might be dropped ("why add a radically different encoding
if there is not the intent to drop support for the legacy
encoding?"). That fear would reduce the value of legacy binary
support, increasing the relative attractiveness of ARM or other
alternatives.
I do not see any hope for ISA excellence.
On Fri, 24 Nov 2023 20:49:57 -0600, BGB wrote:
Granted, it does seem like x86 probably needs to be retired at some
point...
While in a certain sense, this is an undoubtedly true statement,
my initial reaction to it was of the ROTFL nature.
There's so much software out there that is distributed only in
binary form that runs only on x86 that retiring the x86 by fiat
while it's still so actively in use just won't happen, no matter
how bad it may be.
This is why I miss the 680x0 architecture so much. If that were
still out there as an active competitor to x86, then because this
alternative _is_ something better... even if not _radically_
better, since it's still CISC, not something really different like
RISC... then I could envisage, over time, the market for it gradually
growing while that for the x86 gradually shrinks.
At the present time, RISC-V and ARM are the contenders. Microsoft has
a version of Windows that runs on ARM. Apple now uses ARM processors
in its current Macintosh computers, and is claiming that their
performance is superior to x86 processors.
Right now, though, there's no real motive for people to go from x86
to ARM.
In time, something surely will happen to change matters, and new
computer architectures will rise up to prominence. Right now, though,
signs of movement away from x86 to something else are few.
John Savard
The 68000 is worse than IA-32, because it does not have general-purpose registers, while IA-32 does. And the 68000 then grew baroque extensions
in the 68020, at a time when the rest of the world already knew that
such things are more hindrance than help.
16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
dropped the compressed instructions, I can fit the entire My 66000 ISA
into the vacated space.....
On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:
The 68000 is worse than IA-32, because it does not have general-purpose
registers, while IA-32 does. And the 68000 then grew baroque extensions
in the 68020, at a time when the rest of the world already knew that
such things are more hindrance than help.
It is true that in addition to eight general registers, the 68000 also
had eight _address_ registers. But in the addressing modes that used
address registers, there was a bit to use a general register instead,
so I don't think one can say that the 68000 didn't have general registers.
As for the 68020: with the 68000, the only address mode that let you form
addresses by adding the contents of two registers to a displacement had a
displacement of eight bits. The 68020 let you use a 16-bit displacement
in that mode. Since base-index addressing is so fundamental to accessing
arrays, I think that the 68020 added at least _one_ thing that was
essential rather than superfluous.
Given that David Patterson, one of the designers of MIPS, was on the
RISC-V design team, though, I can quite understand if many people
expect the RISC-V design to be a paragon of excellence - even before
they had looked at it.
Quadibloc <quadibloc@servername.invalid> writes:
Essential? How often do you use a reg+reg+disp addressing mode where
the displacement does not fit in 8 bits?
Looking at a glibc-2.31 AMD64 binary:
[~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
341019
[~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
124
[~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
119
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Quadibloc <quadibloc@servername.invalid> writes:
Essential? How often do you use a reg+reg+disp addressing mode where
the displacement does not fit in 8 bits?
Looking at a glibc-2.31 AMD64 binary:
[~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
341019
[~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
124
[~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
119
I'm not sure glibc is a representative sample. It's far more likely
for application code to have structures larger than 128 bytes.
Ok, so I measured the main firefox binary (firefox puts a lot of stuff
in shared libraries, so the main binary contains only a part of the
code):
[~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
129114
[~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
134
The 68000 is worse than IA-32, because it does not have
general-purpose registers, while IA-32 does. And the 68000 then grew
baroque extensions in the 68020, at a time when the rest of the world
already knew that such things are more hindrance than help. And the hindrance showed, when the 68040 and 68060 took longer than Intel's counterparts, and much longer than the competing RISCs: The two-wide
50MHz 68060 appeared in the same year as the 4-wide 266MHz 21164.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Quadibloc <quadibloc@servername.invalid> writes:
Essential? How often do you use a reg+reg+disp addressing mode where
the displacement does not fit in 8 bits?
Looking at a glibc-2.31 AMD64 binary:
[~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
341019
[~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
124
[~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
119
I'm not sure glibc is a representative sample. It's far more likely
for application code to have structures larger than 128 bytes.
Ok, so I measured the main firefox binary (firefox puts a lot of stuff
in shared libraries, so the main binary contains only a part of the
code):
[~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
129114
[~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
134
The next part was a copy-paste error. Here's the correct number:
[~:146002] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
12
At least for Firefox your explanation with the larger structures does
not seem to hold. Looking at the larger displacements, many don't
seem to be due to field offsets:
[~:146004] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |uniq -c
1 0x1000
7 0x10000
2 0x1010
8 0x180
2 0x1c0
2 0x2000
6 0x20000
3 0x280
2 0x2b0
1 0x30b
4 0x320
20 0x359d3e2a
8 0x380
1 0x4d0
20 0x5a827999
6 0x600
20 0x6ed9eba1
20 0x70e44324
1 0x8000000
Anyway, in the Firefox binary slightly more than 0.1% of the
instructions have offsets outside the signed 8-bit range. Still does
not seem essential to me.
On 1/1/2024 2:12 PM, Paul A. Clayton wrote:
I wish there were world enough and time for everyone (especially
experts) to publish their experience and wisdom and everyone to
interact with that wisdom, but I can intellectually (if not
emotionally) recognize that recording history is often not as
critical as making history.
Might make sense if Mitch put his specifications and other stuff up on
GitHub or something?...
At least, assuming it is meant to be open.
Then, it becomes something one can look at, at their leisure.
My main concern with PRED is that it seems like it will involve some
amount of implicit architectural state which would need to be dealt with somehow in interrupt handlers (and "pipeline state" is extra hairy).
Well, and also "make hardware do all of this stuff" isn't really part of
my philosophy. Or, effectively, any state that may exist, the interrupt handler needs to make sure it can save/restore it correctly.
I agree with Mitch Alsup that having to paste constants together
in software (or load them as if variable data) is suboptimal
generally. (There may be some cases where the importance of static
text size [or working set] justifies the extra effort of a level
of indirection, but such would generally seem to be a performance
loser.)
Yeah, I can also agree with this.
Though it seems a point of disagreement that I consider jumbo-prefixes
and (occasionally) dropping constants into temporary registers to be
acceptable.
The jumbo-prefix scheme does effectively still break the constant into
pieces, but at least all the pieces get reassembled within a single
clock-cycle (unlike the multi-instruction case).
Does still have the annoyance of needing to have relocs for these cases
(and it is also desirable to try to limit the number of reloc types).
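A minimal sketch of that single-cycle reassembly, with assumed field
widths (24 payload bits per jumbo prefix, 16 in the base instruction)
that do not necessarily match BGB's real encoding:

    #include <stdint.h>

    /* Concatenate the immediate pieces carried by two jumbo prefixes
       and the base instruction into one 64-bit constant. All pieces
       combine in a single step; no adds are needed. */
    static uint64_t assemble_imm(uint32_t jumbo1, uint32_t jumbo0,
                                 uint32_t base_op)
    {
        uint64_t imm = base_op & 0xFFFF;              /* low 16 bits  */
        imm |= (uint64_t)(jumbo0 & 0xFFFFFF) << 16;   /* next 24 bits */
        imm |= (uint64_t)(jumbo1 & 0xFFFFFF) << 40;   /* top 24 bits  */
        return imm;
    }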
I disagree, where some things seem to be corner cutting in areas
where doing so is a foot gun, and other areas being needlessly
expensive (and some things in the reaches of "extensions land"
being just kinda absurd).>
In some ways, it is (as I see it) better to define some things and
leave them as optional, rather than define little, and leave
everyone else to make an incoherent mess of things.
One of the benefits of such is being able to approach elegance;
nonce extensions have difficulty appropriating synergy.
I do not really understand the hostility to subsetting.
Yeah.
Though, I sometimes wonder if defining everything up-front, and then
allowing for implementations to use subsets, may make the ISA spec seem
more threatening.
Say, "Look at all this stuff, all this complexity", when someone doing a minimal implementation can safely ignore "most of it".
Then again, likely there is disagreements as to what sorts of
features seem meaningful, wasteful, or needless extravagance.
This is as it should be. Special purpose or experimental features
should be viewed as "wasteful" when the target of those features
is not shared. The contention also concerns the limited space for
standardized extensions within a single encoding space.
Standardized extensions can avoid redundant effort and some
incompatibility, but without modes to break-up the encoding space
the more extensions means less free encoding space.
This also introduces the argument about extensions, coprocessors,
and accelerators. Accelerators are obviously least tied to the ISA
interface, but changing an accelerator can be effectively as
incompatible as an ISA change. (Of course, microarchitecture
changes can break software performance.)
Yeah.
Then there may also be things like putting devices in MMIO, but then
needing some way to detect if the device, or certain functionality is present.
Options like, "well, write this magic bit pattern to this MMIO register,
read it back, and see how the bits are set" is a little tacky.
Quadibloc wrote:
Given that David Patterson, one of the designers of MIPS, was on the
RISC-V design team, though, I can quite understand if many people
expect the RISC-V design to be a paragon of excellence - even before
they had looked at it.
Hennessy was Stanford MIPS in 1981, Patterson was RISC-1 at Berkeley in 1981.
On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:
The 68000 is worse than IA-32, because it does not have general-purpose
registers, while IA-32 does. And the 68000 then grew baroque extensions
in the 68020, at a time when the rest of the world already knew that
such things are more hindrance than help.
It is true that in addition to eight general registers, the 68000 also
had eight _address_ registers. But in the addressing modes that used
address registers, there was a bit to use a general register instead,
so I don't think one can say that the 68000 didn't have general registers.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Quadibloc <quadibloc@servername.invalid> writes:
Essential? How often do you use a reg+reg+disp addressing mode where
the displacement does not fit in 8 bits?
232294: 48 89 85 08 d7 ff ff mov %rax,-0x28f8(%rbp)
23229b: 48 8d 95 08 d7 ff ff lea -0x28f8(%rbp),%rdx
2322a2: 48 8b 85 70 cd ff ff mov -0x3290(%rbp),%rax
Why isn't 0x3290 in the output of the grep?
Essential? How often do you use a reg+reg+disp addressing mode where
the displacement does not fit in 8 bits?
On 1/3/2024 11:05 AM, MitchAlsup wrote:
My main concern with PRED is that it seems like it will involve some
amount of implicit architectural state which would need to be dealt
with somehow in interrupt handlers (and "pipeline state" is extra hairy).
PRED state is 8-bits in thread-header.
Yeah, but presumably it is a mask that shifts 1 bit per every
instruction in the pipeline. If an interrupt occurs, then whatever state
gets captured needs to be correct WRT the pipeline stage that the
interrupt is captured off of.
Granted, I guess this isn't really too much different (in premise) than needing to get PC / SR / registers into a coherent state.
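A toy model of the behavior guessed at above (the real PRED encoding
is not specified here; the shifting mask is only what the paragraph
presumes):

#include <stdint.h>

/* Toy model: an 8-bit predicate mask whose low bit gates the current
   instruction, shifting right by one as each instruction retires.
   On interrupt, the mask as-shifted is what would need to be saved
   in the thread state. Purely illustrative. */
typedef struct { uint8_t pred; } thread_state;

int pred_step(thread_state *ts)
{
    int execute = ts->pred & 1;   /* low bit predicates this insn */
    ts->pred >>= 1;               /* one bit consumed per insn    */
    return execute;               /* caller skips the insn if 0   */
}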
Well, and also "make hardware do all of this stuff" isn't really part
of my philosophy. Or, effectively, any state that may exist, the
interrupt handler needs to make sure it can save/restore it correctly.
Note: HW is responsible for saving and restoring state in My 66000, not SW.
I did it full software in my case, but mostly to try to save cost on a mechanism that is used comparably infrequently.
Like, need to try to find the cheapest possible mechanism that still
allows state to be saved/restored well enough that the program doesn't
just explode whenever an interrupt occurs.
Though, I sometimes wonder if defining everything up-front, and then
allowing for implementations to use subsets, may make the ISA spec
seem more threatening.
This is my plan ! And it makes the ISA way cleaner than "anyone can add
an extension" RISC-V model.
Yeah.
Consistency, at the tradeoff that now people have to see a full ISA
spec, rather than say:
Integer ISA spec;
FPU ISA spec;
Privileged Mode spec;
...
All as separate specification documents.
Options like, "well, write this magic bit pattern to this MMIO
register, read it back, and see how the bits are set" is a little tacky.
Cores are devices and have a configuration page in configuration space;
you can directly read core capabilities from there. L2s are similar.
So, CPUID is merely a LD to config space.
Traditional way configuration worked as I understood it on older systems
was say (a sketch follows below):
Write values to IO ports;
Read values back;
See if response is what is expected (say, if you only get 00 or FF,
assume hardware is absent or doesn't work);
Hope that some other hardware isn't at that address which totally owns
the PC.
Attempt a read access to an I/O page; if the read returns, the device is present.
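A minimal sketch of that write-and-read-back probe; the port address,
the magic value, and the io_read8/io_write8 helpers are all
assumptions for illustration:

#include <stdint.h>

extern void    io_write8(uint16_t port, uint8_t val);  /* assumed helper */
extern uint8_t io_read8(uint16_t port);                /* assumed helper */

/* Returns nonzero if something at `port` echoes the test pattern. */
int probe_device(uint16_t port)
{
    uint8_t seen = io_read8(port);
    if (seen == 0x00 || seen == 0xFF)  /* floating bus: likely nothing there */
        return 0;
    io_write8(port, 0xA5);             /* magic test pattern */
    return io_read8(port) == 0xA5;     /* did the device latch it? */
}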
On 1/3/2024 10:58 AM, Quadibloc wrote:
On Wed, 03 Jan 2024 15:17:16 +0000, Anton Ertl wrote:
Essential? How often do you use a reg+reg+disp addressing mode where
the displacement does not fit in 8 bits?
Every time I access an array element!
Because presumably the array will be in somewhere in a 64K byte chunk of
memory with an associated USING statement, so I need base register + 16
bit displacement to specify the start of the array, and an index register
to point to the element within the array.
Not sure if this is relevant. If the 64K byte chunk was aligned on a 64K
byte boundary, then we can round a pointer to somewhere in the chunk
down to the nearest 64K byte boundary. This gives us a pointer to the beginning of the chunk. I used this trick in some of my per-thread
memory allocators. To free memory a thread would round the address down
to the nearest chunk size and push the memory into a list. Memory
allocations had to be at least the size of a word, or they would get
rounded up to word size.
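The rounding trick itself, as a sketch (CHUNK_SIZE matches the 64K
example above; it must be a power of two and the chunk must be
allocated on a chunk-aligned boundary):

#include <stdint.h>

#define CHUNK_SIZE ((uintptr_t)1 << 16)   /* 64 KB */

/* Round any pointer into a chunk down to the chunk's base address. */
static inline void *chunk_base(void *p)
{
    return (void *)((uintptr_t)p & ~(CHUNK_SIZE - 1));
}

Freeing then reduces to pushing the block onto the list found at
chunk_base(p), with no per-allocation header needed to locate it.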
It is best to avoid the 64KB limitations altogether; allowing .data to
be "significantly" far away while still allowing single instruction
access.
On Wed, 03 Jan 2024 22:42:17 +0000, MitchAlsup wrote:
It is best to avoid the 64KB limitations altogether; allowing .data to
be "significantly" far away while still allowing single instruction
access.
I agree with that. However, my solution to that is a different
one, which indeed is not so efficient.
Immediates in my design are strictly for immediate mode operations,
and can't also be used as absolute addresses, as you are doing.
Instead, what I have is "array mode", which is a kind of post-indexed indirect addressing (array addresses are put in a short segment that
a special base register points to). So the array address is referenced
by a short displacement, but that means an extra memory access is
needed, instead of the address being in the instruction stream.
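As a sketch of that address computation (register and field names are
illustrative, not the actual Concertina II encoding):

#include <stdint.h>

/* "Array mode": the short displacement selects an entry in a table of
   array base addresses pointed to by a special base register; the
   index register then selects the element. Note the extra memory
   access needed to fetch the array's base address. */
uint64_t array_mode_ea(const uint64_t *array_seg, /* special base register */
                       unsigned disp,             /* short displacement    */
                       uint64_t index,            /* index register        */
                       unsigned elem_size)
{
    uint64_t array_base = array_seg[disp];        /* extra memory access   */
    return array_base + index * elem_size;
}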
Can I modify my instruction format to allow for instead using your
more efficient solution to this problem? There probably is room;
change a 12-bit displacement to an 11-bit displacement, and 11
bits is plenty when I only need six bits...
John Savard
On 1/3/2024 2:55 PM, MitchAlsup wrote:
I did it full software in my case, but mostly to try to save cost on a
mechanism that is used comparably infrequently.
I used the same mechanism for prologue and epilogue sequences, so it gets
used often.
OK.
Though, not having either is technically the cheapest option.
All as separate specification documents.
I have an ISA specification document, how unprivileged SW uses ISA as a
document, and how privileged SW uses ISA as a document; all with cross
document pointers. Having separate documents allows the non-proprietary
ISA to be distributed, allowing full access to ISA but no knowledge of
privileged state. {{There are no privileged instructions, but there is
privileged state.}} I still have the privileged document under NDA.
Hmm...
My stuff is all public (in my GitHub repository), had assumed that
anyone that might want to do their own implementation would be free to
do so.
Also made an effort to avoid anything which lacks prior art from at
least 20 years ago.
OK, I don't have any real performance counters at the ISA level.
The microsecond counter was mostly so that programs using functions like "clock()" wouldn't burn too much CPU time with system calls (for some
types of programs, it is not uncommon to make rapid-fire calls trying to
get the current time in milliseconds or microseconds).
LDD R7,[IP,R3<<3,.L00BK123.foo - .]
Is not an absolute address! IP is added as the base register and "- .",
as part of the displacement, subtracts that very same IP value; that is,
EA = IP + (R3<<3) + (.L00BK123.foo - IP) = .L00BK123.foo + (R3<<3). So,
the displacement is not absolute, but a trick is used to make it smell
as if it were.
Architecture is as much about what to leave out as what to put in.
Quadibloc wrote:
Right now, though, there's no real motive for people to go from x86 to
ARM.
You know, as much as I hate Intel and x86, I hate Apple even more.
On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:
16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
dropped the compressed instructions, I can fit the entire My 66000 ISA
into the vacated space.....
Ouch! 16-bit instructions took up 1/4th of the opcode space of Concertina
II, and that turned out to be too much, and I had to drop them.
But then, RISC-V was designed with little or no regard for code density,
Yes, the PDP-8 did have a small and simple instruction set.
But that is _not_ what the meaning of RISC is commonly understood
to be.
Quadibloc <quadibloc@servername.invalid> writes:
Yes, the PDP-8 did have a small and simple instruction set.
But that is _not_ what the meaning of RISC is commonly understood to
be.
My comments about the PDP-8 and RISC were not about what the meaning of
RISC is commonly understood (or commonly misunderstood)
to be. Rather they are about the meaning of RISC as described by the
people who originally defined the term. Please see my longer response
to John Levine.
BGB wrote:
Also made an effort to avoid anything which lacks prior art from at
least 20 years ago.
Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
I tried rigorously to avoid the ones still in effect. I did borrow a few
of my patents knowing their expiration dates. I also have a clean record
of my <potential> inventions identifying when they were first conceived.
On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Yes, the PDP-8 did have a small and simple instruction set.
But that is _not_ what the meaning of RISC is commonly understood
to be.
My comments about the PDP-8 and RISC were not about what the
meaning of RISC is commonly understood (or commonly misunderstood)
to be. Rather they are about the meaning of RISC as described by
the people who originally defined the term. Please see my longer
response to John Levine.
I'm not sure how this helps you, because the original definition
includes the current common understanding, being a superset of it.
Current common understanding:
All instructions the same length.
Load-store architecture.
Relatively large register file (32 or more registers)
Original definition:
All the above, plus:
All instructions execute in one cycle.
John Savard
I think that the fact that they left 3/4ths of the encoding space to
16-bit instructions shows that they care quite a bit for encoding size.
If they did not, they would not have the C extension (the 16-bit instructions) at all.
On Wed, 03 Jan 2024 14:41:51 +0000, Quadibloc wrote:
On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:
16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
dropped the compressed instructions, I can fit the entire My 66000 ISA
into the vacated space.....
Ouch! 16-bit instructions took up 1/4th of the opcode space of
Concertina II, and that turned out to be too much, and I had to drop
them.
And this, of course, highlights another flaw of Concertina II,
especially when contrasted with My 66000.
Concertina II uses virtually every scrap of available opcode space within
the 32-bit instruction word.
Just recently, I came up with an ingenious
way to add one bit to the available (non-prefix) portion of the
zero-overhead instruction/header (which lets me sneak in an operate instruction using a pseudo-immediate without using a whole 32-bit
instruction slot to provide the three bits needed to reserve space for
the pseudo-immediate value)... which allowed the set of opcodes I
wanted to provide, _and_ allowed me to do a zero-overhead version of
the new extra-long absolute address instruction (only for loads and
stores) as well.
Another recent change to the architecture was including instructions
longer than 32 bits as part of the basic 32-bit instruction set without headers (through "composed instructions")... because I knew I needed
a larger opcode space desperately and couldn't just restrict its
availability to where it could be implemented efficiently.
John Savard
On Wed, 03 Jan 2024 02:47:11 +0000, MitchAlsup wrote:
Quadibloc wrote:
Right now, though, there's no real motive for people to go from x86 to
ARM.
You know, as much as I hate Intel and x86, I hate Apple even more.
I can appreciate the sentiment, as the restrictions on Apple's App Store
mean that iOS devices are simply not an option I can consider. And, of course, Macs tend not to be upgradeable, and this seems to be so that
Apple can charge higher prices.
But there's also Windows on ARM. And there's the whole smartphone
ecosystem of Android. But all these things together don't provide an incentive to leave x86.
PowerPC and SPARC also exist as RISC alternatives, besides ARM and
RISC-V, but they've been forgotten, bypassed, sidelined, or whatever.
John Savard
MitchAlsup wrote:
BGB wrote:
Also made an effort to avoid anything which lacks prior art from at
least 20 years ago.
Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
I tried rigorously to avoid the ones still in effect. I did borrow a few
of my patents knowing their expiration dates. I also have a clean record
of my <potential> inventions identifying when they were first conceived.
IANAL
With the rule change from "first to invent" to "first to file"
is having a date record of inventions any use?
There is also the question of whether writing about something
on the internet counts as "publication" and might block patenting.
A quicky search finds this:
How Publications Affect Patentability https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html
"The Internet: A message describing an invention on a web site or to a
public newsgroup will be considered as published on the day prior to
the posting"
On 1/4/2024 9:09 AM, EricP wrote:
MitchAlsup wrote:
BGB wrote:
Also made an effort to avoid anything which lacks prior art from at
least 20 years ago.
Yes, over my 35+ year career I was exposed to 10s of thousands of
patents.
I tried rigorously to avoid the ones still in effect. I did borrow a few
of my patents knowing their expiration dates. I also have a clean
record of my <potential> inventions identifying when they were first
conceived.
IANAL
With the rule change from "first to invent" to "first to file"
is having a date record of inventions any use?
There is also the question of whether writing about something
on the internet counts as "publication" and might block patenting.
A quicky search finds this:
How Publications Affect Patentability
https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html
"The Internet: A message describing an invention on a web site or to a
public newsgroup will be considered as published on the day prior to
the posting"
My concern was more with the possibility of lawyers being jerks...
But, if one mostly sticks to design features that were already in use
20-30 years ago; there isn't much the lawyers can do...
Granted, one could argue that this does not cover every possible way in
which these features could be combined, which is a possible area for
concern.
Though, for the most part, it seems that the "enforcement" is mostly
used against either direct re-implementations of a patented technology,
or against popular common-use technologies that can be "interpreted" to somehow infringe on a patent (even if the artifact described is often
almost entirely different), rather than going after ex-nihilo hobby
projects or similar.
....
I would really like MS to go back to windows 7 {last one I liked}.....
On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:
I would really like MS to go back to windows 7 {last one I liked}.....
Finally, something we both agree on!
Quadibloc <quadibloc@servername.invalid> writes:
On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:
I would really like MS to go back to windows 7 {last one I liked}.....
Finally, something we both agree on!
Really, there has never been a usable Windows release.....
Unix forever! :-)
To make it slightly less evil, have the address in the workspace pointer
point into an on-chip static RAM instead of external DRAM.
This is headed in the right direction. Make context switching something
easy to pull off.
On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:
To make it slightly less evil, have the address in the workspace pointer
point into an on-chip static RAM instead of external DRAM.
And then when you're switching from one virtualized operating
system to another, you have to do a "big context switch" where
you save and restore all the registers _and_ that on-chip
static RAM!
However, that can be cured.
On Fri, 24 Nov 2023 03:11:17 +0000, MitchAlsup wrote:
This is headed in the right direction. Make context switching something
easy to pull off.
Oh, dear. You've just given me an evil idea.
On a System/360, context switching wasn't too bad. You just save
and restore the 16 general registers and the floating-point
registers.
On a more recent CPU, you might have to save and restore the
general registers, the floating-point registers, and the SIMD
registers.
On Concertina II, in addition to 32 integer registers, 32
floating-point registers, 16 SIMD registers, there are also
eight 64-element vector registers!
On the Texas Instruments TI 9900, there were 16 general registers
which were 16 bits long - but they were in memory, so context
switching was _really_ fast, you just saved and restored the
workspace pointer!
So the evil idea is...
while the CPU does have real registers in order to run at an
acceptable speed, allow it to also run in "slow mode" with
a workspace pointer and all the registers in RAM!
To make it slightly less evil, have the address in the
workspace pointer point into an on-chip static RAM instead
of external DRAM.
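A toy model of the TI-9900-style switch being described (simplified;
the real 9900 BLWP/RTWP mechanics also bank the old WP/PC/ST into the
new workspace, which is omitted here):

#include <stdint.h>

/* The "registers" live in RAM, so a context switch is just repointing
   the workspace pointer plus PC and status -- three stores, with no
   register file to spill. */
typedef struct {
    uint16_t *wp;   /* workspace pointer: 16 registers in memory */
    uint16_t  pc;
    uint16_t  st;
} ctx;

static inline void switch_ctx(ctx *cpu, const ctx *next)
{
    *cpu = *next;
}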
Quadibloc wrote:
On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:
To make it slightly less evil, have the address in the workspace pointer
point into an on-chip static RAM instead of external DRAM.
And then when you're switching from one virtualized operating
system to another, you have to do a "big context switch" where
you save and restore all the registers _and_ that on-chip
static RAM!
I submit the proper place for memory resident register files and
thread-state is in DRAM. Then, writing a single control register
can switch between user threads, and writing 2 control registers
switches between GuestOSs,.....
However, that can be cured.
Yes, by placing the data in the right place at the beginning.
Scott Lurndal wrote:
ARM64 keeps all the VM
state in a small number of system registers that the HV can
swap as necessary.
My 66000 memory maps all control registers so even a remote CPU
can diddle with stuff a local CPU will see instantaneously
{mainly for debug of dead core}.
mitchalsup@aol.com (MitchAlsup) writes:
Quadibloc wrote:
On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:
To make it slightly less evil, have the address in the workspace pointer
point into an on-chip static RAM instead of external DRAM.
And then when you're switching from one virtualized operating
system to another, you have to do a "big context switch" where
you save and restore all the registers _and_ that on-chip
static RAM!
I submit the proper place for memory resident register files and
thread-state is in DRAM. Then, writing a single control register
can switch between user threads, and writing 2 control registers
switches between GuestOSs,.....
Doesn't this cost at least one cache line in L1?
Intel and AMD do this for the virtual machine state, but there's
an access cost to read from dram.
ARM64 keeps all the VM
state in a small number of system registers that the HV can
swap as necessary.
Quadibloc wrote:
On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:
To make it slightly less evil, have the address in the workspace
pointer point into an on-chip static RAM instead of external DRAM.
And then when you're switching from one virtualized operating system to
another, you have to do a "big context switch" where you save and
restore all the registers _and_ that on-chip static RAM!
I submit the proper place for memory resident register files and
thread-state is in DRAM. Then, writing a single control register can
switch between user threads, and writing 2 control registers switches
between GuestOSs,.....
My 66000 memory maps all control registers so even a remote CPU can
diddle with stuff a local CPU will see instantaneously {mainly for debug
of dead core}.
On Fri, 05 Jan 2024 19:49:18 +0000, MitchAlsup wrote:
Quadibloc wrote:
On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:
To make it slightly less evil, have the address in the workspace
pointer point into an on-chip static RAM instead of external DRAM.
And then when you're switching from one virtualized operating system to
another, you have to do a "big context switch" where you save and
restore all the registers _and_ that on-chip static RAM!
I submit the proper place for memory resident register files and
thread-state is in DRAM. Then, writing a single control register can
switch between user threads, and writing 2 control registers switches
between GuestOSs,.....
That would certainly make my "evil idea" less evil.
But, at first glance, that seems like something that
couldn't possibly be true. Registers are in constant
use by the processor, so accessing them should be very
fast. DRAM is slow!
Of course, though, a little bit of context shows that
you're not as badly wrong as you might seem at first
glance. Any computer these days with any pretensions
to efficiency has cache.
Oops: I missed reading "memory-resident" above; you did
not claim that _all_ register files belong in RAM, just
that my idea of having a special internal memory to allow
putting registers in memory was a bad one (which I won't
try to deny).
John Savard
On Fri, 05 Jan 2024 23:15:21 +0000, MitchAlsup wrote:
My 66000 memory maps all control registers so even a remote CPU can
diddle with stuff a local CPU will see instantaneously {mainly for debug
of dead core}.
Oh, darn. I was going to save money by not providing proper cache
coherency hardware in implementations of Concertina II, but that
means I couldn't provide this useful feature!
Just kidding... sort of.
Mapping control registers to RAM is something I would never have
thought of, but I would indeed put pins on the package, the function
of which would be openly documented, to allow accessing chip internals.
My perverted purpose in doing so, though, was not so much for legitimate debugging as to permit my chips to be used in retrocomputing toys...
A computer with a *real front panel* just like in the old days, not
just one like on the Altair that only handles the external memory bus!
As for cache coherency... well, of course that has to be supported
for a computer to actually work the way it's supposed to without
error. However, the way I would handle it is like this:
The CPU only bothers about cache coherency for cached data from
memory that has been _explicitly marked as shared_.
So the
CPUs connected to the same memory have a message bus between them;
when one requests some memory to be shared, it sends a message
out about that, and doesn't use that memory until it gets acknowledged; _then_ the CPUs that are sharing a certain area of memory notify
each other when they write to that area of memory.
The CPUs have to be told - they don't try to keep track of everything
anyone else might be doing on the bus.
However, I haven't really thought through this aspect of CPU chip
design. Since a microprocessor needs to handle the full speed of the
memory bus in order to talk to memory, possibly bus monitoring is
simpler than a conversational protocol handling only the memory that
"needs" to be monitored.
Come to think of it, though, perhaps a CPU needs to be able to do this
both ways - bus monitoring for normal multi-CPU motherboards, and a conversational protocol so the chips can also be used in NUMA systems.
John Savard
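A sketch of the opt-in, message-based scheme described in that post;
the message names and the send/wait primitives are invented for
illustration:

#include <stddef.h>
#include <stdint.h>

enum coh_msg { SHARE_REQ, SHARE_ACK, WRITE_NOTIFY };

extern void send_to_peers(enum coh_msg m, uintptr_t base, size_t len); /* assumed */
extern void wait_for_acks(void);                                       /* assumed */

/* Announce intent to share a region; don't touch it until acked. */
void share_region(uintptr_t base, size_t len)
{
    send_to_peers(SHARE_REQ, base, len);
    wait_for_acks();
}

/* After a store to an explicitly-shared region, tell the peers so
   they can invalidate or update their cached copies. Unshared memory
   never generates traffic, which is the point of the scheme. */
void notify_shared_store(uintptr_t addr, size_t len)
{
    send_to_peers(WRITE_NOTIFY, addr, len);
}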
Quadibloc wrote:
The CPU only bothers about cache coherency for cached data from memory
that has been _explicitly marked as shared_.
So, shared instruction sections are marked exclusive ?!?
So, thread-local-storage is marked shared if a pointer to its data is constructed !?!
Can a Hypervisor share code sections with Guest OS ??
The CPUs have to be told - they don't try to keep track of everything
anyone else might be doing on the bus.
But certainly, when writing a buffer in VA[k] to disk, the core caches
have to be snooped so the disk gets the correct data.
Stuff that is only shared for reading isn't a coherency issue.
BGB wrote:
On 1/4/2024 9:09 AM, EricP wrote:
MitchAlsup wrote:
BGB wrote:
Also made an effort to avoid anything which lacks prior art from at
least 20 years ago.
Yes, over my 35+ year career I was exposed to 10s of thousands of
patents.
I tried rigorously to avoid the ones still in effect. I did borrow a few
of my patents knowing their expiration dates. I also have a clean
record of my <potential> inventions identifying when they were first
conceived.
IANAL
With the rule change from "first to invent" to "first to file"
is having a date record of inventions any use?
There is also the question of whether writing about something
on the internet counts as "publication" and might block patenting.
A quicky search finds this:
How Publications Affect Patentability
https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html
"The Internet: A message describing an invention on a web site or to a
public newsgroup will be considered as published on the day prior to
the posting"
My concern was more with the possibility of lawyers being jerks...
I can alleviate your concerns--they are.
But, if one mostly sticks to design features that were already in use
20-30 years ago; there isn't much the lawyers can do...
And written in books or published in papers.
Granted, one could argue that this does not cover every possible way in
which these features could be combined, which is a possible area for
concern.
Though, for the most part, it seems that the "enforcement" is mostly
used against either direct re-implementations of a patented technology,
or against popular common-use technologies that can be "interpreted" to
somehow infringe on a patent (even if the artifact described is often
almost entirely different), rather than going after ex-nihilo hobby
projects or similar.
Also note: if you are not making money by using something claimed in their patent, they can sue but they cannot get any money. So, it is not worth
their time.....
Quadibloc wrote:
Since out-of-order is so expensive in power and transistors, though, if
mitigations do exact a performance cost, then going to a simple CPU
that is not out-of-order might be a way to accept a loss of
performance, but gain big savings in power and die size, whereas
mitigations make those worse.
18 years ago, when I quit building CPUs professionally, GBOoO
performance was 2× what a 1-wide IO could deliver. In those 18 years
the CPU makers have gone from 2× to 3× performance while the execution window has grown from 48 to 300 instructions.
Clearly an unsustainable µArchitectural direction.
All instructions the same length.
So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
Good to know.
Thomas Koenig <tkoenig@netcologne.de> writes:
All instructions the same length.
So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
Good to know.
RISC-V without the C extension would be, but the C extension would
make it non-RISC. Likewise ARM A32 would be, but A32/T32 would not.
Power has instructions that are not 32 bits in size? Since when?
On 1/5/2024 9:01 AM, Quadibloc wrote:
On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
Quadibloc <quadibloc@servername.invalid> writes:
On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:
I would really like MS to go back to windows 7 {last one I liked}.....
Finally, something we both agree on!
Really, there has never been a usable Windows release.....
Unix forever! :-)
It certainly is true that Linux has some major advantages. People
have had to put up with Windows, though, because some software is
only available for it.
Windows merits:
More software support;
Has nearly all of the games;
No endless fights with trying to get the GPU and sound hardware working;
Much less needing to fight with hardware driver issues in general;
....
Linux merits:
You can mount nearly anything anywhere;
Can do low-level HDD copies, have more freedom for how to partition and format drives, more available filesystems, ...
Accessing files on Linux is generally significantly faster (though, allegedly, this isn't so much because of the filesystem itself, but
rather because antivirus software and Windows Defender tend to hook the filesystem access and scan every file that is being read/written, ...).
Though, in a Windows style environment, it is generally preferable to
have a small number of comparably large files, than a large number of
small files.
General coding experience is not that much different either way.
If one sticks to mainstream languages and writes code in a portable way,
they can use mostly similar code on either (apart from code dealing with
the parts that differ).
....
On 1/6/2024 12:50 PM, MitchAlsup wrote:
BGB wrote:
On 1/5/2024 9:01 AM, Quadibloc wrote:
On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
Quadibloc <quadibloc@servername.invalid> writes:
On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:
I would really like MS to go back to windows 7 {last one I
liked}.....
Finally, something we both agree on!
Really, there has never been a usable Windows release.....
Unix forever! :-)
It certainly is true that Linux has some major advantages. People
have had to put up with Windows, though, because some software is
only available for it.
Windows merits:
More software support;
Has nearly all of the games;
No endless fights with trying to get the GPU and sound hardware working;
Much less needing to fight with hardware driver issues in general;
....
Linux merits:
You can mount nearly anything anywhere;
Can do low-level HDD copies, have more freedom for how to partition
and format drives, more available filesystems, ...
You can back the whole thing up such that recovery is but a DD away.
I often use Linux + DD to do low level copies of HDDs, which mostly
works (and can often get an OK drive copy), except in cases where people ignored the drive failing for long enough that it is basically entirely failed, and then this is turned into a massive pain (modern Linux seems
to drop drives about as soon as it encounters an irrecoverable IO
error, which is super annoying for data recovery tasks).
For my main PC, mostly still running Windows.
For the most part, "everything just works", except when MS is doing
something annoying.
May or may not "jump ship" at some point though unless MS backs off on
some of the stuff they pulled with Win11 (if/when Win10 starts to get unusable).
On 1/6/2024 10:21 AM, Quadibloc wrote:
On Mon, 04 Dec 2023 20:03:47 +0000, MitchAlsup wrote:
Quadibloc wrote:
Since out-of-order is so expensive in power and transistors, though, if
mitigations do exact a performance cost, then going to a simple CPU
that is not out-of-order might be a way to accept a loss of
performance, but gain big savings in power and die size, whereas
mitigations make those worse.
18 years ago, when I quit building CPUs professionally, GBOoO
performance was 2× what a 1-wide IO could deliver. In those 18 years
the CPU makers have gone from 2× to 3× performance while the execution
window has grown from 48 to 300 instructions.
Clearly an unsustainable µArchitectural direction.
Yes, the law of diminishing returns means that even if Moore's Law
still lives on, they can't go _much_ further in that direction.
Yes.
And, even then, 2x .. 3x vs a 1-wide isn't THAT big of an advantage,
given the GBOoO is going to use a lot more die area and power.
<snip>
But do they have any other directions they can go in to get more
performance?
We have heard of a few:
1) Switch to a new, faster, semiconductor material if it becomes
possible.
3) If we can't make the processors faster, provide more of them.
This is being done - first they put two CPUs on a chip, then four,
and now we're seeing quite a few.
This is where I had assumed small static scheduled CPUs could have merit.
This is where I had assumed small static scheduled CPUs could have merit.
OoO costs roughly 3× In Order power and provides 1.4× performance (hand
waving accuracy). GB, on the other hand, costs roughly 4× and provides
1.4× performance. So, overall, the last factor of 2× in performance
costs 12× in area and power, and such designs are generally surrounded
with larger caches to keep up with the larger throughput, raising the
area (but not so much the power) again.
OK.
I guess the question is, say, the cost/benefit tradeoffs between OoO vs static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar to
an in-order superscalar, except possibly a little cheaper since it can
leave out one of the expensive parts of an in-order superscalar...).
*: Say, designed for a maximum of 2 or 3 instructions/clock, with
explicit tagging for parallel execution (where the 'V' in 'VLIW'
seemingly tends to also imply wider execution often with an absence of
useful things like interlock handling or register forwarding...).
Also, assuming that one has a "doesn't suck" compiler for it...
....
BGB wrote:
I guess the question is, say, the cost/benefit tradeoffs between OoO vs
static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar
to an in-order superscalar, except possibly a little cheaper since it
can leave out one of the expensive parts of an in-order
superscalar...).
*: Say, designed for a maximum of 2 or 3 instructions/clock, with
explicit tagging for parallel execution (where the 'V' in 'VLIW'
seemingly tends to also imply wider execution often with an absence of
useful things like interlock handling or register forwarding...).
Also, assuming that one has a "doesn't suck" compiler for it...
Is there a question here ??
What if the goal isn't "fastest single-thread performance", but instead,
best performance relative to die area and per watt?...
I can also note that I am still using a cellphone with running on
in-order Cortex A53 cores...
Like, seemingly ARM has gone one direction, moving to primarily OoO
cores for newer designs, but then a lot of cellphones are seemingly
like, "Meh, whatever, we will just stick with an 8x Cortex-A53 chip from
MediaTek..."
But, if OoO were clearly superior, presumably people would have stopped
using the Cortex-A53 ...
But, there were still chips being released in 2023 using exclusively A53
cores (and they appear to still be popular in cellphones).
Say, for example: https://en.wikipedia.org/wiki/Moto_E7
(Though, this is a model from 2020/2021).
So, rather than (V)LIW competing against OoO, maybe it can compete
against in-order superscalar? ...
Or, with the higher end of the microcontroller space?...
My thinking is not so much that one should have an ISA that mandates
VLIW, but instead, focuses on avoiding a few of the expensive parts of
in-order superscalar (namely the logic for figuring out whether
instructions can be executed in parallel).
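A sketch of what that explicit tagging might look like at fetch time
(the tag-bit position and the bundle encoding are invented for
illustration):

#include <stdint.h>

#define WEX_BIT (1u << 31)   /* hypothetical "next insn runs with me" bit */

/* Count how many consecutive instructions form one parallel bundle.
   The fetch stage just follows the chain of tag bits; it never has to
   discover the parallelism itself, which is the cost being avoided
   relative to an in-order superscalar. */
unsigned bundle_width(const uint32_t *insns, unsigned max_width)
{
    unsigned n = 1;
    while (n < max_width && (insns[n - 1] & WEX_BIT))
        n++;
    return n;
}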
On Sun, 07 Jan 2024 01:14:52 -0600, BGB wrote:
What if the goal isn't "fastest single-thread performance", but instead,
best performance relative to die area and per watt?...
If _that_ were the case, we would _already_ be using in-order CPUs, and
the wasteful nature of out-of-order execution would have precluded its
adoption entirely.
And OoO isn't the _only_ wretchedly excessive thing about today's
microprocessors. The small feature sizes that allow a single die to
contain eight complete CPUs with a great big out-of-order design
are attained by means of chip fabs that cost billions of dollars to
build. Couldn't we have just stopped at, say, 33nm or something?
The competitive demands Intel and AMD face - the desires of us as
consumers - are what prevents this from happening, and I see no hope
for the world to change to what might be seen as the path of virtue in
this area.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Power has instructions that are not 32 bits in size? Since when?
Since version 3.1 of the ISA (commonly known as Power10), they have the
prefixed instructions, which take up two 32-bit words. An example:
[tkoenig@cfarm120 ~]$ cat add.c
unsigned long int foo(unsigned long x)
{
return x + 0xdeadbeef;
}
[tkoenig@cfarm120 ~]$ gcc -c -O3 -mcpu=power10 add.c
[tkoenig@cfarm120 ~]$ objdump -d add.o
add.o: file format elf64-powerpcle
Disassembly of section .text:
0000000000000000 <foo>:
0: ad de 00 06 paddi r3,r3,3735928559
4: ef be 63 38
8: 20 00 80 4e blr
There is a restriction that the prefixed instructions cannot
cross a 64-byte boundary.
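A sketch of that restriction: with 4-byte instruction alignment, an
8-byte prefixed instruction crosses a 64-byte boundary exactly when its
prefix word sits at offset 60 (0x3c) within a block, which is where an
assembler has to pad:

#include <stdint.h>

/* True if a prefixed (8-byte) instruction starting at byte address pc
   would cross a 64-byte boundary; a nop at such an offset pushes the
   instruction into the next block. */
static int needs_align_nop(uint64_t pc)
{
    return (pc & 63) == 60;   /* suffix word would land in the next block */
}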
On Sat, 06 Jan 2024 22:43:58 +0000, MitchAlsup wrote:
BGB wrote:
I guess the question is, say, the cost/benefit tradeoffs between OoO vs
static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar
to an in-order superscalar, except possibly a little cheaper since it
can leave out one of the expensive parts of an in-order
superscalar...).
*: Say, designed for a maximum of 2 or 3 instructions/clock, with
explicit tagging for parallel execution (where the 'V' in 'VLIW'
seemingly tends to also imply wider execution often with an absence of
useful things like interlock handling or register forwarding...).
Also, assuming that one has a "doesn't suck" compiler for it...
Is there a question here ??
I think it's clear what the _answer_ is:
"You just described the Itanium. It failed big time, so your answer
is no."
Now, if you don't know the question, but you do have the answer, if it's something as enigmatic as "42", and you only have a vague description of
the question: "The great question of life, the Universe, and everything", then the process of recovering the actual wording of the question can be
very convoluted, involving pan-dimensional beings disguising themselves
as white mice.
However, in this case, I don't think it's that difficult.
To be, or not to be, that is the question.
Whether 'tis nobler in the mind to suffer the thermal issues and
excessive power consumption resulting from the outrageous transistor
counts of Great Big Out-of-Order microarchitectures,
or to oppose them with an ISA which directly handles the pipeline in
VLIW or even RISC fashion, and by opposing them, end them...
I recall that I derived the following understanding of _your_
answer to this question some time ago, but I may have misunderstood
what you were writing:
(begin my description of what I think your answer is)
VLIW-style ISAs have failed to serve as a replacement for OoO
execution.
But that does not mean we are without hope of finding something
better. The problem is that the standard textbooks have failed to
properly represent what OoO is _for_.
The scoreboard in the Control Data 6600 is just briefly mentioned,
and then it's noted that it couldn't solve all the hazards related
to RAW and WAR and so on, and then the Tomasulo came along for the
IBM System/360 Model 91, and did it _right_.
That misses the fact that register hazards aren't the only thing
that OoO execution helps with. It also helps with *cache misses*.
And the 6600-style scoreboard is adequate to deal with cache misses.
Therefore, if you want to make a computer that replaces today's
bloated GBOoO designs, without the transistor bloat, but which
offers performance that competes with them, what you need to do
is indeed take care of the register hazards the way RISC
architectures have done... but then, instead of abolishing OoO
from your design after you've done that, keep the basic and
reasonable 6600-style scoreboard so that cache misses don't
kill your performance.
(end description)
I may have gotten it badly wrong, as I pieced it together from
little things you wrote here and there on various occasions.
But at least now we have a straw man to point at and debate.
John Savard
Thomas Koenig <tkoenig@netcologne.de> writes:
There is a restriction that the prefixed instructions cannot
cross a 64-byte boundary.
Ouch. This means that Power with prefixed instructions is the second instruction set (after MIPS with its architectural delayed loads)
where concatenating instruction blocks between two labels may result
in invalid code; on all other (~10) instruction sets I looked at this
works fine, including IA-64. Fortunately, for Power that's easy to
fix by compiling with -mno-prefixed,
Quadibloc <quadibloc@servername.invalid> tried to write:
The competitive demands Intel and AMD face - the desires of us as
consumers - are what prevents this from happening, and I see no hope for
the world to change to what might be seen as the path of virtue in this
area.
Nobody forces you to replace your CPU with one with a denser process. If
you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
can get a Raspi 3, where the SoC is made in 40nm (according to <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
uses in-order processing as well.
Quadibloc wrote:
That misses the fact that register hazards aren't the only thing that
OoO execution helps with. It also helps with *cache misses*.
One CAN solve the other hazards with another SB, should one choose.
Fortunately, the assembler will do this for you:
[tkoenig@cfarm120 ~]$ cat foo.s
.file "add.c"
.machine power10
.abiversion 2
.section ".text"
.align 2
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
.localentry foo,1
addi 3,3,0
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.cfi_endproc
[tkoenig@cfarm120 ~]$ gcc -c foo.s
[tkoenig@cfarm120 ~]$ objdump -d foo.o
foo.o: file format elf64-powerpcle
Disassembly of section .text:
0000000000000000 <foo>:
0: 00 00 63 38 addi r3,r3,0
4: ad de 00 06 paddi r3,r3,3735928559
8: ef be 63 38
c: ad de 00 06 paddi r3,r3,3735928559
10: ef be 63 38
14: ad de 00 06 paddi r3,r3,3735928559
18: ef be 63 38
1c: ad de 00 06 paddi r3,r3,3735928559
20: ef be 63 38
24: ad de 00 06 paddi r3,r3,3735928559
28: ef be 63 38
2c: ad de 00 06 paddi r3,r3,3735928559
30: ef be 63 38
34: ad de 00 06 paddi r3,r3,3735928559
38: ef be 63 38
3c: 00 00 00 60 nop
40: ad de 00 06 paddi r3,r3,3735928559
44: ef be 63 38
So, unless you prefer to write direct machine code, this should
not be an issue.
On Sun, 07 Jan 2024 14:30:53 +0000, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> tried to write:
The competitive demands Intel and AMD face - the desires of us as
consumers - are what prevents this from happening, and I see no hope for
the world to change to what might be seen as the path of virtue in this
area.
Nobody forces you to replace your CPU with one with a denser process. If
you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
can get a Raspi 3, where the SoC is made in 40nm (according to
<https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
uses in-order processing as well.
What I wrote didn't contradict what you are saying in your response.
I am not saying that Intel and AMD are forcing us to buy newer and
faster microprocessors. (I could say that *Microsoft* is forcing us
to buy newer and faster microprocessors, by refusing to continue
issuing security updates for Windows 7, or, for that matter,
Windows XP, Windows 98, or even Windows 3.1. Then I would be
disagreeing with you, but I wasn't getting into that part of
the issue.)
Instead, what I wrote said that we, as consumers, are so greedy
for ever faster computers that we are the ones to blame for forcing
Intel and AMD to resort to techniques that require expensive fabs
to make the chips, and that require the chips to have enormous
numbers of transistors for each individual core.
John Savard
Quadibloc wrote:
I am not saying that Intel and AMD are forcing us to buy newer and
faster microprocessors. (I could say that *Microsoft* is forcing us to
buy newer and faster microprocessors, by refusing to continue issuing
security updates for Windows 7, or, for that matter, Windows XP,
Windows 98, or even Windows 3.1. Then I would be disagreeing with you,
but I wasn't getting into that part of the issue.)
I am calling Strawman on this::
I am of the opinion that the SW that arrives with a box/laptop should be
the same over the lifetime of the product. I turn all updating of SW off
and remove power at times I am not using the device to prevent MS from
updating things I DON'T want updated--this includes security patches.
On Sun, 07 Jan 2024 19:21:20 +0000, MitchAlsup wrote:
Quadibloc wrote:
That misses the fact that register hazards aren't the only thing that
OoO execution helps with. It also helps with *cache misses*.
One CAN solve the other hazards with another SB, should one choose.
Now that is something I did not know.
In fact, if I am understanding what you are saying here correctly:
It is possible to design an out-of-order CPU which addresses all the
basic types of register hazard, just as those designed using the
Tomasulo algorithm or those which equivalently use register renaming
instead, by using a modified form of the scoreboard of the Control
Data 6600.
Doing so would be more efficient, as the transistor count would be significantly lower.
....then, of course, my questiion is why isn't this what everyone is
doing already?
I mean, the answer *could* be that:
Only I, Mitch Alsup, know how this can be done. The world will have
to await my patent filing to find out how...
which is, in fact, a fair answer; you deserve to be paid for such
a valuable invention...
but if that _isn't_ the answer, then what the answer could possibly
be that could explain such counter-productive behavior evades me
completely.
John Savard
One has to be careful, as a SB has a quadratic component where Tomasulo
has a (heavy weight) linear component.
On 1/7/2024 3:30 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
So, rather than (V)LIW competing against OoO, maybe it can compete
against in-order superscalar? ...
Not in smartphones, where software compatibility is a required
feature.
In smartphones, the program is typically being AOT'ed from a VM (such as
Dalvik), rather than distributing binaries as native ARM code.
From the POV of a Dalvik style VM, it shouldn't really matter that much.
Even there, the benefits of a common platform means that the industry
is consolidating on ARM; e.g., Philips (now NXP) made the Trimedia
processors (VLIW), but terminated development in 2010. Some users,
such as WD, are defecting to RISC-V to avoid the ARM tax, but RISC-V still
provides a common platform. Are you (or anyone else) able to provide
a VLIW platform that outcompetes ARM and RISC-V?
Trimedia (and the TMS320C6x) line differ partly in that they were true
VLIW, rather than "LIW". So, in this case, I was imagining something
more similar to the ESP32 (LX6) or Qualcomm Hexagon or similar.
But, if RISC-V is run with similar restrictions on the pipeline, for
some of the programs tested (such as Doom), it seems to require
executing around twice as many instructions for a similar amount of work
(*).
Though, this is not true of Dhrystone, where seemingly RISC-V executes
fewer instructions.
Thomas Koenig wrote:
Fortunately, the assembler will do this for you:
[tkoenig@cfarm120 ~]$ cat foo.s
.file "add.c"
.machine power10
.abiversion 2
.section ".text"
.align 2
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
.localentry foo,1
addi 3,3,0
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
paddi 3,3,3735928559
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.cfi_endproc
[tkoenig@cfarm120 ~]$ gcc -c foo.s
[tkoenig@cfarm120 ~]$ objdump -d foo.o
foo.o: file format elf64-powerpcle
Disassembly of section .text:
0000000000000000 <foo>:
0: 00 00 63 38 addi r3,r3,0
4: ad de 00 06 paddi r3,r3,3735928559
8: ef be 63 38
c: ad de 00 06 paddi r3,r3,3735928559
10: ef be 63 38
14: ad de 00 06 paddi r3,r3,3735928559
18: ef be 63 38
1c: ad de 00 06 paddi r3,r3,3735928559
20: ef be 63 38
24: ad de 00 06 paddi r3,r3,3735928559
28: ef be 63 38
2c: ad de 00 06 paddi r3,r3,3735928559
30: ef be 63 38
34: ad de 00 06 paddi r3,r3,3735928559
38: ef be 63 38
3c: 00 00 00 60 nop
40: ad de 00 06 paddi r3,r3,3735928559
44: ef be 63 38
How did 17 adds become 8 ??
So, unless you prefer to write direct machine code, this should
not be an issue.
On 1/7/2024 10:41 PM, George Neuner wrote:
On Sat, 6 Jan 2024 11:48:40 -0600, BGB <cr88192@gmail.com> wrote:
Windows merits:
More software support;
Has nearly all of the games;
No endless fights with trying to get the GPU and sound hardware working; >>> Much less needing to fight with hardware driver issues in general;
- DLLs can have private heaps
Pros/cons it seems. I would have considered this a con.
On 1/8/2024 3:59 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
If even VLIW could not compete with ARM in a certain embedded niche,
why should LIW?
Because:
True VLIW relies heavily on being able to extract a good amount of ILP,
but falls on its face if not enough ILP is available;
A vaguely RISC-like LIW design (similar to ESP32 or similar), can still
be performance competitive even with fairly meager ILP (where
effectively it functions like a normal RISC just with explicit tagging
rather than a superscalar fetch).
But, if RISC-V is run with similar restrictions on the pipeline, for
some of the programs tested (such as Doom), it seems to require
executing around twice as many instructions for a similar amount of work (*).
The design philosophy of RISC-V favours having simple instructions and
combining them in the decoder over providing combined instructions, so
one would expect more executed instructions for RV64G(C) than for ARM
A64, which favours a fixed 32-bit format with instructions that do as
much as fits in 32 bits (i.e., precombined instructions). But if
combining in the decoder works, that does not mean that the programs
take longer to execute with a similarly capable back-end.
Combining stuff in the decoder is expensive though...
I've brought back the 15-bit short instructions, but now only within an alternate or supplementary set of 32-bit instructions.
On Mon, 08 Jan 2024 00:44:23 +0000, MitchAlsup wrote:
Quadibloc wrote:
I am not saying that Intel and AMD are forcing us to buy newer and
faster microprocessors. (I could say that *Microsoft* is forcing us to
buy newer and faster microprocessors, by refusing to continue issuing
security updates for Windows 7, or, for that matter, Windows XP,
Windows 98, or even Windows 3.1. Then I would be disagreeing with you,
but I wasn't getting into that part of the issue.)
I am calling Strawman on this::
I am of the opinion that the SW that arrives with a box/laptop should be
the same over the lifetime of the product. I turn all updating of SW off
and remove power at times I am not using the device to prevent MS from
updating things I DON'T want updated--this includes security patches.
The fact that I am seeing your posts here means that you _have_ tried
connecting a computer to the Internet. Which invalidates the first
response to that which comes to mind.
Perhaps you use Linux or something. But as Windows users know very well
through sad experience, if you don't keep your computer up-to-date,
to patch vulnerabilities in Windows as soon as they are discovered...
your computer could end up infected within minutes of being connected to
the Internet.
John Savard
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
There is a restriction that the prefixed instructions cannot
cross a 64-byte boundary.
Ouch. This means that Power with prefixed instructions is the second
instruction set (after MIPS with its architectural delayed loads)
where concatenating instruction blocks between two labels may result
in invalid code; on all other (~10) instruction sets I looked at this
works fine, including IA-64. Fortunately, for Power that's easy to
fix by compiling with -mno-prefixed,
Or by inserting NOPs in the right places; otherwise you lose the
functionality for Power10.
Fortunately, the assembler will do this for you:
So, unless you prefer to write direct machine code, this should
not be an issue.
On 1/7/2024 3:21 AM, Quadibloc wrote:
But when it comes even to the humble low-end laptop, Intel found it
necessary to redesign their Atom processor to be a lightweight OoO
chip, instead of the in-order design it originally had.
Though, to be fair:
Without OoO, x86 performance is effectively dog-crap.
For many other ISA's, like 64-bit ARM, the performance holds up a lot
better, and the up-front performance gains from in-order to OoO seems to
be comparably smaller.
Since, throw a crappy codegen at an x86, and it will happily accept it
and run at nearly the same speed as the better codegen;
but throw it at
an A53, and one finds that it seemingly performs 3x-5x worse than the
code that GCC produces.
On 1/8/2024 3:59 AM, Anton Ertl wrote:
If all programs used just Dalvik, yes, you would "just" need to write
a Dalvik implementation for your VLIW. But reality is, that there are
enough programs that are written or have components distributed as
native code to make your non-ARM architecture uncompetitive, even with
a working binary translator.
Possibly so.
But, there were things like Atom-based Android devices, and stuff still
worked there as well, so...
The Trimedia (and the TMS320C6x) lines differ partly in that they were
true VLIW, rather than "LIW". So, in this case, I was imagining something
more similar to the ESP32 (LX6) or Qualcomm Hexagon or similar.
If even VLIW could not compete with ARM in a certain embedded niche,
why should LIW?
Because:
True VLIW relies heavily on being able to extract a good amount of ILP,
but falls on its face if not enough ILP is available;
A vaguely RISC-like LIW design (similar to ESP32 or similar), can still
be performance competitive even with fairly meager ILP (where
effectively it functions like a normal RISC just with explicit tagging
rather than a superscalar fetch).
The design philosophy of RISC-V favours having simple instructions and
combining them in the decoder over providing combined instructions, so
one would expect more executed instructions for RV64G(C) than for ARM
A64, which favours a fixed 32-bit format with instructions that do as
much as fits in 32 bits (i.e., precombined instructions). But if
combining in the decoder works, that does not mean that the programs
take longer to execute with a similarly capable back-end.
Combining stuff in the decoder is expensive though...
But, what if one can have something that is at least a little more
performance competitive, but also "free and open" like RISC-V.
But, still kinda "glass cannon" performance on the A53, it seems to
behave like it does something like:
Look at two instructions;
Can we run these in parallel?
If yes, do so.
If no, execute each sequentially.
With full latency penalties if you try to load something and then
immediately do arithmetic on it, ...
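In C, that pairing rule is roughly the following (a minimal sketch of
the model described above; the fields and hazard checks are illustrative
assumptions, not the A53's actual issue logic):

/* Sketch of a strictly in-order dual-issue check (assumed model). */
typedef struct { int dst, src1, src2, unit; } insn_t;

static int independent(const insn_t *a, const insn_t *b)
{
    return b->src1 != a->dst && b->src2 != a->dst  /* no RAW hazard */
        && b->dst  != a->dst                       /* no WAW hazard */
        && b->unit != a->unit;                     /* different pipes */
}

/* How many of the next two instructions issue this cycle? */
static int issue_width(const insn_t *a, const insn_t *b)
{
    return independent(a, b) ? 2 : 1;
}

Add a load-use stall on top of that and you get the "glass cannon"
behaviour: well-scheduled code runs near 2 IPC, naive code at 1 or worse.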
On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:
I've brought back the 15-bit short instructions, but now only within an
alternate or supplementary set of 32-bit instructions.
Not being able to restrain myself when there are yet further depths of wretched excess to be plunged into, I have now added two additional
alternate sets of 32-bit instructions, for a total of three.
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
There is a restriction that the prefixed instructions cannot
cross a 64-byte boundary.
Ouch. This means that Power with prefixed instructions is the second
instruction set (after MIPS with its architectural delayed loads)
where concatenating instruction blocks between two labels may result
in invalid code; on all other (~10) instruction sets I looked at this
works fine, including IA-64. Fortunately, for Power that's easy to
fix by compiling with -mno-prefixed,
Or by inserting NOPs in the right places; otherwise you lose the
functionality for Power10.
The instruction blocks are opaque for this technique, so there is no
way to know where "the right places" would be. And the benefit we get
from code-block copying and everything that builds on it far exceeds
what the prefix instructions are likely to buy. E.g., on Power 10
(numbers are times in seconds):
sieve bubble matrix fib fft
0.075 0.099 0.042 0.110 0.032 with code-block copying
0.181 0.184 0.123 0.230 0.119 without code-block copying
Fortunately, the assembler will do this for you:
It does not, because we copy (binary) machine-code blocks.
So, unless you prefer to write direct machine code, this should
not be an issue.
Yes, we copy machine code.
- anton
Anton Ertl wrote:
[Code-block copying]
The instruction blocks are opaque for this technique, so there is no
way to know where "the right places" would be.
How about for POWER10 prefixed instructions always emit them as
prefix
inst
nop
Then when you copy the code block check the 64B boundary.
If the prefix and inst cross it then move the nop up and prefix,inst down
nop
prefix
inst
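A sketch of that fixup in C, assuming EricP's prefix/inst/nop emission
convention, little-endian instruction words already loaded into a
buffer, and no branches into the shifted slots (a hypothetical helper,
not Power10-validated; note also the objections that follow):

#include <stddef.h>
#include <stdint.h>

#define NOP 0x60000000u                /* ori r0,r0,0 */

static int is_prefix(uint32_t insn)
{
    return (insn >> 26) == 1;  /* Power ISA 3.1 prefix: major opcode 1 */
}

/* code[] holds n words; base is the address code[0] will run at.
   Every prefixed instruction was emitted as prefix, inst, nop. */
void fix_prefix_crossings(uint32_t *code, size_t n, uint64_t base)
{
    for (size_t i = 0; i + 2 < n; i++) {
        if (is_prefix(code[i]) && (base + 4 * i) % 64 == 60) {
            uint32_t p = code[i], x = code[i + 1];
            code[i]     = NOP;         /* nop moves up,           */
            code[i + 1] = p;           /* prefix+inst move down   */
            code[i + 2] = x;           /* into the next 64B block */
            i += 2;
        }
    }
}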
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
[Code-block copying]
The instruction blocks are opaque for this technique, so there is no
way to know where "the right places" would be.
How about for POWER10 prefixed instructions always emit them as
prefix
inst
nop
Then when you copy the code block check the 64B boundary.
If the prefix and inst cross it then move the nop up and prefix,inst down
nop
prefix
inst
As mentioned, the code blocks are opaque to the copying technique; the program that copies knows nothing about the instructions in the code
block, and in particular it would not know whether it contains a
Power3.1 prefix instruction and where.
The difficulty of recognizing a Power Prefix instruction is low: It
has major opcode 1.
However, changing the position of instructions requires handling
relocations in branches, which is probably not what you want to do.
I have to say that your application is the first one I ever
heard about that just pastes binary blobs of executables
together. How do you manage branches which exceed the normal
range (or is this something that cannot happen)?
On Tue, 09 Jan 2024 00:38:35 +0000, Quadibloc wrote:
On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:
I've brought back the 15-bit short instructions, but now only within
an alternate or supplementary set of 32-bit instructions.
Not being able to restrain myself when there are yet further depths of
wretched excess to be plunged into, I have now added two additional
alternate sets of 32-bit instructions, for a total of three.
I have now done something more important: after showing the prefix bits
which make for Composed Instructions, I now show the formats of those instructions themselves on the page
http://www.quadibloc.com/arch/cw010201.htm
If the first bit of an instruction prefix is 1, it will be a leftward
decoded prefix. Leftward decoded prefixes allow opcode space to be
shared between prefixes that make different kinds of modifications to an instruction depending on what kind of instruction it is.
On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:
I've brought back the 15-bit short instructions, but now only within an
alternate or supplementary set of 32-bit instructions.
Not being able to restrain myself when there are yet further depths of wretched excess to be plunged into, I have now added two additional
alternate sets of 32-bit instructions, for a total of three.
the instruction prefixes: the prefix before the now-defined prefix
prefixes the instruction.
When the convert bit is 1, and the prefix bits are 00, let that indicate
that the 16 bits referenced are the start of an _instruction prefix_.
The instruction being prefixed will have to have its own prefix bits set
to 11 all the way through, including at its first 16 bits, so that no
attempt will be made to decode it without taking the instruction prefix
into account.
I have decided to indeed add instruction prefixes for use with variable-length instructions to the instruction set, but *not* to
require the use of the additional header with the "convert" bit
for them.
Instead, instruction prefixes will use some of the unused space at the
end of the opcode space of the 17-bit short instructions.
I do not see any hope for ISA excellence.
Why? MY 66000 exists, and it is excellent.
Though, the real proof would be if it can be implemented effectively on
a typical Spartan or Artix class FPGA and also deliver on some of the
other claims while doing so (and at a decent clock speed).
The idioms recognized in My 66150 core:
CMP Rt,--,-- ; BBit Rt,label
Calk Rd,--,-- ; BCnd Rd,label
LD Rd,[--] ; BCnd Rd,label
ST Rd,[--] ; Calk --,--,--
CALL Label ; BR Label
These all CoIssue (both instructions pass through the pipeline
together).
Sorry, what's "Calk"?
Oh, and what's "BR" (oh, wait, do you mean that the two "Label"s don't
have to be the same, so you're talking about calling Label1 and setting
the return address to Label2? Right, yes, that must be it, sorry for
being dense).
Stefan
In time, something surely will happen to change matters, and new
computer architectures will rise up to prominence. Right now, though,
signs of movement away from x86 to something else are few.
History has shown (RISC-vs-CISC being a prime example) that changes to
the underlying technology affect which ISA performs best.
On 1/11/2024 2:02 AM, Anton Ertl wrote:
Auto-increment:
One has an operation that either needs to do something weird with the
register ports, or, more likely, needs to be decoded as two operations
in the pipeline. It is also comparably infrequent, as "*ptr++" isn't
used *that* often.
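For illustration, the "two operations" view of a post-increment load in
C (a hypothetical micro-op format, not any particular core's):

/* ldr rd, [rn], #imm  =>  rd <- mem[rn] ; rn <- rn + imm */
typedef struct { enum { UOP_LOAD, UOP_ADD } op; int rd, rn, imm; } uop_t;

static int crack_post_index_load(int rd, int rn, int imm, uop_t out[2])
{
    out[0].op = UOP_LOAD; out[0].rd = rd; out[0].rn = rn; out[0].imm = 0;
    out[1].op = UOP_ADD;  out[1].rd = rn; out[1].rn = rn; out[1].imm = imm;
    return 2;     /* one instruction, two register writes (rd and rn) */
}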
MitchAlsup wrote:
Chris M. Thomasson wrote:
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
container. Only aligned containers possess ATOMIC-smelling properties.
This is so obviously correct that you should not have needed to mention
it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
updates is something that should only ever be done for testing purposes.
Though, as I see it, 64-bit ARM still has a few concerning features in
these areas:
Auto-increment addressing;
ALU status-flag bits;
...
Auto-increment:
One has an operation that either needs to do something weird with the
register ports, or, more likely, needs to be decoded as two operations
in the pipeline.
It is also comparably infrequent, as "*ptr++" isn't
used *that* often.
ALU status flags:
The flags themselves are fairly rarely used in practice,
but the cost of
keeping these sorts of flags consistent in the pipeline is not so cheap.
And, possibly the cost difference between a 1-bit status flag and, say,
4 or 5 flag bits, isn't that large. In either case, may make sense to
limit which instructions may update flags (unlike x86)
and possibly only
allow them in "lane 1" or whatever the equivalent is (the secondary ALUs
only doing non-flags-updating forms).
BGB <cr88192@gmail.com> writes:
Though, as I see it, 64-bit ARM still has a few concerning features in
these areas:
Auto-increment addressing;
ALU status-flag bits;
...
Both are features that the architects of MIPS (and its descendents,
including Alpha and RISC-V) considered so concerning that they do not
feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
were sure that they could implement these features efficiently at
bearable cost.
... clearly, I'll need to take another look at the MIPS and/or the
Alpha to see what it is that they _are_ doing, to understand how it
fits into the RISC philosophy.
Oh, silly me. I remembered shortly after: what
is combined with a branch to make a conditional
branch is not an operate instruction, but a
test of the contents of a specified register.
That's clearly basic enough to fit with RISC,
but if the carry out from an operation is what
you want to test, then awkwardness ensues.
On Mon, 13 Nov 2023 16:10:20 +0100, Terje Mathisen wrote:
MitchAlsup wrote:
Chris M. Thomasson wrote:
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
container. Only aligned containers possess ATOMIC-smelling properties.
This is so obviously correct that you should not have needed to mention
it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
updates is something that should only ever be done for testing purposes.
While older machines used an "exchange" instruction for something
atomic, the IBM 360 had the "Test and Set" instruction which had a single-byte operand, to avoid the issue.
However, qualifications are needed to make the statement "obviously
correct". Basically, one should never attempt an atomic operation on
an unaligned value in memory... on a machine that does paging. Because
the unaligned value _might_ cross a page boundary.
Otherwise, there's no problem. And a computer certainly _could_ be
aware that precautions are needed for atomic instructions, and
proceed with their execution only after all the memory pages involved
were brought into memory, and locked there. That would still mean
the computer would be slowed unnecessarily, but error-free operation
can be guaranteed.
So if someone wanted, they could design a computer which didn't mind
atomic operations on unaligned values all that much.
John Savard
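The single-byte test-and-set idea survives today as C11's atomic_flag,
the one atomic type the standard guarantees to be lock-free; a minimal
spinlock sketch:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) { while (atomic_flag_test_and_set(&lock)) ; /* spin */ }
void release(void) { atomic_flag_clear(&lock); }

Being (in practice) a single byte, it never straddles anything, which
is exactly the property the /360's TS operand had.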
As even the PDP-8 had auto-increment addressing, certainly
its cost must be bearable, if cost is thought of as the
number of transistors required to implement it.
Although
that feature seems odd for a RISC design even to me.
To my mind, ALU status bits are at least essential
for things like add-with-carry for multiple-precision
arithmetic.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Yes. A64 seems to be designed to "do something weird with the
register ports"; it has instructions that write three registers and >>instructions that read four registers.
It has instructions that read or write 8 registers (64 bytes).
BGB <cr88192@gmail.com> writes:
Though, as I see it, 64-bit ARM still has a few concerning features in
these areas:
Auto-increment addressing;
ALU status-flag bits;
...
Both are features that the architects of MIPS (and its descendents,
including Alpha and RISC-V) considered so concerning that they do not
feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
were sure that they could implement these features efficiently at
bearable cost.
Auto-increment:
One has an operation that either needs to do something weird with the
register ports, or, more likely, needs to be decoded as two operations
in the pipeline.
Yes. A64 seems to be designed to "do something weird with the
register ports"; it has instructions that write three registers and instructions that read four registers.
It is also comparably infrequent, as "*ptr++" isn't
used *that* often.
Out of 192106 instructions in /bin/bash (in Debian 11), there are 1829
instructions with pre-increment "]!"; most of them are stp (store
pair) instructions, and the increment is usually negative and often
smaller than the size of the two registers, the address register is
usually sp. So the usual use seems to be for saving caller-saved or callee-saved registers.
Out of these 195815 instructions, 3002 use post-increment "],"; most
of them are ldp (load pair) instructions, and the increment is usually positive, and the address register is usually sp (2688 cases). So
most of these cases seem to be due to loading caller-saved registers
after the call or callee-saved registers before the return.
Overall, there are 25197 loads and stores that use sp as address
register, out of 61387 loads and stores.
[a76:~:536] objdump -d /bin/bash|grep "^ "|wc -l
192106
[a76:~:537] objdump -d /bin/bash|grep ']!'|wc -l
1829
[a76:~:538] objdump -d /bin/bash|grep '[[][a-z].*],'|wc -l
3002
[a76:~:539] objdump -d /bin/bash|grep 'sp],'|wc -l
2688
[a76:~:540] objdump -d /bin/bash|grep '[[]sp'|wc -l
25197
[a76:~:541] objdump -d /bin/bash|grep '[[][a-z]'|wc -l
61387
ALU status flags:
The flags themselves are fairly rarely used in practice,
Conditional branches tend to be quite frequent.
but the cost of
keeping these sorts of flags consistent in the pipeline is not so cheap.
Intel uses as many physical flags registers as physical integer
registers (280 each on Tigerlake and Golden Cove), ARM somewhat less
than the integer registers. And the register renamer needs to keep
track of them separately (for AMD64 it needs to keep track of C, O,
and NZP separately; I expect that A64 is better in this respect).
Yes, not cheap, but obviously manageable.
And, possibly the cost difference between a 1-bit status flag and, say,
4 or 5 flag bits, isn't that large. In either case, may make sense to
limit which instructions may update flags (unlike x86)
Actually, updating all flags in every instruction would make the implementation easier, too: every flag-using instruction would only be
able to use the result of the previous instruction, no need to store
flags longer. However, I guess it might be harder to program with
such a model.
My take is that GPRs should have additional carry and overflow flags
(which are not stored and loaded with the usual store and load
instructions); they have the information of the N and Z flags already.
This makes tracking the flags easy, and also allows programs to deal
with multiple live carry flags, as needed for multi-precision
multiplication.
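A toy model of that suggestion (sketch only; gpr_t and addc are
invented names, and the code assumes a compiler with __int128). N and Z
need no storage because they are recomputable from the value itself:

#include <stdint.h>

typedef struct { uint64_t val; unsigned c : 1, v : 1; } gpr_t;

/* addc rd,ra,rb: add ra + rb + ra's saved carry, record flags in rd */
static void addc(gpr_t *rd, const gpr_t *ra, const gpr_t *rb)
{
    unsigned __int128 s = (unsigned __int128)ra->val + rb->val + ra->c;
    rd->val = (uint64_t)s;
    rd->c   = (unsigned)(s >> 64);                      /* carry out */
    rd->v   = (unsigned)(((ra->val ^ rd->val) &
                          (rb->val ^ rd->val)) >> 63);  /* signed overflow */
}

Two interleaved multi-precision additions then simply use two separate
register chains, with no single shared flags register to fight over.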
and possibly only
allow them in "lane 1" or whatever the equivalent is (the secondary ALUs
only doing non-flags-updating forms).
That's an implementation issue. Do ARM A64 implementations have such restrictions?
- anton
As for ALU status flag bits, I think they're a feature that
should be kept. But one concern with them relates to the
same reason that some early RISC architectures had branch
delay slots.
So a common way in which this concern is mitigated is for
RISC architectures to include, in instructions that can
affect the condition codes, a bit that controls whether or
not they do so. That way, other operate instructions can
be placed between an instruction that sets the condition
codes and the branch instruction that tests them.
The PowerPC architecture went further, also perhaps
addressing another concern with ALU status bits, by
having multiple sets of condition codes, so that the
condition codes would behave more like registers,
rather than being a unique resource.
To my mind, ALU status bits are at least essential
for things like add-with-carry for multiple-precision
arithmetic. Otherwise, one would need multiple
awkward instructions to perform the same function.
And since RISC typically has only load and store
memory-reference instructions, thus limiting each
instruction to one basic action, a design that
apparently forces operate instructions to be
combined with conditional branch instructions
seems to be the opposite of RISC. I presume that
they _don't_ solve the problem by including a
conditional skip in operate instructions, that
can skip over a jump instruction that follows
them (sort of like a PDP-8!)... clearly, I'll
need to take another look at the MIPS and/or
the Alpha to see what it is that they _are_
doing, to understand how it fits into the
RISC philosophy.
John Savard
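For the record, what MIPS, Alpha, and RISC-V actually do is recompute
the carry with an unsigned compare (sltu on MIPS and RISC-V, cmpult on
Alpha); in C, a two-limb add-with-carry looks like this (minimal
sketch):

#include <stdint.h>

/* (hi:lo) += (bhi:blo), no flags register required */
void add128(uint64_t *lo, uint64_t *hi, uint64_t blo, uint64_t bhi)
{
    *lo += blo;
    uint64_t carry = (*lo < blo);   /* unsigned wraparound => carry out */
    *hi += bhi + carry;
}

So it is not a conditional skip but an extra ALU instruction per limb;
awkward next to a real add-with-carry, but it keeps every result in an
ordinary register.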
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Yes. A64 seems to be designed to "do something weird with the
register ports"; it has instructions that write three registers and
instructions that read four registers.
It has instructions that read or write 8 registers (64 bytes).
Yes, I think there are crypto instructions or somesuch that handle a
lot of registers in one instruction, and I guess that they take
multiple cycles, so accessing many registers can be distributed across
multiple cycles.
These aren't crypto. They are intended to allow atomic
64-byte transactions initiated by the cpu, generally to
on-chip coprocessors (it's called FEAT_LS64).
They use eight consecutive registers.
But apparently ARM, Apple, and others have found ways to implement
auto-increment at bearable cost.
On Tue, 19 Dec 2023 07:22:10 +0000, Quadibloc wrote:
On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:
I changed where, in the opcode space, the supplementary
memory-reference instructions were located. This allowed me to have a
few more bits available for them.
I've moved them again, making even more space available... because in my
last change, I made the mistake of using the opcode space that I was
already using for block headers. I couldn't reduce the amount of
information in a block header by two bits, by using a combination of ten
bits instead of eight to indicate a block header, so I had to do my
rearranging in this place instead.
And now, with what I've learned from this experience, I've made further changes. I've increased the length of the opcode field in the supplementary memory-reference instructions that were moved to be among the other memory-reference instructions, so as to have enough for the different
sizes of the various types to be supported.
But in addition, I have now engaged in what some may see as an act of
pure evil.
Once again there are supplementary memory-reference instructions among
the operate instructions as well. *These*, however, provide, for the
conventional integer and floating-point types, CISC-style
memory-to-register operate instructions! So even within the basic
32-bit instruction set, although _these_ instructions are highly
restricted in register use and addressing modes, the pretense of being
a load-store architecture has been dropped!
Stefan Monnier wrote:
MitchAlsup [2024-01-10 22:32:50] wrote:
The idioms recognized in My 66150 core:
CMP Rt,--,-- ; BBit Rt,label
Calk Rd,--,-- ; BCnd Rd,label
LD Rd,[--] ; BCnd Rd,label
ST Rd,[--] ; Calk --,--,--
CALL Label ; BR Label
These all CoIssue (both instructions pass through the pipeline
together).
Sorry, what's "Calk"?
A calculation instruction {ADD, AND, ...}
MitchAlsup [2024-01-10 22:32:50] wrote:
Stefan Monnier wrote:
MitchAlsup [2024-01-10 22:32:50] wrote:
The idioms recognized in My 66150 core:
Sorry, what's "Calk"?
A calculation instruction {ADD, AND, ...}
Hmmm, what's the benefit of co-issuing an ST with a Calk?

Stefan
RISC-V has enough inelegance that considering it a model of
perfection implies, in my opinion, significant noobiness (or
perhaps what I might consider poor taste).
Courage often is doing the right thing even when it is (by all
rational examination) pointless.
While older machines used an "exchange" instruction for something
atomic, the IBM 360 had the "Test and Set" instruction which had a single-byte operand, to avoid the issue.
On Fri, 12 Jan 2024 04:09:30 -0000 (UTC), Quadibloc wrote:
While older machines used an "exchange" instruction for something
atomic, the IBM 360 had the "Test and Set" instruction which had a
single-byte operand, to avoid the issue.
That assumes it doesn’t create a new issue. Like some 16-bit architecture
I remember from the 1980s that could not do single-byte bus cycles. So
writing a byte involved a read-modify-write sequence of bus operations.
Try doing that atomically ...
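To make the hazard concrete, here is roughly what such a byte store
costs on a bus without byte enables (illustrative C, assuming
little-endian byte lanes, not any specific machine):

#include <stdint.h>

void store_byte(volatile uint16_t *mem, uint32_t addr, uint8_t b)
{
    volatile uint16_t *wp = mem + (addr >> 1);
    uint16_t w = *wp;                                  /* bus read cycle */
    w = (addr & 1) ? (uint16_t)((w & 0x00FFu) | ((uint16_t)b << 8))
                   : (uint16_t)((w & 0xFF00u) | b);
    *wp = w;                                           /* bus write cycle */
}

Another bus master can slip between the read and the write, so even a
one-byte test-and-set is no longer atomic unless the bus is locked
across both cycles.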
On 1/10/2024 7:03 PM, Chris M. Thomasson wrote:
On 11/13/2023 7:10 AM, Terje Mathisen wrote:
Actually, it was experimented with wrt artificially triggering a bus
lock on Intel via unaligned access and dummy LOCK RMW (iirc) to
implement a user-space RCU wrt remote memory barriers. Dave Dice comes
to mind. I am having trouble trying to find the god damn paper! I know I
read it before.
I need to point out that the unaligned access that would trigger an
actual bus lock is when the access straddles an L2 cache line wrt the
LOCK'ed RMW.
On 2/16/2024 1:05 PM, Chris M. Thomasson wrote:
On 2/16/2024 7:20 AM, Scott Lurndal wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 1/10/2024 7:03 PM, Chris M. Thomasson wrote:
On 11/13/2023 7:10 AM, Terje Mathisen wrote:
Actually, it was experimented with wrt artificially triggering a bus
lock on Intel via unaligned access and dummy LOCK RMW (iirc) to
implement a user-space RCU wrt remote memory barriers. Dave Dice comes
to mind. I am having trouble trying to find the god damn paper! I know I
read it before.
I need to point out that the unaligned access that would trigger an
actual bus lock is when the access straddles an L2 cache line wrt the
LOCK'ed RMW.
You don't actually _need_ an unaligned access to trigger an actual
bus lock - if you can arrange for sufficient contention to a single
line, the processor may eventually grab the bus lock to make forward
progress.
True, but I think a LOCK'ed RMW on unaligned memory that straddles a
cache line triggers one right off the bat? There was something called
QPI that abused this to get remote memory barriers. I got a response a
while back from my friend Dmitry Vyukov that we both read the paper but
it seems to have been taken down. Dave Dice comes to mind.
I remember the Q in QPI was for quiescence.