• Concertina II Progress

    From Quadibloc@21:1/5 to All on Wed Nov 8 21:33:59 2023
    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.

    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.

    So what I had done was, after squeezing as much as I could into a basic instruction format, I provided for switching into alternate instruction
    formats which made different compromises by using the block headers.

    This has now been dropped. Since I managed to get the normal (unaligned) memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises
    in the basic instruction set, it wasn't needed to have multiple instruction formats.

    I had to change the instructions longer than 32 bits to get them in the
    basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Thu Nov 9 00:43:27 2023
    On 11/8/2023 3:33 PM, Quadibloc wrote:
    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.


    Ironically, I am getting slightly better reach on average with (scaled)
    9-bit (and 10) bit displacements than RISC-V gets with 12 bits...

    Say:
    DWORD:
    12s, Unscaled: +/- 2K
    9u, 4B Scale : + 2K
    10s, 4B Scale: +/- 2K (XG2)
    QWORD:
    12s, Unscaled: +/- 2K
    9u, 8B Scale : + 4K
    10s, 8B Scale: +/- 4K (XG2)
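    As a minimal sketch of that reach arithmetic in C (function names and the
    choice of QWORD scaling are illustrative only, not actual decoder logic):

    #include <stdint.h>

    /* RISC-V-style: 12-bit signed, unscaled displacement. */
    static uint64_t ea_disp12s(uint64_t base, int32_t disp12s) {
        return base + (int64_t)disp12s;          /* reach: -2048 .. +2047 bytes */
    }

    /* Scaled style: 9-bit unsigned displacement, scaled by the 8-byte element. */
    static uint64_t ea_disp9u_qword(uint64_t base, uint32_t disp9u) {
        return base + ((uint64_t)disp9u << 3);   /* reach: 0 .. 511*8 = 4088 bytes */
    }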

    It was a pretty tight call between 10s and 10u, but 10s won out by a
    slight margin mostly because the majority of structs and stack-frames
    tend to be smaller than 4K (but, does create an incentive to use larger
    storage formats for on-stack storage).

    Though, for integer immediate instructions, RISC-V would have a slight advantage. Where, say, roughly 9% of 3R integer immediate values miss
    with the existing Imm9u/Imm9n scheme; but the sliver of "Misses with 9
    bits, but would hit with 12 bits", is relatively small (most of the
    "miss" cases are much larger constants).

    However, a fair chunk of these "miss" cases, could be handled with a bit-set/bit-clear instruction, say:
    y=x|0x02000000;
    z=x&0xFDFFFFFF;
    Turning into, say:
    BIS R4, 25, R6
    BIC R4, 25, R7
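    A small sketch of the kind of check a compiler could use to spot these
    cases (hypothetical helper, not taken from any particular compiler):

    #include <stdint.h>

    /* Returns the bit index if exactly one bit of m is set, else -1. */
    static int single_bit_index(uint32_t m) {
        if (m == 0 || (m & (m - 1)) != 0)
            return -1;
        int i = 0;
        while (!(m & 1)) { m >>= 1; i++; }
        return i;
    }

    /* x | 0x02000000: single_bit_index(0x02000000)  == 25 -> BIS Rs, 25, Rd
       x & 0xFDFFFFFF: single_bit_index(~0xFDFFFFFF) == 25 -> BIC Rs, 25, Rd */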

    Unclear if this case is quite common enough to justify adding these instructions though (granted, a case could be made for them).


    However, a few cases do typically need larger displacements:
    PC relative, such as branches.
    GBR relative, namely constant loads.


    For PC relative, 20-bits is "mostly enough", but one program has hit the
    20-bit limit (+/- 1MB). Recently, via a tweak, in current forms of the
    ISA, the effective branch-displacement limit (for a 32-bit instruction
    form) has been increased to 23 bit (+/- 8MB).
    Baseline+XGPR: Unconditional BRA and BSR only.
    Conditional branches still limited to 20 bits.
    XG2: Also includes conditional branches.

    In these cases, it was mostly because the bits that were being used to
    extend the GPRs to 6 bits were N/A for their original purpose with
    branch-ops, and these could be repurposed for the displacement. The main other alternatives would have been 22 bits + an alternate link register, or a
    3-bit LR field; however, the cost of supporting this would have been
    higher than that of reassigning them simply towards making the
    displacement bigger.

    Potentially a similar role could have been served by a conjoined "MOV
    LR, R1 | BSR Disp" instruction (and/or allowing "MOV LR, R1" in Lane 2
    as a special case for this, even if it would not otherwise be allowed
    within the ISA rules). Though, would defeat the point if this encoding
    foils the branch predictor.



    Recently, had ended up adding some Disp11s Compare-with-Zero branches,
    mostly as these branches turn out to be useful (in the face of 2-cycle
    CMPxx), and 8 bits "wasn't quite enough". Say, Disp11s can cover a much
    bigger if/else block or loop body (+/- 2K) than Disp8s (+/- 256B).


    For GBR Relative:
    The default 9-bit displacement was Byte scaled (for "reasons");
    But, a 512B range isn't terribly useful;
    Later forms ended up with Disp10u Scaled:
    This gives 4K or 8K of range (in Baseline)
    This increases to 8K and 16K in XG2.


    If the compiler sorts primitive global variables by descending-usage
    (and emits the top N specially, at the start of ".data"), then the
    Scaled GBR cases can access a majority of the global variables (around
    75-80% with a scaled 10-bit displacement).
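    As a rough sketch of that heuristic (the type and field names here are made
    up, and alignment details are glossed over):

    #include <stdlib.h>

    typedef struct { const char *name; int size; int use_count; int gbr_offset; } global_t;

    static int by_usage_desc(const void *a, const void *b) {
        return ((const global_t *)b)->use_count - ((const global_t *)a)->use_count;
    }

    /* Pack the most-used primitive globals into the window at the start of
       ".data" that the scaled GBR-relative forms can reach; the rest fall
       back to longer sequences. */
    static void assign_gbr_window(global_t *g, int n, int window_bytes) {
        qsort(g, n, sizeof *g, by_usage_desc);
        int offset = 0;
        for (int i = 0; i < n; i++) {
            if (offset + g[i].size <= window_bytes) {
                g[i].gbr_offset = offset;   /* reachable via short GBR-relative form */
                offset += g[i].size;
            } else {
                g[i].gbr_offset = -1;       /* needs Jumbo / 2-op / 3-op sequence */
            }
        }
    }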

    Effectively, the remaining 20-25% or so need to be handled as one of:
    Jumbo Disp33s (if Jumbo prefixes are available, most profiles);
    2-op Disp25s (no jumbo, '.data'+'.bss' less than 16MB).
    3-op Disp33s (else).


    Though, as with the stack frames, these instructions do create an
    incentive to effectively promote any small global variables to a larger
    storage type (such as 'char' or 'short' to 'int'); just with implicit
    sign (or zero) extensions to preserve the expected behavior of the
    smaller type (though, strictly speaking, only zero-extensions would be
    required by the C standard, given signed overflow is technically UB; but
    there would be something "deeply wrong" with a 'char' variable being
    able to hold, say, -4495213, or similar).

    Though, does mean for normal variables, "just use int or similar" is
    typically faster (say, because there are dedicated 32-bit sign and zero extending forms of some of the common ALU ops, but not for 8 or 16 bit
    cases).


    A Disp16u case could maybe reach 256K or 512K, which could cover much of
    a combined data+bss section. While in theory this could be better, to
    make effective use of this would require effectively folding much of
    ".bss" into ".data", which is not such a good thing for the program
    loader (as opposed to merely folding the top N most-used variables into ".data").

    Then again, uninitialized global arrays could probably still be left in
    ".bss", which tend to be the main "bulking factor" for this section (as
    opposed to normal variables).




    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.


    Yeah.

    If you want a Load/Store to have two 5 bit registers and a 16-bit
    displacement, only 6 bits are left in a 32-bit instruction word. This
    is, not a whole lot...

    For a full set of Load/Store ops, this is 4 bits;
    For a set of basic ALU ops, this is another 3 bits.

    So, just for Load/Store and basic ALU ops, half the encoding space is
    gone...

    Would it be worth it?...



    So what I had done was, after squeezing as much as I could into a basic instruction format, I provided for switching into alternate instruction formats which made different compromises by using the block headers.

    This has now been dropped. Since I managed to get the normal (unaligned) memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises in the basic instruction set, it wasn't needed to have multiple instruction formats.

    I had to change the instructions longer than 32 bits to get them in the
    basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.


    Such is a long standing issue...


    I am also annoyed sometimes at how complicated my design has gotten.
    Still, it is within reason, and not too far outside the scope of many
    existing RISC's.

    But, as noted, the reason XG2 exists as-is was sort of a compromise:
    I couldn't come up with any encoding which could actually give
    everything I wanted, and the "most practical" option was effectively to
    dust off an idea I had originally rejected:
    Having an alternate encoding which dropped 16-bit ops in favor of
    reusing these bits for more GPRs.


    At first glance, RISC-V seems cleaner and simpler, but this falls on its
    face once one goes outside the scope of RV64IM or similar.

    And, it isn't tempting when, at least from my POV, RV64 seems "less
    good" than what I have already (others may disagree; but at least to me,
    some parts of RISC-V's design seem to me like kind of a trash fire).

    The main tempting thing the RV64 has is that, maybe, if one goes and
    implements RV64GC and clones a bunch of SiFive's hardware interfaces,
    then potentially one can run a mainline Linux on it.

    There have apparently been some people that have gotten NOMMU Linux
    working on RV32IM targets, which is possible (and, ironically, seemingly
    basing these on the SuperH branch in the Linux kernel from what I had
    seen...).


    Seemingly, AMD/Xilinx is jumping over from MicroBlaze to an RV32
    variant. But, granted, RV32 isn't too far from what MicroBlaze is
    typically used for, so not really a huge stretch.

    I sometimes wonder if maybe I would be better off jumping to RV, but
    then I end up seeing examples where cores running at somewhat higher
    clock speeds still manage to deliver relatively poor framerates in Doom.


    Like, as-is, my MIPS scores are kinda weak, but I am still getting
    around 30 fps in Doom at around 20-24 MIPS.

    RV64IM seemingly needs significantly higher MIPS to get similar
    framerates in Doom.

    Say, for Doom:
    BJX2 needs ~ 800k instructions / frame;
    RV64IM seemingly needs nearly 2 million instructions / frame.

    Not entirely sure what all is going on, but I have my suspicions.

    Though, it does seem to be the inverse situation with Dhrystone.

    Say:
    BJX2: around 1.3 DMIPS per BJX2 instruction;
    RV64: around 3.8 DMIPS per RV64 instruction.

    Though, I can note that there seems to be "something weird" with
    Dhrystone and GCC (in multiple scenarios, GCC gives Dhrystone scores
    that are significantly above what could be "reasonably expected", or
    which agree with the scores given by other compilers, seemingly as-if it
    is optimizing away a big chunk of the benchmark...).

    But, these results don't typically extend to other programs (where
    scores are typically much closer together).


    Actually, I have noted that if comparing BGBCC with MSVC and BJX2 with
    my Ryzen, performance relations seem to scale pretty close to linearly relative to clock-speed, albeit with some outliers.

    There are cases where deviation has been noted:
    Speed differences for TKRA-GL's software rasterizer backend are smaller
    than the difference in clock-speed (74x clock-speed delta; 20x fill-rate delta);
    And cases where it is bigger: The performance delta for things like LZ4 decompression or some of my image codecs is somewhat larger than the clock-speed delta (say: 74x clock-speed delta, 115x performance delta, *1).


    *1: Though, LZ4 still operates near memcpy() speed in both cases; issue
    is mostly that, relative to MHz, my BJX2 core has comparably slower
    memory access.

    Albeit somehow, this trend reverses for my early 2000s laptop, which has
    slower RAM access. However, the SO-DIMM is 4x the width (64b vs 16b),
    and 133MHz vs 50MHz; and this leads to a theoretical 10.64x ratio, which
    isn't too far off from the observed memcpy() performance of the laptop.

    So, laptop has 10.64x faster RAM, relative to 28x more MHz.


    Whereas, say, my Ryzen has 2.64x more MHz (3.7 vs 1.4), but around 40x
    more memory bandwidth (12.7x for single-thread memcpy).



    Well, and if I did jump over to RV64, it would render much of what I
    am doing entirely moot.

    I *could* do a dedicated RV64 core, but would be unlikely to make it "notable"
    enough to be worthwhile.

    So, it seems like my options are either:
    Continue on doing stuff mostly as is;
    Drop it and probably go off to doing something else entirely.

    ...




    But, don't have much else better to be doing, considering the typically
    "meh" response to most of my 3D engine attempts. And my general
    lackluster skills towards most types of "creative" endeavors (I suspect "affective alexithymia" probably doesn't help too much for artistic expression).

    Well, and I have also recently noted other oddities, for example:
    It seems I may have "reverse slope hearing loss", and my hearing is
    seemingly notably poor for sounds much lower than about 1.5 or 2kHz (lower-frequency sine waves are nearly inaudible, but I can still hear square/triangle/sawtooth waves well; most of what I perceive as
    low-frequency sounds seemingly being based on higher-frequency harmonics
    of those sounds).

    So, say:
    2kHz..4kHz, loud, heard easily;
    4kHz..8kHz, also heard readily;
    8..15kHz, fades away and disappears.
    But, OTOH, for sine waves:
    1kHz: much quieter than 2kHz
    500Hz: fairly mild at full volume
    250Hz: relatively quiet
    125Hz: barely audible.


    But, for sounds much under around 200Hz, I can feel the vibrations, and
    can associate these with sound (but, this effect is not localized to
    ears, also works with hands and similar; this effect seems strongest at
    around 50-100 Hz, but has a lower range of around 6-8Hz, below this
    point, feeling becomes less sensitive to it, but visual perception can
    take over at this point).


    I can take audio and apply a fairly aggressive 2kHz high-pass filter
    (say, -48 dB per octave, applied several times), and for the most part it doesn't sound that much different, though does sound a little more
    tinny. This "tinny" effect is reduced with a 1kHz high-pass filter.

    Most of what I had perceived as low-frequency sounds are still present
    even after the filtering (despite being entirely absent in a spectrum plot). Zooming in generally shows patterns of higher-frequency vibrations
    following similar patterns to the low-frequency vibrations, which
    seemingly I perceive "as" the low-frequency vibration.


    And, in all this, I hadn't noticed that anything was amiss until looking
    into it for other reasons.



    I am left to wonder if some of this could be related to my preference
    for the sound of ADPCM compression over that of MP3 at lower quality
    levels (low bitrate MP3 sounds particularly awful, whereas ADPCM tends
    to fare better; but seemingly other people disagree).


    Does possibly explain some other past difficulties:
    I can make a noise and hear the walls within a room;
    But, trying to hit a metal tank to determine how much sand was in the
    tank by hearing, was quite a bit more difficult (best I could do was hit
    the tank, and then try to hear what parts of the tank had reduced echo;
    but results were pretty mixed as the sand level did not significantly
    change the echoes).

    Apparently, it turns out, people were listening for "thud" vs "not
    thud", but like, I couldn't really hear this part, and wasn't even
    really aware there should be a "thud" (or even really what a "thud"
    sounds like apart from the effects of, say, something hitting a chunk of
    wood; hitting a sand-filled steel tank with a rubber mallet was nearly
    silent, but, knuckles or tapping it with a screwdriver was easier to
    hear, ...).


    Well, also can't really understand what anyone is saying over the phone
    (as the phone reduces everything to difficult to understand muffled noises).

    Or, like the sound-effects in Wolfenstein 3D, which are theoretically voice
    clips saying stuff, but come across more as things like "aaaa uunn" or "aaaauuuu"
    or "uu aa uu" or similar, owing to the poor audio quality.

    Well, and my past failures to achieve any kind of intelligibility in
    past experiments messing with formant synthesis.

    And some experiments with vocoder like designs, noting that I could
    seemingly discard pretty much everything much below 500Hz or 1kHz
    without much ill effect; but theoretically there is "relevant stuff" in
    these frequency ranges. Didn't really think of much at the time (it
    seemed like all of this was a "bass frequency" range where the combined
    amplitude of everything could be averaged together and treated like a
    single channel).

    Had noted that, one thing that did sort of work, was, say:
    Split the audio into 32 frequency bands;
    Pick the top 2 or 3 bands, ignoring low-frequency or adjacent bands;
    Say, anything below 1kHz is ignored.
    Record the band number and relative volume.

    Then, regenerate waveforms at each of these bands with the measured
    volume (along with alternate versions spread across different octaves;
    it worked better if higher power-of-2 frequencies were also synthesized,
    albeit at lower intensities). Get back "mostly intelligible" speech.

    IIRC, had mostly used 32 bands spread across 2 octaves (say, 1-2 kHz and 2-4kHz, or 2-4 kHz and 4-8 kHz).
    Can also mix in sounds from the same relative position in other octaves.

    Seemed to have best results with mostly evenly-spread frequency bands.
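    A rough sketch of just the band-picking step, under assumptions (32
    precomputed band magnitudes, a cutoff band for the ~1 kHz floor, and
    adjacent-band suppression); the resynthesis/octave-spreading part is not
    shown:

    #define NBANDS 32

    typedef struct { int band; float level; } band_pick_t;

    /* Pick up to 3 of the loudest bands at or above min_band, skipping bands
       adjacent to ones already picked; level is relative to the loudest pick. */
    static int pick_top_bands(const float mag[NBANDS], int min_band, band_pick_t out[3])
    {
        int used[NBANDS] = {0};
        int npicked = 0;
        for (int k = 0; k < 3; k++) {
            int best = -1;
            for (int i = min_band; i < NBANDS; i++) {
                if (used[i]) continue;
                if (i > 0 && used[i - 1]) continue;
                if (i + 1 < NBANDS && used[i + 1]) continue;
                if (best < 0 || mag[i] > mag[best]) best = i;
            }
            if (best < 0 || mag[best] <= 0.0f) break;
            used[best] = 1;
            out[npicked].band = best;
            out[npicked].level = mag[best] / mag[out[0].band];
            npicked++;
        }
        return npicked;
    }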


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Thu Nov 9 18:50:37 2023
    Quadibloc <quadibloc@servername.invalid> schrieb:

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you
    31 bits. You're left with one bit of opcode, one for load and
    one for store.

    The /360 had 12 bits for three registers plus 12 bits of offset, so
    24 bits left eight bits for the opcode (the RX format).

    So, if you want to do this kind of thing, why not go for a full 32-bit
    offset in a second 32-bit word?

    [...]

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.

    Have you ever written an assembler for your ISA?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Thomas Koenig on Thu Nov 9 21:38:31 2023
    On Thu, 09 Nov 2023 18:50:37 +0000, Thomas Koenig wrote:

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you 31
    bits. You're left with one bit of opcode, one for load and one for
    store.

    Yes, and obviously that isn't enough. So I do have to make some
    compromises.

    The offset is 16 bits, because the 68000 (and the 8086, and others) had
    16-bit offsets!

    But the base and index registers are each specified by only 3 bits - only
    the destination register gets a 5-bit field.

    I need 5 bits for the opcode. That lets me have load and store for four floating-point types, load, store, unsigned load, and insert for four
    integer types (the largest one only uses load and store).

    So it is doable! 5 plus 5 plus 3 plus 3 equals 16, so I have 16 bits left
    for the offset.
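    Laid out as a field diagram (one possible ordering of the fields, purely
    for illustration):

    | opcode: 5 | dest: 5 | index: 3 | base: 3 | displacement: 16 |  = 32 bits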

    But that leaves only 1/4 of the opcode space. Which would be fine for a conventional RISC design, as that's plenty for the operate instructions.
    But I needed to reserve _half_ the opcode space, because I needed another
    1/4 of the opcode space for putting two 16-bit instructions in a 32-bit
    word for more compact code.

    That led me to look for compromises... and I found some that would not
    overly impair the effectiveness of the memory reference instructions,
    which I discussed previously. I ended up using _both_ of two alternatives
    each of which alone would have given me the needed savings in opcode
    space... that way, the compromised memory-reference instructions could be accompanied by another complete set of memory-reference instructions with
    _no_ compromise... except for only being able to specify aligned operands.

    The /360 had 12 bits for three registers plus 12 bits of offset, so 24
    bits left eight bits for the opcode (the RX format).

    Oh, yes, I remember it well.

    So, if you want to do this kind of thing, why not go for a full 32-bit
    offset in a second 32-bit word?

    Because the 360 only took 32 bits for a memory-reference instruction,
    using 32 bits for one is sinfully wasteful!

    I want to "have my cake and eat it too" - to have a computer that's just
    as good as a Power PC or a 68000 or a System/360, even though they have different, incompatible, strengths that conflict with a computer being
    able to be good at what each of them is good at simultaneously.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Thu Nov 9 21:42:43 2023
    On Thu, 09 Nov 2023 21:38:31 +0000, Quadibloc wrote:

    I want to "have my cake and eat it too" - to have a computer that's just
    as good as a Power PC or a 68000 or a System/360, even though they have different, incompatible, strengths that conflict with a computer being
    able to be good at what each of them is good at simultaneously.

    Actually, it's worse than that, since I also want the virtues of processors like the TMS320C2000 or the Itanium.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Thu Nov 9 21:51:31 2023
    On Thu, 09 Nov 2023 15:36:12 -0600, BGB-Alt wrote:
    On 11/9/2023 12:50 PM, Thomas Koenig wrote:

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you 31
    bits. You're left with one bit of opcode, one for load and one for
    store.


    Oh, that is even worse than I understood it as, namely:
    LDx Rd, (Rs, Disp16)
    ...

    But, yeah, 1 bit of opcode clearly wouldn't work...

    And indeed, he is correct, that is what I'm trying to do.

    But I easily solve _most_ of the problem.

    I just use 3 bits for the index register and the base register.

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    16-bit register-to-register instructions use eight bits to specify their
    source and destination registers, so both registers must be from the same
    group of eight registers.

    This lends itself to writing code where four distinct threads are
    interleaved, helping pipelining in implementations too cheap to have out-of-order execution.

    The index register can be one of registers 1 to 7 (0 means no indexing).

    The base register can be one of registers 25 to 31. (24, or a 0 in the three-bit base register field, indicates a special addressing mode.)

    This sort of is reminiscent of System/360 coding conventions.

    The special addressing modes do stuff like using registers 17 to 23 as
    base registers with a 12 bit displacement, so that additional short
    segments can be accessed.

    As I noted, shaving off two bits each from two fields gives me four more
    bits, and five bits is exactly what I need for the opcode field.

    Unfortunately, I needed one more bit, because I also wanted 16-bit instructions, and they take up too much space. That led me... to some interesting gyrations, but I finally found a compromise that was
    acceptable to me for saving those bits, so acceptable that I could drop
    the option of using the block header to switch to using "full" instructions instead. Finally!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Thu Nov 9 22:11:41 2023
    On Thu, 09 Nov 2023 21:42:43 +0000, Quadibloc wrote:

    On Thu, 09 Nov 2023 21:38:31 +0000, Quadibloc wrote:

    I want to "have my cake and eat it too" - to have a computer that's
    just as good as a Power PC or a 68000 or a System/360, even though they
    have different, incompatible, strengths that conflict with a computer
    being able to be good at what each of them is good at simultaneously.

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    And don't forget the Cray-I.

    So the idea is to have *one* ISA that will serve for...

    embedded microcontrollers,
    data-base servers,
    desktop workstations, and
    HPC supercomputers.

    Of course, these different tasks will require different implementations,
    which focus on doing parts of the ISA well.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB-Alt@21:1/5 to Thomas Koenig on Thu Nov 9 15:36:12 2023
    On 11/9/2023 12:50 PM, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you
    31 bits. You're left with one bit of opcode, one for load and
    one for store.


    Oh, that is even worse than I understood it as, namely:
    LDx Rd, (Rs, Disp16)
    ...

    But, yeah, 1 bit of opcode clearly wouldn't work...


    The /360 had 12 bits for three registers plus 12 bits of offset, so
    24 bits left eight bits for the opcode (the RX format).

    So, if you want to do this kind of thing, why not go for a full 32-bit
    offset in a second 32-bit word?


    Originally, I had turned any displacements that didn't fit into 9 bits
    into a 2-op sequence:
    MOV Imm25s, R0
    MOV.x (Rb, R0), Rn

    Actually, worse yet, the first form of BJX2 only had 5-bit Load/Store displacements, but it didn't take long to realize that 5 bits wasn't
    really enough (say, when roughly 2/3 of the load and store operations
    can't fit in the displacement).


    But, now, there are Jumbo-encodings, which can encode a full 33-bit displacement in a 64-bit encoding. Not everything is perfect though,
    mostly because these encodings are bigger and can't be used in a bundle.

    But, still "less bad" in this sense than my original 48-bit encodings,
    where "for reasons", these couldn't co-exist with bundles in the same
    code block.

    Despite the loss of 48-bit ops though:
    The jumbo encodings give larger displacements (33s vs 24u or 17s);
    They reuse the existing 32-bit decoders, rather than needing a dedicated
    48-bit decoder.


    But, yeah, "use another instruction word" if one needs a larger
    displacement, is mostly the option that I would probably recommend.


    At first, the 5-bit encodings went away, but later came back as a zombie
    of sorts (cases emerged where their existence was still valuable).

    But, then it later came down to a tradeoff (with the design of XG2):
    Do I expand the Disp9u to Disp10u, and then keep with the XGPR encoding
    of using the Disp5u encodings to encode a Disp6s case (for a small range
    of negative displacements), or expand Disp9u to Disp10s?...

    In this case, Disp10s won out by a small margin, as I needed non-trivial negative displacements at least slightly more often than I needed 8K for structs and stack frames and similar.


    But, for most things, a 16-bit displacement would be a waste...
    If I were going to go the route of using a signed 12-bit displacement
    (like RISC-V), would probably still keep it scaled though, as 8K/16K is
    still more useful than 2K.


    Branch displacements are typically still hard-wired to a 2-byte scale though, partly
    as the ISA started out with 16-bit ops, and switching XG2 over to 4-byte
    scale would have broken its symmetry with the Baseline ISA.


    Though, could pull a cheap trick and repurpose the LSB of branch ops in
    XG2, given as-is, it is effectively "Must Be Zero" (all instructions
    have a 32-bit alignment in this mode, and branches to an odd address are
    not allowed).

    So, the idea of a BSR that uses R1 as an alternate Link-Register is
    still not (entirely) dead (while at the same time allowing for the
    '.text' section to be expanded to 8MB).


    There are 64-bit Disp33s and Abs48 branch encodings, but, yeah, they
    have costs:
    They are 64-bit vs 32-bit, thus, bigger;
    Are ignored by the branch predictor, thus, slower;
    The Abs48 case is not PC relative
    Using it within a program requires a base reloc;
    Is generally useful for DLL imports and special cases though (*1).

    *1: Its existence is mostly as an alternative in these cases to a more expensive option:
    MOV Addr64, R1
    JMP R1
    Which needs 128-bits, and is also ignored by the branch predictor.


    [...]

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.

    Have you ever written an assembler for your ISA?

    Yeah, whether someone can write an assembler, or disassembler/emulator,
    and not drive themselves insane in the attempt, is possibly a test of
    "sanity".

    Granted, still not foolproof, as it isn't that bad to write an assembler/disassembler for x86 either, but trying to decode it in
    hardware would be nightmarish.

    Best guess I can have would be a "preclassify" stage:
    If this is an opcode byte, how long will it be, and will a Mod/RM
    follow, ...?
    If this is a Mod/RM byte, how many bytes will this add.

    Then in theory, one can figure instruction length like:
    Fetch OpLen for IP;
    Fetch Mod/RM len for IP+OpLen if Mod/RM flag is set;
    Add OpLen+ModRmLen.
    Add an extra 2/4 bytes if an Immed is present for this opcode.
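    A sketch of that length walk in C, with hypothetical per-byte tables (the
    table names and field widths are made up for illustration):

    #include <stdint.h>

    typedef struct {
        uint8_t op_len;      /* bytes of prefixes+opcode starting at this byte */
        uint8_t has_modrm;   /* 1 if a Mod/RM byte follows the opcode */
        uint8_t imm_len;     /* 0, 2, or 4 immediate bytes for this opcode */
    } opclass_t;

    /* modrm_len[b] = bytes added by the Mod/RM byte at b (itself + SIB + disp). */
    static uint32_t insn_length(const opclass_t *tag, const uint8_t *modrm_len, uint32_t ip)
    {
        uint32_t len = tag[ip].op_len;
        if (tag[ip].has_modrm)
            len += modrm_len[ip + tag[ip].op_len];
        len += tag[ip].imm_len;
        return len;
    }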

    Nicer to not bother.


    For my 75 MHz experiment, did end up adding a similar sort of
    "preclassify" logic to deal with instruction-lengths though, at the cost
    that now L1 I$ cache-lines are specific to the operating mode in which
    they were fetched (which now needs to be checked along with the address
    and similar).

    Mostly all this is a case of "looking up 4 bits of tag metadata" being
    less latency than "feed 9 bits of instruction bits through some LUTs"
    (or 12 bits if RISC-V decoding is enabled). There is still some latency
    due to MUX'ing and similar, but this part is unavoidable.

    So, former case:
    8 bits: Classify BJX2 instruction length;
    1 bit: Specify Baseline or XG2.
    Latter case:
    8 bits: Classify BJX2 instruction length;
    2 bits: Classify RISC-V instruction length (16/32)
    2 bits: Specify Baseline, XG2, RISC-V, or XG2RV.

    Which map to 4 bits (IIRC):
    (0): 16-bit
    (1): (WEX && WxE) || Jumbo
    (2): WEX
    (3): Jumbo


    As-is, after MUX'ing, this can effectively turn op-len determination
    into a 4 or 6 bit lookup, say (tag bits 1:0 for two adjacent 32-bit words):
    00zz: 32-bit
    01zz: 16-bit
    1000: 64-bit
    1001: 48-bit (unused)
    1010: 96-bit (*)
    1011: Invalid
    11zz: Invalid

    *: Here, we just assume that the 3rd instruction word's tag is 00.
    Would actually need to check this if either 4-wide bundles or 80-bit
    encodings were "actually a thing".
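    The same table, written out as a lookup in C (the tag packing and return
    convention here are assumptions; the 'z' bits are treated as don't-care):

    /* tag4 bits 3:2 = first 32-bit word's tag, bits 1:0 = the next word's tag. */
    static int op_len_bits(unsigned tag4)
    {
        switch (tag4 >> 2) {
        case 0: return 32;               /* 00zz */
        case 1: return 16;               /* 01zz */
        case 2:
            switch (tag4 & 3) {
            case 0:  return 64;          /* 1000 */
            case 1:  return 48;          /* 1001, unused */
            case 2:  return 96;          /* 1010, assumes 3rd word's tag is 00 */
            default: return -1;          /* 1011, invalid */
            }
        default: return -1;              /* 11zz, invalid */
        }
    }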

    Where, handling both XG2 and WXE (WEX Enable) in the preclassify step
    greatly simplifies the logic during instruction fetch.

    This could, in principle, be reduced further in an "XG2 only" core, or to
    a lesser extent by eliminating the original XGPR scheme. These are not currently planned though (say, the first-stage lookup width could be
    reduced from 8 to 5 or 7 bits).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB-Alt@21:1/5 to Quadibloc on Thu Nov 9 17:49:03 2023
    On 11/9/2023 3:51 PM, Quadibloc wrote:
    On Thu, 09 Nov 2023 15:36:12 -0600, BGB-Alt wrote:
    On 11/9/2023 12:50 PM, Thomas Koenig wrote:

    So, r1 = r2 + r3 + offset.

    Three registers is 15 bits plus a 16-bit offset, which gives you 31
    bits. You're left with one bit of opcode, one for load and one for
    store.


    Oh, that is even worse than I understood it as, namely:
    LDx Rd, (Rs, Disp16)
    ...

    But, yeah, 1 bit of opcode clearly wouldn't work...

    And indeed, he is correct, that is what I'm trying to do.

    But I easily solve _most_ of the problem.

    I just use 3 bits for the index register and the base register.

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.


    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    Unless, maybe, registers were being treated like a stack, but even then,
    this is still gonna suck.

    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.


    Theoretically, 32 registers should be "pretty good", but I ended up with
    64 partly due to arguable weakness in my compiler's register allocation.

    Say, 64 makes it possible to statically assign most of the variables in most
    of the functions, which avoids the need for spill and fill (at least
    with a register allocator that isn't smart enough to locally assign
    registers across basic-block boundaries).

    I am not sure if a more clever compiler (such as GCC) could also find
    ways to make effective use of 64 GPRs.


    I guess, IA-64 did have 128 registers in banks of 32. Not sure how well
    this worked.


    16-bit register-to-register instructions use eight bits to specify their source and destination registers, so both registers must be from the same group of eight registers.


    When I added R32..R63, I ended up not bothering adding any way to access
    them from 16-bit ops.

    So:
    R0..R15: Generally accessible for all of 16-bit land;
    R16..R31: Accessible from a limited subset of 16-bit operations.
    R32..R63: Inaccessible from 16-bit land.
    Only accessible for an ISA subset for 32-bit ops in XGPR.

    Things are more orthogonal in XG2:
    No 16-bit ops;
    All of the 32-bit ops can access R0..R63 in the same way.


    This lends itself to writing code where four distinct threads are interleaved, helping pipelining in implementations too cheap to have out-of-order execution.


    Considered variations on this in my case as well, just with static
    control flow.

    However, BGBCC is nowhere near clever enough to pull this off...

    Best that can be managed is doing this sort of thing manually (this is
    sort of how "functions with 100+ local variables" are born).

    In theory, a compiler could infer when blocks of code or functions are
    not sequentially dependent and inline everything and schedule it in
    parallel, but alas, this sort of thing requires a bit of cleverness that
    is hard to pull off.


    The index register can be one of registers 1 to 7 (0 means no indexing).

    The base register can be one of registers 25 to 31. (24, or a 0 in the three-bit base register field, indicates a special addressing mode.)

    This sort of is reminiscent of System/360 coding conventions.


    OK.


    The special addressing modes do stuff like using registers 17 to 23 as
    base registers with a 12 bit displacement, so that additional short
    segments can be accessed.

    As I noted, shaving off two bits each from two fields gives me four more bits, and five bits is exactly what I need for the opcode field.

    Unfortunately, I needed one more bit, because I also wanted 16-bit instructions, and they take up too much space. That led me... to some interesting gyrations, but I finally found a compromise that was
    acceptable to me for saving those bits, so acceptable that I could drop
    the option of using the block header to switch to using "full" instructions instead. Finally!


    A more straightforward encoding would make things, more straightforward...


    Main debates I think are, say:
    Whether to start with the MSB of each word (what I had often done);
    Or, start from the LSB (like RISC-V);
    Whether 5 or 6 bit register fields;
    How many bits for immediate and opcode fields;
    ...

    Bundling and predication may eat a few bits, say:
    00: Scalar
    01: Bundle
    10/11: If-True / If-False

    In my case, this did leave an ugly hack case to support conditional ops
    in bundles. Namely, the instruction to "Load 24 bits into R0" has
    different interpretations in each case (Scalar: Load 24 bits into R0;
    Bundle: Jumbo Prefix; If-True/If-False, repeat a different instruction
    block, but understood as both conditional and bundled).

    This could be fully orthogonal with 3 bits, but it seems, this is a big ask:
    000, Unconditional, Scalar
    001, Unconditional, Bundle
    010, Special, Scalar (Eg: Large constant load or Branch)
    011, Special, Bundle (Eg: Jumbo Prefix)
    100, If-True, Scalar
    101, If-True, Bundle
    110, If-False, Scalar
    111, If-False, Bundle


    This leads to a lopsided encoding though, and it seems like things only
    really fit together nicely with a limited combination of sizes.

    Say, for an immediate field:
    24+ 9 => 33s
    24+24+16 => 64
    This is almost magic...

    Though:
    26+ 7 => 33s
    26+26+12 => 64
    Could also work.
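    As a sketch of how those widths compose (only the widths are taken from the
    above; the field placement is a guess, not the actual encoding):

    #include <stdint.h>

    /* 24-bit prefix + 9-bit base immediate -> 33-bit signed value. */
    static int64_t imm33(uint32_t j24, uint32_t imm9) {
        uint64_t v = ((uint64_t)(j24 & 0xFFFFFF) << 9) | (imm9 & 0x1FF);
        if (v & (1ULL << 32))
            v |= ~((1ULL << 33) - 1);    /* sign-extend from bit 32 */
        return (int64_t)v;
    }

    /* Two 24-bit prefixes + 16-bit base immediate -> full 64-bit value. */
    static uint64_t imm64(uint32_t j24_hi, uint32_t j24_lo, uint32_t imm16) {
        return ((uint64_t)(j24_hi & 0xFFFFFF) << 40) |
               ((uint64_t)(j24_lo & 0xFFFFFF) << 16) |
               (imm16 & 0xFFFF);
    }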


    But, does end up with an ISA layout where immediate values are mostly 7u
    or 7n, which is not nearly as attractive as 9u and 9n.

    Say, for Load/Store displacement hit (rough approximations, from memory):
    5u: 35%
    7u: 65%
    9u: 90%
    ...


    All turns into a bit of an annoying numbers game sometimes...


    But, this ended up as part of why I ended up with XG2, which didn't give
    me everything I wanted, and the encodings of some things do have more
    "dog chew" than I would like (I would have preferred if everything were
    nice contiguous fields, rather than the bits for each register field
    being scattered across the instruction word).

    But, the numbers added up in a way that worked better than most of the alternatives I could come up with (and happened to also be the "least
    effort" implementation path).


    Granted, I still keep half expecting people to be like "Dude, just jump
    onto the RISC-V wagon...".

    Or, failing this, at least implement enough of RISC-V to be able to run
    Linux on it (but, this would require significant architectural changes;
    being able to run a "stock" RV64GC Linux build would effectively require partially cloning a bunch of SiFive's architectural choices or similar;
    which is not something I would be happy with).

    But, otherwise, pretty much any other option in this area would still
    mean a porting effort...


    Well, and the on/off consideration of trying to port a BSD variant, as
    BSD seemed like potentially less effort (there is far less implicit
    assumptions of GNU related stuff being used).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 10 01:11:13 2023
    Quadibloc wrote:

    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <
    So what I had done was, after squeezing as much as I could into a basic instruction format, I provided for switching into alternate instruction formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <
    This has now been dropped. Since I managed to get the normal (unaligned) memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises in the basic instruction set, it wasn't needed to have multiple instruction formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.
    <
    I had to change the instructions longer than 32 bits to get them in the
    basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <
    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Quadibloc on Fri Nov 10 00:29:00 2023
    In article <uijjoj$2dc2i$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    What do you consider the virtues of Itanium to be?

    No company ever seems to have taken it up on technical grounds, only as a result of Intel and HP persuading commercial managers that it would
    become widely used owing to their market power.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to John Dallman on Fri Nov 10 04:31:45 2023
    On Fri, 10 Nov 2023 00:29:00 +0000, John Dallman wrote:

    In article <uijjoj$2dc2i$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    What do you consider the virtues of Itanium to be?

    Well, I think that superscalar operation of microprocessors is a good
    thing. Explicitly indicating which instructions may execute in parallel
    is one way to facilitate that. Even if the Itanium was an unsuccessful implementation of that principle.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Fri Nov 10 04:37:16 2023
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    It's only in the 16-bit operate instructions that this splitting of
    registers is actively present as a constraint. It is needed to make
    16-bit operate instructions possible.

    So the cure is that if a compiler finds this too much trouble, it
    doesn't have to use the 16-bit instructions.

    Of course, if compilers can't use them, that raises the question of
    whether 16-bit instructions are worth having. Without them, the
    complications that I needed to be happy about my memory-reference
    instructions could have been entirely avoided.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Thu Nov 9 22:19:48 2023
    On 11/9/2023 7:11 PM, MitchAlsup wrote:
    Quadibloc wrote:


    Good to see you are back on here...


    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <

    Yes.

    As noted a few times, as I see it, 9..12 bits is sufficient.
    Much less than 9 is "not enough", much more than 12 is wasting entropy,
    at least for 32-bit encodings.


    12u-scaled would be "pretty good", say, being able to handle 32K for
    QWORD ops.


    So what I had done was, after squeezing as much as I could into a basic
    instruction format, I provided for switching into alternate instruction
    formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <

    Also yes.


    This has now been dropped. Since I managed to get the normal (unaligned)
    memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises
    in the basic instruction set, it wasn't needed to have multiple instruction
    formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the problem is so small as to be compelling.
    <

    In my case, it is only for 128-bit load/store operations, which require
    64-bit alignment.

    Well, and an esoteric edge case:
    if((PC&0xE)==0xE)
    You can't use a 96-bit encoding, and will need to insert a NOP if one
    needs to do so.


    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
    Fast memcpy;
    LZ decompression;
    Huffman;
    ...


    I had to change the instructions longer than 32 bits to get them in
    the basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the
    pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <

    Mostly similar.
    Though, I guess some people could debate this in my case.


    Granted, I specify the entire ISA in a single location, rather than
    spreading it across a bunch of different documents (as was the case with RISC-V).

    Well, and where there is a lot that is left up to the specific hardware implementations in terms of stuff that one would need to "actually have
    an OS run on it", ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Nov 10 04:43:14 2023
    On Fri, 10 Nov 2023 01:11:13 +0000, MitchAlsup wrote:

    I never had any aligned memory references. The HW overhead to "fix" the problem is so small as to be compelling.

    Since I have a complete set of memory-reference instructions for which unaligned accesses are supported, the problem isn't
    that I think unaligned fetches and stores take too many gates.

    Rather, being able to only specify aligned accesses saves *opcode space*,
    which lets me fit in one complete set of memory-reference instructions that
    can use all the base registers, all the index registers, and always use all
    the registers as destination registers.

    While the unaligned-capable instructions, which also offer important
    additional addressing modes, had to have certain restrictions to fit in.

    So they use six out of the seven index registers, they can use only half
    the registers as destination registers on indexed accesses, and they use
    four out of the seven base registers.

    Having 16-bit instructions for the possibility of more compact code meant
    that I had to have at least one of the two restrictions noted above -
    having both restrictions meant that I could offer the alternative of aligned-only instructions with neither restriction, which may be far less painful for some.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Fri Nov 10 00:46:43 2023
    On 11/9/2023 10:37 PM, Quadibloc wrote:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    It's only in the 16-bit operate instructions that this splitting of
    registers is actively present as a constraint. It is needed to make
    16-bit operate instructions possible.


    FWIW: I went with 16-bit ops with 4-bit register fields (with a small
    subset with 5-bit register fields).

    Granted, layout was different than SH:
    zzzz-nnnn-mmmm-zzzz //typical SH layout
    zzzz-zzzz-nnnn-mmmm //typical BJX2 layout

    Where, as noted, typical 32-bit layout in my case is:
    111p-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ
    And, in XG2:
    NMOP-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ




    I guess, a "minor" reorganization might yield, say:
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R)
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R)
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI, Imm10)
    PwZZ-ZZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI, Imm10)
    PwZZ-ZZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)
    PwZZ-ZZZZ-iiii-iiii-iiii-iiii-iiii-iiii (Imm24)

    Which seems like actually a relatively nice layout thus far...


    Possibly, going further:
    Pw00-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R Space)
    Pw00-1111-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R Space)

    Pw01-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (Ld/St Disp10)

    Pw10-0ZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI Imm10, ALU Block)
    Pw10-1ZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI Imm10)

    Pw11-0ZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)

    Pw11-1110-iiii-iiii-iiii-iiii-iiii-iiii BRA Disp24s (+/- 32MB)
    Pw11-1111-iiii-iiii-iiii-iiii-iiii-iiii BSR Disp24s (+/- 32MB)

    1111-111Z-iiii-iiii-iiii-iiii-iiii-iiii Jumbo


    Though, might almost make sense for PrWEX to be N/E, as the PrWEX blocks
    seem to be infrequently used in BJX2 (basically, for predicated
    instructions that exist as part of an instruction bundle).

    Say:
    Scalar: 77.3%
    WEX : 8.9%
    Pred : 13.5%
    PrWEX : 0.3%


    So the cure is that if a compiler finds this too much trouble, it
    doesn't have to use the 16-bit instructions.

    Of course, if compilers can't use them, that raises the question of
    whether 16-bit instructions are worth having. Without them, the
    complications that I needed to be happy about my memory-reference instructions could have been entirely avoided.


    For performance optimized cases, I am starting to suspect 16-bit ops are
    not worth it.

    For size optimization, they make sense; but size optimization also means
    mostly confining register allocation to R0..R15 in my case, with
    heuristics for when to enable additional registers, where enabling the
    higher registers effectively hinders the use of 16-bit instructions.


    The other option I have found is that, rather than optimizing for
    smaller instructions (as in an ISA with 16 bit instructions), one can
    instead optimize for doing stuff in as few instructions as it is
    reasonable to do so, which in turn further goes against the use of
    16-bit instructions.


    And, thus far, I am ending up building a lot of my programs in XG2 mode
    despite the slightly worse code density (leaving the main "hold outs"
    for the Baseline encoding mostly being the kernel and Boot ROM).

    The kernel could go over to XG2 without too much issue, mostly leaving
    the Boot ROM. Switching over the ROM would require some functional
    tweaks (coming out of reset in a different mode), as well as probably
    either increasing the size of the ROM or removing some stuff (building
    the Boot ROM as-is in XG2 mode would exceed the current 32K limit).


    Granted, the main things the ROM contains are a bunch of boot-time sanity
    check stuff, a RAM counter, FAT32 driver, and stuff to init the graphics
    module (such as a Boot-time ASCII font, *).

    *: Though, this font saves some space by only encoding the ASCII-range characters, and packing the character glyphs into 5*6 pixels (allowing
    32-bits, rather than the 64-bits needed for an 8x8 glyph). This won out aesthetically over using a 7-segment or 14-segment font (as well as it
    taking more complex logic to unpack 7 or 14 segment into an 8x8
    character cell).
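    A sketch of what unpacking such a glyph might look like (the bit order and
    centering are assumptions; only "5x6 pixels in 32 bits" is from the above):

    #include <stdint.h>

    /* Expand a 5x6 glyph packed into the low 30 bits of 'packed' into an
       8x8 cell (one byte per row, one bit per pixel). */
    static void unpack_glyph_5x6(uint32_t packed, uint8_t cell[8]) {
        for (int row = 0; row < 8; row++)
            cell[row] = 0;                                /* blank padding rows */
        for (int row = 0; row < 6; row++) {
            uint32_t bits = (packed >> (row * 5)) & 0x1F; /* 5 pixels per row */
            cell[row + 1] = (uint8_t)(bits << 2);         /* pad out to 8 columns */
        }
    }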

    Where, say, unlike a CGA or VGA, the initial font is not held in a
    hardware ROM. There was originally, but it was cheaper to manage the
    font in software, effectively using the VRAM as a plain color-cell
    display in text mode.

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Quadibloc on Fri Nov 10 14:51:44 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    As soon as you make 'general purpose registers' not 'general'
    you've significantly complicated register allocation in compilers
    and likely caused additional memory accesses due to the need to
    spill registers unnecessarily.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 10 18:26:20 2023
    Quadibloc wrote:

    On Fri, 10 Nov 2023 00:29:00 +0000, John Dallman wrote:

    In article <uijjoj$2dc2i$1@dont-email.me>, quadibloc@servername.invalid
    (Quadibloc) wrote:

    Actually, it's worse than that, since I also want the virtues of
    processors like the TMS320C2000 or the Itanium.

    What do you consider the virtues of Itanium to be?

    Itanic's main virtue was to consume several Intel design teams, over 20
    years, preventing Intel from taking over the entire µprocessor market.

    I, personally, don't believe in exposing the scalarity to the compiler,
    nor the rotating register file to do what renaming does naturally,
    nor the lack of proper FP instructions (FDIV, SQRT), ...

    Academic quality at industrial prices.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 10 18:29:56 2023
    Quadibloc wrote:

    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    It's only in the 16-bit operate instructions that this splitting of
    registers is actively present as a constraint. It is needed to make
    16-bit operate instructions possible.

    So the cure is that if a compiler finds this too much trouble, it
    doesn't have to use the 16-bit instructions.
    <
    Then why are they there ??
    <
    I think you will find (like RISC-V is) that having and not mandating use
    means you get a bit under ½ of what you think you are getting.
    <
    Of course, if compilers can't use them, that raises the question of
    whether 16-bit instructions are worth having. Without them, the
    complications that I needed to be happy about my memory-reference instructions could have been entirely avoided.
    <
    There is a subset of RISC-V designers who want to discard the 16-bit
    subset in order to solve the problems of the 32-bit set.
    <
    I might note: given the space of the compressed ISA in RISC-V, I could
    install the entire My 66000 ISA and then not need any of the RISC-V
    ISA.....
    <
    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Fri Nov 10 12:24:08 2023
    On 11/10/2023 8:51 AM, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    As soon as you make 'general purpose registers' not 'general'
    you've significantly complicated register allocation in compilers
    and likely caused additional memory accesses due to the need to
    spill registers unnecessarily.

    Yeah.

    Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather
    suck".

    Or, even smaller cases, like, "most instructions can use all the
    registers, but these ops only work on a subset" is kind of an annoyance
    (this is a big part of why I bothered with the whole XG2 thing).


    Much better to have a big flat register space.


    Though, within reason.
    Say:
    * 8: Pain, can barely hold anything in registers.
    ** One barely has enough for working values for expressions, etc.
    * 16: Not quite enough, still lots of spill/fill.
    * 32: Can work well, with a good register allocator;
    * 64: Can largely eliminate spill/fill, but a little much.
    * 128: Too many.
    * 256: Absurd.

    So, say, 32 and 64 seem to be the "good" area, where with 32, a majority
    of the functions can sit comfortably with most or all of their variables
    held in registers. But, for functions with a large number of variables
    (say, 100 or more), spill/fill becomes an issue (*).

    Having 64 allows a majority of functions to use a "static assign
    everything" strategy, where spill/fill can be eliminated entirely (apart
    from the prolog/epilog sequences), and otherwise seems to deal better
    with functions with large numbers of variables.


    *: And is more of a pain with a register allocator design which can't
    keep any non-static-assigned values in registers across basic-block
    boundaries. This issue is, ironically, less obvious with 16 registers
    (since spill/fill runs rampant anyways). But having nearly every basic
    block start with a blob of stack loads, and end with a blob of stores,
    only to reload them all again on the other side of a label, is fairly
    obvious.

    Having 64 registers does at least mostly address this issue...


    Meanwhile, for 128, there aren't really enough variables and temporaries
    in most functions to make effective use of them. Also, 7-bit register
    fields won't fit easily into a 32-bit instruction word.


    As for register arguments:
    * Probably 8 or 16.
    ** 8 makes the most sense with 32 GPRs.
    *** 16 is asking too much.
    *** 8 deals with around 98% of functions.
    ** 16 makes sense with 64 GPRs.
    *** Nearly all functions can use exclusively register arguments.
    *** Gain is small though, if it only benefits 2% of functions.
    *** It is almost a "shoo-in", except for the cost of fixed spill space
    *** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
    *** Though, an ABI could decide to not have a spill space in this way.

    Though, admittedly, for a lot of my programs I had still ended up going
    with 8 register arguments with 64 GPRs, mostly as the gain from 16
    arguments is small relative to the cost of spending an additional 64
    bytes in nearly every stack frame (and also there are still some
    unresolved bugs when using 16-argument mode).

    ...



    Current leaning is also that:
    32-bit primary instruction size;
    32/64/96 bit for variable-length instructions;
    Is "pretty good".

    In performance-oriented use cases, 16-bit encodings "aren't really worth
    it".
    In cases where you need a 32 or 64 bit value, being able to encode them
    or load them quickly into a register is ideal. Spending multiple
    instructions to glue a value together isn't ideal, nor is needing to
    load it from memory (this particularly sucks from the compiler POV).


    As for addressing modes:
    (Rb, Disp) : ~ 66-75%
    (Rb, Ri) : ~ 25-33%
    Can address the vast majority of cases.

    Displacements are most effective when scaled by the size of the element
    type, as unaligned displacements are exceedingly rare. The vast majority
    of displacements are also positive.
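
    (A minimal sketch of what the two modes above compute, assuming the
    scaled-displacement convention described earlier, i.e. the displacement
    counts elements rather than bytes; whether the index register is also
    scaled is ISA-specific and is assumed here purely for illustration.)

    #include <stdint.h>

    uint64_t ea_disp(uint64_t rb, uint32_t disp, unsigned elem_size)
    {   return rb + (uint64_t)disp*elem_size;   }   /* (Rb, Disp) */

    uint64_t ea_index(uint64_t rb, uint64_t ri, unsigned elem_size)
    {   return rb + ri*elem_size;   }               /* (Rb, Ri)   */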

    Not having a register-indexed mode is shooting oneself in the foot, as
    these are "not exactly rare".

    Most other possible addressing modes can be mostly ignored.
    Auto-increment becomes moot if one has superscalar or VLIW;
    (Rb, Ri, Disp) is only really applicable in niche cases
    Eg, array inside struct, etc.
    ...



    RISC-V did sort of shoot itself in the foot in several of these areas,
    albeit with some workarounds in "Bitmanip":
    SHnADD, can mimic a LEA, allowing array access in fewer ops.
    PACK, allows an inline 64-bit constant load in 5 instructions...
    LUI+ADD+LUI+ADD+PACK
    ...

    Still not ideal...
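
    (To make that 5-instruction sequence concrete: a rough C model of what it
    computes, with each 32-bit half built by LUI plus an add, modeled here as
    a sign-extended 32-bit value, and assuming Zbkb PACK semantics on RV64,
    i.e. rd = { rs2[31:0], rs1[31:0] }. Function names are mine.)

    #include <stdint.h>

    static uint64_t lui_add(uint32_t half)   /* models LUI+ADD for one half */
    {   return (uint64_t)(int64_t)(int32_t)half;   }

    static uint64_t pack_rv64(uint64_t rs1, uint64_t rs2)   /* models PACK */
    {   return (rs1 & 0xFFFFFFFFull) | (rs2 << 32);   }

    static uint64_t load_const64(uint64_t k)
    {
        uint64_t lo = lui_add((uint32_t)k);           /* instructions 1-2 */
        uint64_t hi = lui_add((uint32_t)(k >> 32));   /* instructions 3-4 */
        return pack_rv64(lo, hi);                     /* instruction 5    */
    }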

    An extra cycle for memory access is not ideal for a close second-place
    addressing mode; nor are 64-bit constants rare enough that one
    necessarily wants to spend 5 or so clock cycles on them.

    But, still better than the situation where one does not have these instructions.

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Nov 10 18:22:43 2023
    BGB wrote:

    On 11/9/2023 7:11 PM, MitchAlsup wrote:
    Quadibloc wrote:


    Good to see you are back on here...


    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <

    Yes.

    As noted a few times, as I see it, 9 .. 12 is sufficient.
    Much less than 9 is "not enough", much more than 12 is wasting entropy,
    at least for 32-bit encodings.
    <
    Can you suggest something I could have done by sacrificing 16-bits
    down to 12-bits that would have improved "something" in my ISA ??
    {{You see I did not have any trouble in having all 16-bits for MEM references--just like having 16-bits for integer, logical, and branch offsets.}}
    <
    12u-scaled would be "pretty good", say, being able to handle 32K for
    QWORD ops.
    <
    IBM 360 found so, EMBench is replete with stack sizes and struct sizes
    where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit... Exactly the difference between 12-bits and 14-bits....

    So what I had done was, after squeezing as much as I could into a basic
    instruction format, I provided for switching into alternate instruction
    formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <

    Also yes.


    This has now been dropped. Since I managed to get the normal (unaligned)
    memory-reference instruction squeezed into so much less opcode space that
    I also had room for the aligned memory-reference format without compromises
    in the basic instruction set, it wasn't needed to have multiple instruction
    formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.
    <

    In my case, it is only for 128-bit load/store operations, which require 64-bit alignment.
    <
    VVM does all the wide stuff without necessitating the wide stuff in
    registers or instructions.
    <
    Well, and an esoteric edge case:
    if((PC&0xE)==0xE)
    You can't use a 96-bit encoding, and will need to insert a NOP if one
    needs to do so.
    <
    Ehhhhh...
    <
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
    Fast memcpy;
    LZ decompression;
    Huffman;
    ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <

    I had to change the instructions longer than 32 bits to get them in
    the basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the
    pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <

    Mostly similar.
    Though, I guess some people could debate this in my case.


    Granted, I specify the entire ISA in a single location, rather than
    spreading it across a bunch of different documents (as was the case with RISC-V).

    Well, and where there is a lot that is left up to the specific hardware implementations in terms of stuff that one would need to "actually have
    an OS run on it", ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 10 12:48:10 2023
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/9/2023 7:11 PM, MitchAlsup wrote:
    Quadibloc wrote:


    Good to see you are back on here...


    Some progress has been made in advancing a small step towards sanity
    in the description of the Concertina II architecture described at

    http://www.quadibloc.com/arch/ct17int.htm

    As Mitch Alsup has rightly noted, I want to have my cake and eat it
    too. I want an instruction format that is quick to fetch and decode,
    like a RISC format. I want RISC-like banks of 32 registers, and I
    want the CISC-like addressing modes of the IBM System/360, but with
    16-bit displacements, not 12-bit displacements.
    <
    My 66000 has all of this.
    <
    I want memory-reference instructions to still fit in 32 bits, despite
    asking for so much more capacity.
    <
    The simple/easy ones definitely, the ones with longer displacements no.
    <

    Yes.

    As noted a few times, as I see it, 9 .. 12 is sufficient.
    Much less than 9 is "not enough", much more than 12 is wasting
    entropy, at least for 32-bit encodings.
    <
    Can you suggest something I could have done by sacrificing 16-bits
    down to 12-bits that would have improved "something" in my ISA ??
    {{You see I did not have any trouble in having all 16-bits for MEM references--just like having 16-bits for integer, logical, and branch offsets.}}
    <
    12u-scaled would be "pretty good", say, being able to handle 32K for
    QWORD ops.
    <
    IBM 360 found so, EMBench is replete with stack sizes and struct sizes
    where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit... Exactly the difference between 12-bits and 14-bits....


    RISC-V is 12-bit signed unscaled (which can only do +/- 2K).

    On average, 12-bit signed unscaled is actually worse than 9-bit unsigned
    scaled (4K range, for QWORD).

    So, ironically, despite BJX2 having smaller displacements than RISC-V,
    it actually deals better with the larger stack frames.


    But, if one could address 32K, this should cover the vast majority of
    structs and stack-frames.


    A 16-bit unsigned scaled displacement would cover 512K for QWORD ops,
    which could be nice, but likely unnecessary.


    So what I had done was, after squeezing as much as I could into a basic >>>> instruction format, I provided for switching into alternate instruction >>>> formats which made different compromises by using the block headers.
    <
    Block headers are simply consuming entropy.
    <

    Also yes.


    This has now been dropped. Since I managed to get the normal
    (unaligned)
    memory-reference instruction squeezed into so much less opcode space
    that
    I also had room for the aligned memory-reference format without
    compromises
    in the basic instruction set, it wasn't needed to have multiple
    instruction
    formats.
    <
    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.
    <

    In my case, it is only for 128-bit load/store operations, which
    require 64-bit alignment.
    <
    VVM does all the wide stuff without necessitating the wide stuff in
    registers or instructions.
    <
    Well, and an esoteric edge case:
       if((PC&0xE)==0xE)
    You can't use a 96-bit encoding, and will need to insert a NOP if one
    needs to do so.
    <
    Ehhhhh...
    <

    This is mostly due to a quirk in the L1 I$ design, where "fixing" it
    costs more than just being like, "yeah, this case isn't allowed" (and
    having the compiler emit a NOP in the rare edge cases it is encountered).


    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...

    But, yeah, for me, a major selling points for unaligned access is mostly
    that I can copy blocks of memory around like:
    v0=((uint64_t *)cs)[0];
    v1=((uint64_t *)cs)[1];
    v2=((uint64_t *)cs)[2];
    v3=((uint64_t *)cs)[3];
    ((uint64_t *)ct)[0]=v0;
    ((uint64_t *)ct)[1]=v1;
    ((uint64_t *)ct)[2]=v2;
    ((uint64_t *)ct)[3]=v3;
    cs+=32; ct+=32;

    For Huffman, some of the fastest strategies to implement the bitstream reading/writing, tend to be to casually make use of unaligned access
    (shifting in and loading bytes is slower in comparison).

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).



    I had to change the instructions longer than 32 bits to get them in
    the basic instruction format, so now they're less dense.

    Block structure is still used, but now for only the two things it's
    actually needed for: reserving part of a block as unused for the
    pseudo-immediates, and for VLIW features (explicitly indicating
    parallelism, and instruction predication).

    The ISA is still tremendously complicated, since I've put room in it
    for
    a large assortment of instructions of all kinds, but I think it's
    definitely made a significant stride towards sanity.
    <
    Yet, mine remains simple and compact.
    <

    Mostly similar.
    Though, I guess some people could debate this in my case.


    Granted, I specify the entire ISA in a single location, rather than
    spreading it across a bunch of different documents (as was the case
    with RISC-V).

    Well, and where there is a lot that is left up to the specific
    hardware implementations in terms of stuff that one would need to
    "actually have an OS run on it", ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to BGB on Fri Nov 10 11:17:37 2023
    On 11/10/2023 10:24 AM, BGB wrote:
    On 11/10/2023 8:51 AM, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    As soon as you make 'general purpose registers' not 'general'
    you've significantly complicated register allocation in compilers
    and likely caused additional memory accesses due to the need to
    spill registers unnecessarily.

    Yeah.

    Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather suck".

    Or, even smaller cases, like, "most instructions can use all the
    registers, but these ops only work on a subset" is kind of an annoyance
    (this is a big part of why I bothered with the whole XG2 thing).


    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions. So
    an alternative is to break the requirement that all register specifier
    fields in the instruction be the same length. So, for example, allow
    access to all registers from one source operand position, but say only
    half from the other source operand position. So, for a system with 32 registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
    as with commutative operations like adds, this doesn't hurt at all.

    Yes, this makes register allocation in the compiler harder. And
    occasionally you might need an extra instruction to copy a value to the
    half size field, but on high end systems, this can be done in the rename
    stage without taking an execution slot.

    A more extreme alternative is to only allow the destination field to
    also be one bit smaller. Of course, this makes things even harder for
    the compiler, and probably requires extra "copy" instructions more
    frequently, but sometimes you just gotta do what you gotta do. :-(

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Fri Nov 10 22:03:23 2023
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.

    This would make your ISA very un-S/360-like.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Fri Nov 10 23:25:41 2023
    Thomas Koenig wrote:

    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    On 11/9/2023 3:51 PM, Quadibloc wrote:

    The 32 general registers aren't _quite_ general. They're divided into
    four groups of eight.

    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.
    <
    But follows S.E.L 32/{...} series and several other minicomputers with
    isolated base registers. In the 32/{..} series, there were 2 LDs and 2 STs:
    1 LD was byte (signed) with 19-bit displacement
    2 LD was size (signed) with the lower bits of displacement specifying size.
    3 ST was byte <ibid>
    4 ST was size <ibid>
    <
    only registers 1-7 could be used as base register.
    <
    I saw several others using similar tricks but can't remember.....
    <
    This would make your ISA very un-S/360-like.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Nov 10 23:21:08 2023
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <

    No, I am arguing that all memory references are inherently unaligned, but where
    aligned references never suffer a stall penalty; and the compiler does not
    need to understand whether the reference is aligned or unaligned.
    <
    But, yeah, for me, a major selling points for unaligned access is mostly
    that I can copy blocks of memory around like:
    v0=((uint64_t *)cs)[0];
    v1=((uint64_t *)cs)[1];
    v2=((uint64_t *)cs)[2];
    v3=((uint64_t *)cs)[3];
    ((uint64_t *)ct)[0]=v0;
    ((uint64_t *)ct)[1]=v1;
    ((uint64_t *)ct)[2]=v2;
    ((uint64_t *)ct)[3]=v3;
    cs+=32; ct+=32;
    <
    MM Rcs,Rct,#length // without the for loop
    <
    For Huffman, some of the fastest strategies to implement the bitstream reading/writing, tend to be to casually make use of unaligned access (shifting in and loading bytes is slower in comparison).

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).
    <
    Traps to perform unaligned are so 1985......either don't allow them at all (SIGSEGV) or treat them as first class citizens. The former fails in the market.
    <


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 10 20:37:38 2023
    On 11/10/2023 5:21 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <

    No, I am arguing that all memory references are inherently unaligned,
    but where aligned references never suffer a stall penalty; and the
    compiler does not need to understand whether the reference is aligned
    or unaligned.
    <

    OK, fair enough.

    I don't have separate aligned/unaligned ops for anything QWORD or
    smaller, as all these cases are implicitly unaligned.

    Though, aligned is sometimes a little faster, due to playing better with
    the L1 cache; but, using misaligned memory access is generally faster
    than any of the traditional workarounds (the difference being mostly a
    slight increase in the probability of triggering an L1 cache miss).


    The main exception is MOV.X requiring 64-bit alignment (for a 128-bit
    memory access), but the unaligned fallback here is to use a pair of
    MOV.Q instructions instead.

    But, this was in part because of how the L1 caches were implemented, and supporting fully unaligned 128-bit access would have been more expensive
    (and the relative gain is smaller).

    This does mean alternate logic for aligned vs unaligned "memcpy()", with
    the unaligned case being a little slower as a result of needing to use
    MOV.Q ops.


    It is possible a case could be made for allowing fully unaligned MOV.X
    as well.

    Would mostly involve reworking how MOV.X is implemented relative to the
    extract/insert logic (likely internally working with 192 bits rather
    than 128; as-is, MOV.X is implemented by bypassing the main
    extract/insert logic).


    But, yeah, for me, a major selling points for unaligned access is
    mostly that I can copy blocks of memory around like:
       v0=((uint64_t *)cs)[0];
       v1=((uint64_t *)cs)[1];
       v2=((uint64_t *)cs)[2];
       v3=((uint64_t *)cs)[3];
       ((uint64_t *)ct)[0]=v0;
       ((uint64_t *)ct)[1]=v1;
       ((uint64_t *)ct)[2]=v2;
       ((uint64_t *)ct)[3]=v3;
       cs+=32; ct+=32;
    <
        MM   Rcs,Rct,#length            // without the for loop <

    I typically use a "while()" loop or similar, but yeah...

    At present, the fastest loop strategy is generally:
    while(n--)
    {
    ...
    }
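
    (Putting the two together, a hedged sketch of the copy idiom being
    described: 32 bytes per iteration with the count handled via while(n--).
    Names follow the earlier snippet, n counts 32-byte blocks, and the target
    is assumed to tolerate unaligned 64-bit accesses, as discussed above.)

    #include <stdint.h>
    #include <stddef.h>

    void copy32n(unsigned char *ct, const unsigned char *cs, size_t n)
    {
        uint64_t v0, v1, v2, v3;
        while(n--)
        {
            v0=((const uint64_t *)cs)[0];  v1=((const uint64_t *)cs)[1];
            v2=((const uint64_t *)cs)[2];  v3=((const uint64_t *)cs)[3];
            ((uint64_t *)ct)[0]=v0;  ((uint64_t *)ct)[1]=v1;
            ((uint64_t *)ct)[2]=v2;  ((uint64_t *)ct)[3]=v3;
            cs+=32; ct+=32;
        }
    }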




    For Huffman, some of the fastest strategies to implement the bitstream
    reading/writing, tend to be to casually make use of unaligned access
    (shifting in and loading bytes is slower in comparison).

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive
    cores and similar).
    <
    Traps to perform unaligned are so 1985......either don't allow them at all (SIGSEGV) or treat them as first class citizens. The former fails in the market.
    <


    Apparently SiFive went this way, for some reason...

    Like, RISC-V requires unaligned access to work, but doesn't specify how,
    and apparently they considered trapping to be an acceptable option, but trapping sucks for performance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Sat Nov 11 05:39:59 2023
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:

    Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.

    No doubt you're right.

    As that means my 16-bit instructions, with the registers split into four
    parts, are useless to compilers, now I have to go around in circles again.
    I thought I had finally achieved a single instruction format that satisfied
    my ambitions - and now I find it is fatally flawed.

    One possibility is to go back to the full format for 32-bit memory
    reference instructions. That will still leave me enough opcode space that a four-bit prefix could precede three 20-bit short instructions. To avoid creating a variable-length instruction set, which complicates decoding,
    I would require such blocks to be aligned on 64-bit boundaries.

    So now there's a nested block structure, of 64-bit blocks inside 256-bit blocks!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Quadibloc on Sat Nov 11 07:07:00 2023
    In article <uikbng$2lh5f$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    Well, I think that superscalar operation of microprocessors is a
    good thing.

    Indeed.

    Explicitly indicating which instructions may execute in parallel
    is one way to facilitate that. Even if the Itanium was an
    unsuccessful implementation of that principle.

    Intel tried that with the Pentium, with its two pipelines and run-time automatic instruction scheduling, to moderate success. They tried it with
    the i860, with compiler scheduling and a comprehensive lack of success.
    The Itanium tried the i860 method, much harder and was still unsuccessful.


    In engineering, the gap between "Doing this would be good" and "Here it
    is working" generally involves having a good idea about /how/ to do it.

    Finding an example where explicit but non-automatic parallelism worked
    for general-purpose code and figuring out how that was done should be
    easier than inventing a method. In the absence of that, we have some
    evidence that just hoping the software people will solve this problem for
    you doesn't work.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Quadibloc on Sat Nov 11 06:50:00 2023
    In article <uijk93$2dc2i$2@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    This lends itself to writing code where four distinct threads are interleaved, helping pipelining in implementations too cheap to have out-of-order execution.

    This is not the conventional way of implementing threads, and seems to
    have some drawbacks:

    One of the uses of threads is to scale to the hardware resources
    available. With this approach, the number of threads is baked in at
    compile time.

    Debugging such interleaved threads is likely to be even more confusing
    than debugging multiple threads usually is.

    Pipeline stalls affect every thread, rather than just the thread that
    triggers them.

    The common threading APIs also lack a way to set such threads to work,
    but that's a far more soluble problem.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Nov 11 07:22:21 2023
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing

    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).

    Let's see what this SiFive U74 does:

    [fedora-starfive:~/nfstmp/gforth-riscv:98397] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye ':

    469832112 instructions:u # 0.79 insn per cycle
    591015904 cycles:u

    0.609751748 seconds time elapsed

    0.533195000 seconds user
    0.061522000 seconds sys


    [fedora-starfive:~/nfstmp/gforth-riscv:98398] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye ':

    53533370273 instructions:u # 0.77 insn per cycle
    69304924487 cycles:u

    69.368484169 seconds time elapsed

    69.256290000 seconds user
    0.049997000 seconds sys

    So when we do aligned accesses (first command), the code performs 4.7 instructions and 5.9 cycles per load, while for unaligned accesses
    (second command) the same code performs 535.3 instructions and 693.0
    cycles per load. So apparently an unaligned load triggers >500
    additional instructions, confirming your claim. Interestingly, all
    that is attributed to user time; maybe the fixup is performed by a
    user-level trap or microcode.

    Still, the approach of having separate instructions for aligned and
    unaligned accesses (typically with several instructions for the
    unaligned case) has been tried and discarded. Software just does not
    declare that some access will be unaligned.

    A particularly strong evidence for this is that gas generated
    non-working code for ustq (unaligned store quadword) on Alpha for
    several years, and apparently nobody noticed until I gave an exercise
    to my students where they should use ustq (so no production use,
    either).

    So, every general-purpose architecture, including RISC-V, the
    spiritual descendant of MIPS and Alpha (which had the division),
    settled on having memory access instructions that perform both aligned
    and unaligned accesses (with performance advantages for aligned
    accesses).

    If RISC-V implementations want to perform well for code that uses
    unaligned accesses for memory copying, compression/decompression, or
    hashing, they will eventually have to implement unaligned accesses
    more efficiently, but at least the code works, and aligned accesses
    are fast.

    Why would you not go the same way? It would also save on instruction
    encoding space.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Sat Nov 11 03:03:18 2023
    On 11/11/2023 1:22 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing


    Possibly true.


    Some of my data hash/checksum functions were along the lines of:
    uint32_t *cs, *cse;
    uint64_t v0, v1, v;

    cs=buf; cse=buf+((sz+3)>>2);
    v0=1; v1=1;
    while(cs<cse)
    {
        v=*cs++;
        v0+=v;
        v1+=v0;
    }
    v0=((uint32_t)v0)+(v0>>32); //*
    v1=((uint32_t)v1)+(v1>>32);
    v0=((uint32_t)v0)+(v0>>32);
    v1=((uint32_t)v1)+(v1>>32);
    v=(uint32_t)(v0^v1);

    *: This step may seem frivolous, but seems to increase the strength of
    the checksum.

    There are faster variants, but this one gives the general idea.
    Not aware of anyone else doing it this way, but it is faster than either Adler32 or CRC32, while giving some similar properties (the second sum detecting various issues which would be missed with a single sum).

    A faster variant of this is to run multiple sets of sums in parallel
    and then combine the values at the end.
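
    (A hedged sketch of that parallel variant, my own paraphrase rather than
    the actual code; note it does not produce the same value as the serial
    version above, since the words are interleaved into two sum pairs before
    being folded together.)

    #include <stdint.h>
    #include <stddef.h>

    uint32_t checksum2(const uint32_t *buf, size_t sz)
    {
        const uint32_t *cs=buf, *cse=buf+((sz+3)>>2);
        uint64_t a0=1, a1=1, b0=1, b1=1, v0, v1, v;
        while((cse-cs)>=2)
        {
            v=cs[0]; a0+=v; a1+=a0;     /* first sum pair  */
            v=cs[1]; b0+=v; b1+=b0;     /* second sum pair */
            cs+=2;
        }
        if(cs<cse)
        {   v=*cs; a0+=v; a1+=a0;   }
        v0=a0+b0; v1=a1+b1;             /* combine the pairs, then fold 64->32 */
        v0=((uint32_t)v0)+(v0>>32);
        v1=((uint32_t)v1)+(v1>>32);
        v0=((uint32_t)v0)+(v0>>32);
        v1=((uint32_t)v1)+(v1>>32);
        return (uint32_t)(v0^v1);
    }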


    Though, all this falls on its face, if encountering a CPU that uses
    traps to emulate unaligned access (apparently a lot of the SiFive cores
    and similar).

    Let's see what this SiFive U74 does:

    [fedora-starfive:~/nfstmp/gforth-riscv:98397] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye ':

    469832112 instructions:u # 0.79 insn per cycle
    591015904 cycles:u

    0.609751748 seconds time elapsed

    0.533195000 seconds user
    0.061522000 seconds sys


    [fedora-starfive:~/nfstmp/gforth-riscv:98398] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye "

    Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye ':

    53533370273 instructions:u # 0.77 insn per cycle
    69304924487 cycles:u

    69.368484169 seconds time elapsed

    69.256290000 seconds user
    0.049997000 seconds sys

    So when we do aligned accesses (first command), the code performs 4.7 instructions and 5.9 cycles per load, while for unaligned accesses
    (second command) the same code performs 535.3 instructions and 693.0
    cycles per load. So apparently an unaligned load triggers >500
    additional instructions, confirming your claim. Interestingly, all
    that is attributed to user time; maybe the fixup is performed by a
    user-level trap or microcode.


    I wasn't that sure how it was implemented, but it is "kinda weak" in any
    case.

    On the BJX2 core, the performance impact of using misaligned load and
    store is approximately 3% in my tests, I suspect mostly due to a
    slightly higher incidence of L1 cache misses.


    Still, the approach of having separate instructions for aligned and
    unaligned accesses (typically with several instructionf for the
    unaligned case) has been tried and discarded. Software just does not
    declare that some access will be unaligned.

    A particularly strong evidence for this is that gas generated
    non-working code for ustq (unaligned store quadword) on Alpha for
    several years, and apparently nobody noticed until I gave an exercise
    to my students where they should use ustq (so no production use,
    either).

    So, every general-purpose architecture, including RISC-V, the
    spiritual descendent of MIPS and Alpha (which had the division),
    settled on having memory access instructions that perform both aligned
    and unaligned accesses (with performance advantages for aligned
    accesses).

    If RISC-V implementations want to perform well for code that uses
    unaligned accesses for memory copying, compression/decompression, or
    hashing, they will eventually have to implement unaligned accesses
    more efficiently, but at least the code works, and aligned accesses
    are fast.

    Why would you not go the same way? It would also save on instruction encoding space.


    I was never claiming that one should have separate instructions (since,
    if the L1 cache supports unaligned access, what is the point of having
    aligned only variants of the instructions?...).


    Rather, that it might make sense to do an aligned-only core, and then
    trap on misaligned (possibly allowing the access to be emulated, as with
    the SiFive cores); mostly in the name of making the L1 cache cheaper.


    A few of my small core experiments had used aligned-only L1 caches, but
    I mostly went with a natively unaligned designs for my bigger ISA
    designs, mostly as I tend to make frequent use of unaligned memory
    access as a "performance trick".



    However, BJX2 has a natively unaligned L1 cache (well, apart from MOV.X).

    Have gone and added the logic to allow MOV.X to be unaligned as well,
    which mostly has the effect of a minor increase in LUT cost and similar
    (mostly as the internal extract/insert logic needed to be widened from
    128 to 192 bits to deal with this; with MOV.X now being handled in a
    similar way to MOV.Q when this feature is enabled).


    Though, one thing is whether to "formally fix" the Op96 at
    ((PC&0xE)==0xE) issue. Ironically, in this case, the "fix" is already
    present in the Verilog code; the restriction just exists more as a
    "break glass to save some LUTs" option.


    Well, along with some other wonk, like leaving it as undefined what
    happens if the instruction stream is allowed to cross a 4GB boundary,
    ... Branching is fine, just the PC increment logic can save some latency
    by not bothering with the high 16 bits.

    I guess, in an ideal world, there wouldn't be a lot of this wonk, but
    needing to often battle with timing constraints and similar does create incentive for corner cutting in various areas.


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sat Nov 11 08:37:00 2023
    In article <2023Nov11.082221@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Let's see what this SiFive U74 does:
    ...

    So apparently an unaligned load triggers >500 additional instructions, confirming your claim.

    Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
    least obvious. Slowdowns like this will be a major drag on performance,
    simply because finding them all is tricky.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Sat Nov 11 10:22:54 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.082221@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    So apparently an unaligned load triggers >500 additional instructions,
    confirming your claim.

    Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
    least obvious.

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned accesses,
    and then compiled by package maintainers (who often are not that
    familiar with the software) on a lot of platforms, the end result was
    that the kernel by default performed a fixup (and put a message in the
    dmesg buffer) instead of delivering a SIGBUS.

    There was a system call for switching to the SIGBUS behaviour. On
    Tru64 OSF/1 (or whatever it is called this week), the default
    behaviour was to SIGBUS, but it had the same system call, and a
    shell-level tool "uac" to change the behaviour to fix it up. I
    implemented a tool "uace" for Linux that can be used for running a
    process with the SIGBUS behaviour that you desire: <https://www.complang.tuwien.ac.at/anton/uace.c>. Maybe something
    similar is possible on the U74.

    Anyway, it seems that the problems was not a big one on Linux-Alpha
    (messages about unaligned accesses were not that frequent).
    Apparently the large majority of code performs aligned accesses. It's
    just that there are a few unaligned ones.

    I would not worry about cores like the U74 (and I have a program that
    uses unaligned accesses for hashing); that's just a stepping stone for
    getting more capable RISC-V cores, and at some point (before RISC-V
    becomes mainstream) the trapping will be replaced with something more efficient.

    We have seen the same development on AMD64. The Penryn
    (second-generation Core 2) takes 159 cycles for an unaligned load that
    crosses a page boundary, the Sandy Bridge takes 28 <http://al.howardknight.net/?ID=143135464800>. The Sandy Bridge and
    Ivy Bridge take 200 cycles for an unaligned page-crossing store,
    Haswell and Skylake take 25 and 24.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Nov 11 11:11:46 2023
    BGB <cr88192@gmail.com> writes:
    On 11/11/2023 1:22 AM, Anton Ertl wrote:
    Hashing


    Possibly true.

    Definitely true: The data you want to hash may be aligned to byte
    boundaries (e.g., strings), but a fast hash function loads it at the
    largest granularity possible and also processes the loaded values at
    the largest granularity possible.

    And in contrast to block copying, where you can do some prelude, then
    perform aligned accesses, and then a postlude (at least on one side of
    the copying), for this kind of hashing you want to have, in the first
    step, the first n bytes in a register, because the first byte
    influences the hash function result differently than the second byte.

    What you could do is load aligned into a shift buffer (in a register),
    and then use something like AMD64's shld to get the data in the needed
    form. Same for the second side of block copying. But is this faster
    on modern CPUs?
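
    (For reference, a minimal C sketch of the aligned-load-plus-shift approach
    being described, assuming a little-endian target; note it may read up to 7
    bytes past the requested range, so the caller must ensure the containing
    aligned words are readable.)

    #include <stdint.h>

    static uint64_t load64_via_aligned(const unsigned char *p)
    {
        uintptr_t a = (uintptr_t)p, off = a & 7;
        const uint64_t *q = (const uint64_t *)(a - off);
        uint64_t lo = q[0];
        if(off == 0)
            return lo;
        uint64_t hi = q[1];                     /* second aligned word */
        return (lo >> (8*off)) | (hi << (64 - 8*off));
    }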

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sat Nov 11 16:53:00 2023
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug. However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment enforcement turned on, and it's a huge job.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Sat Nov 11 18:11:04 2023
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions. So
    an alternative is to break the requirement that all register specifier
    fields in the instruction be the same length. So, for example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.
    <
    access to all registers from one source operand position, but say only
    half from the other source operand position. So, for a system with 32 registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
    as with commutative operations like adds, this doesn't hurt at all.

    Yes, this makes register allocation in the compiler harder. And
    occasionally you might need an extra instruction to copy a value to the
    half size field, but on high end systems, this can be done in the rename stage without taking an execution slot.

    A more extreme alternative is to only allow the destination field to
    also be one bit smaller. Of course, this makes things even harder for
    the compiler, and probably requires extra "copy" instructions more frequently, but sometimes you just gotta do what you gotta do. :-(

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Anton Ertl on Sat Nov 11 11:30:19 2023
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important for a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of an L2
    cache line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.
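
    (A minimal C11 sketch of those two steps, assuming a 64-byte line size:
    each element is padded out to a full line and the array is line-aligned,
    so no two elements ever share a cache line.)

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumption; query the target if it matters */

    typedef struct {
        alignas(CACHE_LINE) uint64_t counter;       /* the actual payload     */
        char pad[CACHE_LINE - sizeof(uint64_t)];    /* pad element to one line */
    } padded_counter;

    static padded_counter per_thread[16];   /* one line per slot, no false sharing */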

    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB-Alt@21:1/5 to MitchAlsup on Sat Nov 11 14:33:20 2023
    On 11/11/2023 12:11 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.
    <

    Or, a similar role is served by my Jumbo-Op64 prefix.

    So, there are two different Jumbo prefixes:
    Jumbo-Imm, which mostly just makes the immed/disp field bigger;
    Jumbo-Op64, which mostly extends the opcode and other things;
    May extend immediate, but less so, and that is not its main purpose.

    Op64 also does, optionally:
    Being the original mechanism to address R32..R63, before XGPR and XG2
    encodings were added, and needed (in Baseline) for the parts of the ISA
    not covered by the XGPR encodings;
    Adds a potential 4th register, extra displacement (or smaller Immed
    extension), or rounding-mode / opcode bits (depending on the base
    instruction).

    As-is, 8 bits in the Op64 prefix are Must Be Zero; they are
    designated specifically towards expanding the opcode space (with the 00
    case designated as mapping to the same instruction as in the basic
    32-bit encoding).


    access to all registers from one source operand position, but say only
    half from the other source operand position.  So, for a system with 32
    registers, you would need 5 plus 5 plus 4 bits.  Much of the time,
    such as with commutative operations like adds, this doesn't hurt at all.

    Yes, this makes register allocation in the compiler harder.  And
    occasionally you might need an extra instruction to copy a value to
    the half size field, but on high end systems, this can be done in the
    rename stage without taking an execution slot.

    A more extreme alternative is to only allow the destination field to
    also be one bit smaller.  Of course, this makes things even harder for
    the compiler, and probably requires extra "copy" instructions more
    frequently, but sometimes you just gotta do what you gotta do. :-(

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Sat Nov 11 21:28:05 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug.
    However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
    it only affects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Chris M. Thomasson on Sat Nov 11 21:22:00 2023
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For elements smaller than a cache line, that makes little
    sense, as written. I think there is an unwritten assumption
    "for elements larger than a cache line" there, or we would all
    be using 64-byte bools.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Nov 11 22:53:09 2023
    Chris M. Thomasson wrote:


    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.
    <
    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned container. Only aligned containers possess ATOMIC-smelling properties.
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Thomas Koenig on Sat Nov 11 14:23:51 2023
    On 11/11/2023 1:22 PM, Thomas Koenig wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache
    line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For smaller elements smaller than a cache line, that makes little
    sense. as written. I think there is an unwritten assumption
    "for elements larger than cache line" there, or we would all
    be using 64-byte bools.

    :^). Basically, I am thinking along the lines of cache line allocators
    that return properly aligned and padded l2 lines. Aligning and padding
    on l2 lines helps get rid of any nasty false sharing. Remember those
    damn hyperthreaded intel processors that had 128 byte l2 lines, but
    could falsely share the low 64 bytes with the high 64 bytes? Iirc, Intel
    had a workaround that involved offsetting a thread's stack using alloca.
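
    A toy sketch of such an allocator (a 64-byte line size and C++17's
    std::aligned_alloc are assumed here; a real allocator would of course
    batch these rather than call malloc-style routines per line):

        #include <cstddef>
        #include <cstdlib>
        #include <cstring>

        constexpr std::size_t kLine = 64;   // assumed L2 line size

        // Hand out one whole, zeroed, line-aligned cache line per request, so
        // the caller's object can never falsely share a line with a neighbor.
        void *alloc_line()
        {
            void *p = std::aligned_alloc(kLine, kLine);
            if (p) std::memset(p, 0, kLine);
            return p;
        }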

    Also, see what happens if you straddle an l2 cache line and use it for a
    LOCK'ed atomic RMW on Intel. It just might assert a bus lock.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Thomas Koenig on Sat Nov 11 14:28:48 2023
    On 11/11/2023 1:22 PM, Thomas Koenig wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:
    On 11/10/2023 11:22 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:
    One can argue that aligned-only allows for a cheaper L1 D$, but also "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...

    Hashing
    [...]

    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache
    line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For smaller elements smaller than a cache line, that makes little
    sense. as written. I think there is an unwritten assumption
    "for elements larger than cache line" there, or we would all
    be using 64-byte bools.

    Also, think about the atomic state for a mutex. Say:

    <pseudo-code>

    struct mutex_atomic_state
    {
    std::atomic<word> m_state;
    };

    Well, you want this state to be aligned on a cache line boundary and
    padded up to the size of a cache line. You want to avoid false sharing
    between this state and any user state used in the locked region.
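
    A hedged C++ rendering of that idea (64-byte line size assumed, struct
    name taken from the pseudo-code above):

        #include <atomic>
        #include <cstdint>

        struct alignas(64) mutex_atomic_state
        {
            std::atomic<std::uint32_t> m_state{0};
            // alignas(64) pads sizeof up to a full line, so nothing touched
            // inside the locked region can share this line with m_state.
        };
        static_assert(sizeof(mutex_atomic_state) == 64,
                      "state owns its own cache line");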

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sat Nov 11 22:58:32 2023
    Thomas Koenig wrote:

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> schrieb:


    Fwiw, proper alignment is very important wrt a programmer to gain some
    of the benefits of, basically, virtually "any" target architecture. For
    instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
    the programmer can set up an array that is aligned on a cache line
    boundary and pad each element of said array up to the size of a L2 cache
    line.

    Two steps... Align your memory on a proper cache line boundary, and pad
    the size of each element up to the size of a single cache line.

    For smaller elements smaller than a cache line, that makes little
    sense. as written. I think there is an unwritten assumption
    "for elements larger than cache line" there, or we would all
    be using 64-byte bools.
    <
    Then consider a 4-way banked cache (¼ cache line per bank) and an access
    that straddles a ¼ line boundary and multiple AGEN units. So one AGEN
    unit creates the access to the container which straddles the boundary
    while another creates an access into the second part of the spanning
    access.
    <
    Then consider that "program order" information is not instantaneously available, and the bank selector picks the second access. Now, that
    spanning access is no longer ATOMIC, and might even see a Snoop between
    its first access and its spanning access...............
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lurndal on Sat Nov 11 22:47:00 2023
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:

    I dug into what it would take to have x86-64 Linux work with
    alignment enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run with alignment trapping. Should it be EL0 for usermode?

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Nov 12 10:34:04 2023
    According to Quadibloc <quadibloc@servername.invalid>:
    What do you consider the virtues of Itanium to be?

    Well, I think that superscalar operation of microprocessors is a good
    thing. Explicitly indicating which instructions may execute in parallel
    is one way to facilitate that. Even if the Itanium was an unsuccessful
    implementation of that principle.

    I knew the people at Yale who invented trace scheduling and started Multiflow.

    It was and is a very clever technique for the kind of computers we could build
    in the 1980s. It works really well for programs with regular memory access patterns, not so well for programs without. Once we could build enough transistors to do dynamic memory and instruction scheduling, why try to
    do it at compile time?

    I gather it is still useful for embedded or realtime applications which
    are fairly regular and for cost or power reasons you want to minimize
    the number of transistors.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Sun Nov 12 13:59:06 2023
    John Levine <johnl@taugh.com> writes:
    I gather it is still useful for embedded or realtime applications which
    are fairly regular and for cost or power reasons you want to minimize
    the number of transistors.

    Even there, VLIW-inspired CPUs like the Philips Trimedia have been
    terminated, and I have not heard much about TI's C6000 lately. Both NXP
    (spun off from Philips) and TI seem to bet heavily on ARM.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Sun Nov 12 14:08:11 2023
    jgd@cix.co.uk (John Dallman) writes:
    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    I did a first attempt in the IA-32 days, and there I found that the
    alignment requirements of the hardware were incompatible with the ABI
    (which required 4-byte alignment for 8-byte FP numbers).

    My second attempt was with AMD64, and there I found that gcc produced misaligned 16-bit memory accesses for stuff like strcpy(buf, "a"). I
    did not try to disable this with a flag at the time, but maybe -fno-tree-vectorize would help. But even if I use that for my code, I
    would also have to recompile all the libraries with that flag.

    Another problem (on both platforms) were memcpy, memmove, etc., but I
    expected that one could link with alignment-clean versions. But I
    don't know how many functions are affected.

    I would be surprised if ARM A64 did not have the same problems (except
    the idiotic incompatibility between Intel ABI and Intel hardware).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Nov 12 14:54:56 2023
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    jgd@cix.co.uk (John Dallman) writes:
    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    I did a first attempt in the IA-32 days, and there I found that the
    alignment requirements of the hardware were incompatible with the ABI
    (which required 4-byte alignment for 8-byte FP numbers).

    This is a very old problem. S/360 was the first byte addressed machine
    and required aligned operands. They immediately realized that Fortran
    programs that used COMMON or EQUIVALENCE often forced 8-byte FP onto
    4-byte boundaries. The Fortran library had a hack that caught the
    alignment fault and fixed it up very slowly. But they quickly dealt
    with it in hardware. The 360/85, which brought us caches, also had "byte-
    oriented operands", i.e. misaligned, and it was carried into all
    subsequent 370 and later machines.

    It makes some sense that they did so since caches greatly decrease the
    cost of misaligned operands.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sun Nov 12 16:24:00 2023
    In article <2023Nov12.150811@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    jgd@cix.co.uk (John Dallman) writes:
    I dug into what it would take to have x86-64 Linux work with
    alignment enforcement turned on, and it's a huge job.

    I did a first attempt in the IA-32 days, and there I found that the
    alignment requirements of the hardware were incompatible with the
    ABI (which required 4-byte alignment for 8-byte FP numbers).

    By the time I was running short of alignment-sensitive platforms, x86-64
    was well established, and 64-bit is preferable for this kind of
    bug-hunting since accidental correct alignment is rarer.

    My second attempt was with AMD64, and there I found that gcc
    produced misaligned 16-bit memory accesses for stuff like
    strcpy(buf, "a"). I did not try to disable this with a flag
    at the time, but maybe -fno-tree-vectorize would help. But
    even if I use that for my code, I would also have to recompile
    all the libraries with that flag.

    I reached similar conclusions, reckoning that I'd need to rebuild the
    Linux userland for the job, at minimum. An alternative is to wrap all
    calls to system libraries and turn alignment traps off and on there,
    which would be easier, given I have a well-defined set of software to
    test.

    I would be surprised if ARM A64 did not have the same problems
    (except the idiotic incompatibility between Intel ABI and Intel
    hardware).

    Yup. I have a lot more x86-64 hardware available, so it would be the
    choice, if I didn't have so many more urgent projects to do.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Sun Nov 12 17:21:51 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:

    I dug into what it would take to have x86-64 Linux work with
    alignment enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run with
    alignment trapping. Should it be EL0 for usermode?

    The EL1 in the register name describes the minimum exception level
    allowed to access the register. SCTLR_EL1 includes control bits
    for both EL1 and EL0.
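
    For what it's worth, a hedged sketch of what flipping that bit looks like
    from EL1 (kernel) code; plain EL0 application code cannot do this, which
    is exactly the difficulty discussed elsewhere in the thread:

        #include <cstdint>

        // EL1 only: enable alignment fault checking for the EL1&0 regime by
        // setting SCTLR_EL1.A (bit 1).
        static inline void enable_alignment_checks()
        {
            std::uint64_t sctlr;
            asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
            sctlr |= std::uint64_t{1} << 1;                  // the A bit
            asm volatile("msr sctlr_el1, %0" : : "r"(sctlr));
            asm volatile("isb");                             // synchronize
        }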

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lurndal on Sun Nov 12 17:40:00 2023
    In article <PQ74N.100$ayBd.39@fx07.iad>, scott@slp53.sl.home (Scott
    Lurndal) wrote:

    jgd@cix.co.uk (John Dallman) writes:
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:
    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run
    with alignment trapping. Should it be EL0 for usermode?

    The EL1 in the register name describes the minimum exception level
    allowed to access the register. SCTLR_EL1 includes control bits
    for both EL1 and EL0.

    Aha. It's harder for ARM64: I'd have to be in supervisor mode to set that
    bit, and the stuff I work on is strictly application code.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 12 17:27:36 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:


    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.
    <
    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
    container. Only aligned containers possess ATOMIC-smelling properties.
    <

    That is indeed the case. Consider the effect of a page fault when
    an unaligned access crosses a page boundary, for example; leaving
    aside, of course, all the difficulties inherent in dealing with
    atomicity when the access spans two cache lines.

    ARM implementations of LL/SC (Load Exclusive/Store Exclusive) can
    have an arbitrarily sized reservation granule (ARM's Cortex-M7,
    for example, has a single reservation granule the size of the
    full address space). Any store between the loadex and storex
    instructions is allowed by the architecture (V7 and V8) to cause
    the storex to fail.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB-Alt on Sun Nov 12 20:55:27 2023
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:

    Errm, splitting up registers like this is likely to hurt far more than anything that 16-bit displacements are likely to gain.

    Unless, maybe, registers were being treated like a stack, but even then,
    this is still gonna suck.

    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.

    This led me to seriously reconsider the path down which I was
    heading.

    I had tried, with all sorts of ingenious compromises of register spaces and
    the like, to fit all the capabilities I wanted into the opcode space of a single version of the instruction set, eliminating the need for blocks
    which contained instructions belonging to alternate versions of the
    instruction set.

    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    At first, when I mulled over this, I came up with multiple ideas to address
    it, each one crazier than the last.

    Seeing, therefore, that this was a difficult nut to crack, and not wanting
    to go down in another wrong direction... instead, I found a way to go that seemed to me to be reasonably sensible.

    Go back to uncompromised 32-bit instructions, even though that means there
    are no 16-bit instructions.

    Then, bring back short instructions - effectively 17 bits long - so as to
    have room for full register specifications. This means an alternative block format where 16, 32, 48, 64... bit instructions are all possible.

    *But* because of the room 17-bit short instructions take up in the header,
    the 32-bit instructions are the same regular format as in the other case.
    Not some kind of 33-bit or 35-bit instruction with a new set of instruction formats.

    So, even though there are now two formats for code instead of one, one is merely the 32-bit subset of the other, so that although I have taken a step back in order to take steps forward, it still isn't too far back.

    I'm _trying_ to keep a lid on the extravagances in Concertina II, even if
    using the word "sanity" in the same breath with it may be considered inappropriate...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sun Nov 12 21:28:11 2023
    BGB wrote:

    On 11/9/2023 10:37 PM, Quadibloc wrote:


    For performance optimized cases, I am starting to suspect 16-bit ops are
    not worth it.
    <
    BINGO:: another near convert.......
    <
    For size optimization, they make sense; but size optimization also means mostly confining register allocation to R0..R15 in my case, with
    heuristics for when to enable additional registers, where enabling the
    higher registers effectively hinders the use of 16-bit instructions.


    The other option I have found is that, rather than optimizing for
    smaller instructions (as in an ISA with 16 bit instructions), one can
    instead optimize for doing stuff in as few instructions as it is
    reasonable to do so, which in turn further goes against the use of
    16-bit instructions.
    <
    This is the My 66000 path: execute fewer instructions even if they take
    the same number of bytes in .text.
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Nov 12 21:25:20 2023
    Quadibloc wrote:

    On Fri, 10 Nov 2023 01:11:13 +0000, MitchAlsup wrote:

    I never had any aligned memory references. The HW overhead to "fix" the
    problem is so small as to be compelling.

    Since I have a complete set of memory-reference instructions for which unaligned memory-reference instructions are supported, the problem isn't
    that I think unaligned fetches and stores take too many gates.

    Rather, being able to only specify aligned accesses saves *opcode space*,
    <
    I am not buying this. Which takes more opcode space::
    a) an ISA with unaligned only LDs and STs (11)
    or
    b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs (another 11)
    <
    It is a simple entropy (allocated counting) problem
    <
    which lets me fit in one complete set of memory-reference instructions that can use all the base registers, all the index registers, and always use all the registers as destination registers.

    While the unaligned-capable instructions, that offer also important additional addressing modes, had to have certain restrictions to fit in.

    So they use six out of the seven index registers, they can use only half
    the registers as destination registers on indexed accesses, and they use
    four out of the seven base registers.

    Having 16-bit instructions for the possibility of more compact code meant that I had to have at least one of the two restrictions noted above -
    having both restrictions meant that I could offer the alternative of aligned-only instructions with neither restriction, which may be far less painful for some.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sun Nov 12 21:35:11 2023
    BGB wrote:

    On 11/10/2023 8:51 AM, Scott Lurndal wrote:


    As for register arguments:
    * Probably 8 or 16.
    ** 8 makes the most sense with 32 GPRs.
    *** 16 is asking too much.
    *** 8 deals with around 98% of functions.
    ** 16 makes sense with 64 GPRs.
    *** Nearly all functions can use exclusively register arguments.
    *** Gain is small though, if it only benefits 2% of functions.
    *** It is almost a "shoe in", except for cost of fixed spill space
    *** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
    *** Though, an ABI could decide to not have a spill space in this way.
    <
    For the reasons stated above (some clipped) I agree with this whole block of statements.
    <
    Since My 66000 has 32 registers, I went with up to 8 arguments in registers,
    up to 8 results in registers, with the 9th of either on-the-stack in such a
    way that if the callee is vararg the argument registers can be pushed on the
    stack to form a memory-resident vector of arguments {{just perfect for printf().}}
    <
    With 8 registers covering the 98%-ile of calls, there is too little left
    to gain by making this boundary 12-16, both of which ARE still possible.
    <
    Though, admittedly, for a lot of my programs I had still ended up going
    with 8 register arguments with 64 GPRs, mostly as the gains of 16
    arguments is small, relative of the cost of spending an additional 64
    bytes in nearly every stack frame (and also there are still some
    unresolved bugs when using 16 argument mode).
    <
    It is a delicate balance and it is easy to make the code look better
    while actually running slower.
    <
    ....



    Current leaning is also that:
    32-bit primary instruction size;
    32/64/96 bit for variable-length instructions;
    Is "pretty good".

    In performance-oriented use cases, 16-bit encodings "aren't really worth
    it".
    In cases where you need a 32 or 64 bit value, being able to encode them
    or load them quickly into a register is ideal. Spending multiple
    instructions to glue a value together isn't ideal, nor is needing to
    load it from memory (this particularly sucks from the compiler POV).


    As for addressing modes:
    (Rb, Disp) : ~ 66-75%
    (Rb, Ri) : ~ 25-33%
    Can address the vast majority of cases.

    Displacements are most effective when scaled by the size of the element
    type, as unaligned displacements are exceedingly rare. The vast majority
    of displacements are also positive.

    Not having a register-indexed mode is shooting oneself in the foot, as
    these are "not exactly rare".

    Most other possible addressing modes can be mostly ignored.
    Auto-increment becomes moot if one has superscalar or VLIW;
    (Rb, Ri, Disp) is only really applicable in niche cases
    Eg, array inside struct, etc.
    ...



    RISC-V did sort of shoot itself in the foot in several of these areas,
    albeit with some workarounds in "Bitmanip":
    SHnADD, can mimic a LEA, allowing array access in fewer ops.
    PACK, allows an inline 64-bit constant load in 5 instructions...
    LUI+ADD+LUI+ADD+PACK
    ...

    Still not ideal...

    An extra cycle for memory access is not ideal for a close second place addressing mode; nor are 64-bit constants rare enough that one
    necessarily wants to spend 5 or so clock cycles on them.

    But, still better than the situation where one does not have these instructions.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sun Nov 12 21:37:39 2023
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <
    I have not argued for aligned memory references since about 2000 (maybe as early as 1991).
    <

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to Scott Lurndal on Sun Nov 12 22:18:31 2023
    In article <FlS3N.25739$_Oab.3565@fx15.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug.
    However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
    it only affects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    On Aarch64, with GCC at least, you also need to specify "-mstrict-align"
    when compiling all source code, to prevent the compiler from assuming it
    can access structure fields in an unaligned way, even if all of your
    code accesses are fully aligned. GCC can mess around behind your back, changing ptr->array32[1] = 0 and ptr->array32[2] = 0 into a single
    64-bit write of ptr->array32[1] = 0, among other things. If the offset
    of array32[1] wasn't 64-bit aligned, it's an alignment trap if
    SCTLR_EL1.A=1.
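
    A small illustration of the kind of code being described (the struct
    layout here is hypothetical, chosen so that array32[1] is 4-byte aligned
    but not necessarily 8-byte aligned):

        #include <cstdint>

        struct Frame {
            std::uint32_t array32[4];   // array32[1] is at offset 4
        };

        void clear_pair(Frame *ptr)
        {
            // Without -mstrict-align, GCC may fuse these into one 64-bit
            // store at &ptr->array32[1], which is not guaranteed to be
            // 8-byte aligned; with SCTLR_EL1.A=1 that merged store faults.
            ptr->array32[1] = 0;
            ptr->array32[2] = 0;
        }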

    On all Arm systems, Device memory accesses must always be aligned. User code
    in general does not get access to Device memory, so this does not affect regular users.

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Sun Nov 12 22:09:24 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.
    ...
    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.
    ...
    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    It works for the RISC-V C (compressed) extension. Some of these
    compressed instructions use registers 8-15 (others use all 32
    registers, but have other restrictions). But it works fine exactly
    because, if your register usage does not fit the limitations of the
    16-bit encoding, you just use the 32-bit version of the instruction.
    It seems that they designed the ABI such that registers 8-15 occur
    often in the code. Maybe the gcc maintainer also put some work into
    preferring these registers.

    OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
    instruction sets with their A32/T32 instruction set(s), designed their
    A64 instruction set to strictly use 32-bit instructions.

    So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
    32-bit instructions, why make your task harder by also implementing
    short instructions? Of course, if that is your goal or you have fun
    with this, why not? But if you want to make progress, it seems to be
    something that can be skipped.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Kent Dickey on Mon Nov 13 00:09:00 2023
    Kent Dickey wrote:

    In article <FlS3N.25739$_Oab.3565@fx15.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2023Nov11.112254@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    True, but that has been tried out and, in a world (like Linux) where
    software is developed on a platform that supports unaligned
    accesses, and then compiled by package maintainers (who often are
    not that familiar with the software) on a lot of platforms, the end
    result was that the kernel by default performed a fixup (and put a
    message in the dmesg buffer) instead of delivering a SIGBUS.

    Yup. The software I work on is meant, in itself, to work on platforms
    that enforce alignment, and it was a useful catcher for some kinds of bug.
    However, I'm now down to one that actually enforces it, in SPARC Solaris,
    and that isn't long for this world.

    I dug into what it would take to have x86-64 Linux work with alignment
    enforcement turned on, and it's a huge job.

    It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
    it only affects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    On Aarch64, with GCC at least, you also need to specify "-mstrict-align"
    when compiling all source code, to prevent the compiler from assuming it
    can access structure fields in an unaligned way, even if all of your
    code accesses are fully aligned. GCC can mess around behind your back, changing ptr->array32[1] = 0 and ptr->array32[2] = 0 into a single
    64-bit write of ptr->array32[1] = 0, among other things. If the offset
    of array32[1] wasn't 64-bit aligned, it's an alignment trap if
    SCTLR_EL1.A=1.

    On all Arm system, Device memory accesses must always be aligned. User code in general does not get access to Device memory, so this does not affect regular users.
    <
    For all the same reasons one does not do misaligned accesses to ATOMIC
    memory locations, one does not do misaligned accesses to device control registers.
    <
    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Nov 12 23:15:43 2023
    On Sun, 12 Nov 2023 21:25:20 +0000, MitchAlsup wrote:

    I am not buying this. Which takes more opcode space::
    a) an ISA with unaligned only LDs and STs (11)
    or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
    (another 11)

    That is true, *other things being equal*.

    However, what I had was:

    An ISA with unaligned loads and stores, that could use all 32 destination registers, and all 8 index and base registers. (Call this A)

    That took up too much opcode space to allow 16-bit instructions.

    So I made various compromises to shave one bit off the loads and stores,
    and then I could have 16 bit instructions. (Call this B)

    But I didn't like the compromises.

    So I made _more_ compromises, to shave _another_ bit off the loads and
    stores. This way, I had enough opcode space to add aligned-only loads
    and stores... that could use all 32 destination registers, and all 8
    index and base registers. (Call this C)

    Since other things _were not equal_, it was perfectly possible for C
    to use less opcode space than A, and about the same amount of opcode
    space as B. So I got to use 16-bit instructions AND have a set of loads
    and stores that used all 32 destination registers, and all 8 index and
    base registers.

    The compromises on the _unaligned_ loads and stores were painful, but
    they were chosen so that code using them wouldn't have to be
    significantly less efficient than code with the set of loads and stores
    in A.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Mon Nov 13 00:10:44 2023
    Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.
    ....
    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.
    ....
    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    It works for the RISC-V C (compressed) extension. Some of these
    compressed instrutions use registers 8-15 (others use all 32
    registers, but have other restrictions). But it works fine exactly
    because, if your register usage does not fit the limitations of the
    16-bit encoding, you just use the 32-bit version of the instruction.
    It seems that they designed the ABI such that registers 8-15 occur
    often in the code. Maybe the gcc maintainer also put some work into preferring these registers.

    OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
    instruction sets with their A32/T32 instruction set(s), designed their
    A64 instruction set to strictly use 32-bit instructions.

    So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
    32-bit instructions, why make your task harder by also implementing
    short instructions? Of course, if that is your goal or you have fun
    with this, why not? But if you want to make progress, it seems to be something that can be skipped.
    <
    Sound
    <
    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Nov 13 00:16:24 2023
    Quadibloc wrote:

    On Sun, 12 Nov 2023 21:25:20 +0000, MitchAlsup wrote:

    I am not buying this. Which takes more opcode space::
    a) an ISA with unaligned only LDs and STs (11)
    or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
    (another 11)

    That is true, *other things being equal*.

    However, what I had was:
    <
    A poorly chosen starting point (dark alley)
    <
    An ISA with unaligned loads and stores, that could use all 32 destination registers, and all 8 index and base registers. (Call this A)

    That took up too much opcode space to allow 16-bit instructions.

    So I made various compromises to shave one bit off the loads and stores,
    and then I could have 16 bit instructions. (Call this B)

    But I didn't like the compromises.
    <
    Captain Obvious to the rescue::
    <
    So I made _more_ compromises, to shave _another_ bit off the loads and stores. This way, I had enough opcode space to add aligned-only loads
    and stores... that could use all 32 destination registers, and all 8
    index and base registers. (Call this C)
    <
    Back out of the dark alley, and start from first principles again.
    <
    Since other things _were not equal_, it was perfectly possible for C
    to use less opcode space than A, and about the same amount of opcode
    space as B. So I got to use 16-bit instructions AND have a set of loads
    and stores that used all 32 destnation registers, and all 8 index and
    base registers.
    <
    Maybe "less opcode space" if you count bits, but it is "more opcode space" if/when you enumerate all the opcodes within the space.
    <
    The compromises on the _unaligned_ loads and stores were painful, but
    they were chosen so that code using them wouldn't have to be be
    significantly less efficient than code with the set of loads and stores
    in A.
    <
    Does your compiler agree with this assertion ??
    <
    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Nov 13 00:54:49 2023
    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    Does you compiler agree with this assertion ??

    As I'm still only in the early stages of roughing out
    the bare outlines of an ISA, I have not yet built such
    advanced diagnostic tools, I must admit.

    However, my original compromise had been to reduce
    the number of index registers used with memory-reference
    instructions to 3 from 7.

    The two improved compromises I used in this later effort
    were:

    Compromise 1:

    Reduce the number of base registers used with memory-reference
    instructions (when using a 16-bit displacement) to 3 from 7.

    I figured that _this_ was far less likely to reduce efficiency,
    since normally not that many base registers were used in any
    case.

    Compromise 2:

    When an instruction is not indexed, reduce the size of the index
    register field to two bits, both containing 0.

    When an instruction is indexed, reduce the size of the destination
    register field to 4 bits from 5, thus allowing only 16 of the 32
    registers to be used with indexed memory accesses.

    This one is more painful, but it had historical precedent. One
    consequence is that the number of index registers is reduced, to
    six from 7, because now index register 4 "looks like zero".

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Sun Nov 12 19:28:51 2023
    On 11/12/2023 3:37 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/10/2023 12:22 PM, MitchAlsup wrote:
    BGB wrote:

    One can argue that aligned-only allows for a cheaper L1 D$, but also
    "sucks pretty bad" for some tasks:
       Fast memcpy;
       LZ decompression;
       Huffman;
       ...
    <
    Time found that HW can solve the problem way more than adequately--
    obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
    <


    Wait, are you arguing for aligned-only memory ops here?...
    <
    I have not argued for aligned memory references since about 2000 (maybe as early as 1991).
    <

    Makes sense, but I was confused as to what was being argued here...


    I prefer unaligned memory access, since it allows a lot of nifty stuff
    to be done.

    But, I can note that the main drawback it has is in terms of requiring a
    more expensive L1 cache.

    Aligned-only cache only needs:
    A single row of cache-lines
    To check a single address for hit/miss;
    Can use a simpler set of MUX'es for extract/insert.

    Vs, say:
    Two rows of cache lines (say, even and odd);
    Needs to check two addresses;
    More complicated extract/insert logic.


    But, say, if one needs to operate within the limits of an aligned-only
    cache, then even something like an LZ4 decompressor is painfully slow,
    as it has to basically do damn near everything 1 byte at a time (or, at
    least, more so than it does already).
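
    For a rough idea of what aligned-only access forces on something like an
    LZ decompressor, here is an unaligned 32-bit load pieced together from
    two aligned word loads (little-endian assumed; note it can read up to 3
    bytes past the end of the source, which real code has to account for):

        #include <cstdint>

        static std::uint32_t load_u32_unaligned(const std::uint8_t *p)
        {
            std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
            const std::uint32_t *w =
                reinterpret_cast<const std::uint32_t *>(a & ~std::uintptr_t{3});
            unsigned shift = static_cast<unsigned>(a & 3) * 8;
            if (shift == 0)
                return w[0];                                  // already aligned
            return (w[0] >> shift) | (w[1] << (32 - shift));  // splice words
        }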


    I once did have a compressor (FeLZ32) more designed for the constraints
    of the SuperH ISA (and aligned-only memory access), but its main
    "feature" was that pretty much everything was defined in terms of 32-bit
    words (it was not copying bytes, rather, 32 bit words, and the encoded
    stream was itself an array of 32-bit words).

    It also managed to beat out LZ4's performance by a fair margin on the Piledriver I was using at the time.

    But, this performance advantage effectively evaporated on my Ryzen
    (where LZ4 speed increased significantly), and was also mostly N/A on
    BJX2. In this case, the byte-oriented formats were more preferable as
    they got better compression.

    Like, a lot of the performance tricks I had developed on the Piledriver
    were effectively rendered moot.

    Though, some amount of the tricks were mostly workarounds for "things
    that were slow", which the newer CPU had made effectively unnecessary or counterproductive.


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Nov 13 02:44:57 2023
    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    A poorly chosen starting point (dark alley)

    Back out of the dark alley, and start from first principles again.

    By the way, I think you mean a _blind_ alley.

    A dark alley is just a dangerous place, since robbers can attack you
    there without being seen.

    A _blind_ alley is one that had no exit, one that is a dead end. That
    seems to better fit the context of your remarks.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Nov 13 03:06:03 2023
    Quadibloc wrote:

    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    A poorly chosen starting point (dark alley)

    Back out of the dark alley, and start from first principles again.

    By the way, I think you mean a _blind_ alley.

    A dark alley is just a dangerous place, since robbers can attack you
    there without being seen.

    A _blind_ alley is one that has no exit, one that is a dead end. That
    seems to better fit the context of your remarks.
    <
    based on our definitions I definitively meant dark as in dangerous as
    opposed to no way out except backwards.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Sun Nov 12 20:21:13 2023
    On 11/12/2023 3:35 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/10/2023 8:51 AM, Scott Lurndal wrote:


    As for register arguments:
    * Probably 8 or 16.
    ** 8 makes the most sense with 32 GPRs.
    *** 16 is asking too much.
    *** 8 deals with around 98% of functions.
    ** 16 makes sense with 64 GPRs.
    *** Nearly all functions can use exclusively register arguments.
    *** Gain is small though, if it only benefits 2% of functions.
    *** It is almost a "shoe in", except for cost of fixed spill space
    *** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
    *** Though, an ABI could decide to not have a spill space in this way.
    <
    For the reasons stated above (some clipped) I agree with this whole
    block of statements.
    <
    Since My 66000 has 32 registers, I went with upto 8 arguments in registers, upto 8 results in registers, with the 9th of either on-the-stack in such a way that if the callee is vararg the argument registers can be pushed on
    the
    stack to form a memory resident vector of arguments {{just perfect for printf().}}
    <
    With 8 registers covering 98%-ile of calls, there is too little left by making this boundary 12-16 both of which ARE still possible.
    <

    Yeah.

    Short of things like using 128-bit pointers, or lots of 128-bit
    arguments (with an ABI that expresses these in pairs), the 8 argument
    ABI seems to be slightly ahead here (even with 64 registers).


    Mostly, because 2% of functions needing to use memory arguments seems to
    cost less than the indirect cost of every other non-leaf function
    needing to reserve an extra 64 bytes in the stack frame.

    Had considered a possible ABI tweak where functions that only call other
    functions with fewer than 8 register arguments (likely excluding vararg)
    would only need to reserve space for the first 8 arguments.

    But, the gains are likely to be rather small compared to the added
    debugging effort.


    Though, admittedly, for a lot of my programs I had still ended up
    going with 8 register arguments with 64 GPRs, mostly as the gains of
    16 arguments is small, relative of the cost of spending an additional
    64 bytes in nearly every stack frame (and also there are still some
    unresolved bugs when using 16 argument mode).
    <
    It is a delicate balance and it is easy to make the code look better
    while actually running slower.
    <

    Yeah.

    I suspect it is likely due mostly to something like L1 cache misses or
    similar (bigger stack frame, more area for the L1 cache to miss).


    OTOH: Had recently added the logic to shuffle prolog register-stores in
    an attempt to reduce WAW stalls. Turned out, fully aligning stuff would
    be a much bigger pain than initially hoped (the curse of multiple cases
    of duplicated logic that needs to operate in lockstep).


    Did come up with an intermediate option (a rough sketch follows below):
      Generate a temporary array of which registers are saved at which offsets;
      Generate a permutation array for which order to store these registers;
      Initial permutation uses simple XOR shuffling;
      Have a function to model the WAW cost of each permutation;
      Shuffle the permutations with a PRNG (up to N times);
      Pick the permutation with the smallest WAW cost.

    Mostly works OK, but granted, nearly any ordering is better at this
    metric than saving them in a linear order.

    Though, doesn't really gain much if the forwarding option is enabled.
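
    A very loose sketch of the shuffle-and-score idea listed above (the cost
    model, the XOR seed, and the names are stand-ins rather than BGBCC's
    actual logic; n is assumed even and at most 64):

        #include <cstdlib>

        // Score an ordering: penalize back-to-back stores to adjacent save
        // slots, as a crude stand-in for the real WAW cost model.
        static int waw_cost(const int *order, int n)
        {
            int cost = 0;
            for (int i = 1; i < n; i++) {
                int d = order[i] - order[i - 1];
                if (d == 1 || d == -1)
                    cost++;
            }
            return cost;
        }

        // Keep the best ordering found after 'tries' random pair swaps.
        static void pick_store_order(int *best, int n, int tries)
        {
            int cur[64];
            for (int i = 0; i < n; i++)
                best[i] = cur[i] = i ^ 1;   // simple XOR-style initial shuffle
            int best_cost = waw_cost(best, n);

            for (int t = 0; t < tries; t++) {
                int a = std::rand() % n, b = std::rand() % n;
                int tmp = cur[a]; cur[a] = cur[b]; cur[b] = tmp;
                int c = waw_cost(cur, n);
                if (c < best_cost) {
                    best_cost = c;
                    for (int i = 0; i < n; i++) best[i] = cur[i];
                }
            }
        }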



    Relatedly, was also able to make Doom a little faster with another trick: Instead of drawing into an off-screen buffer, and then copying this to
    the screen in the form of a DIB Bitmap object...

    There can be functions to request and release framebuffers for a given Drawing-Context (with a supplied BITMAPINFOHEADER; this request failing
    and returning NULL if the BITMAPINFOHEADER doesn't match the format used
    by the HDC or similar; forcing fallback to the older method).

    Similarly, there is a "SwapBuffers" style call, with these buffers
    effectively operating in a double-buffering style.

    In effect, it is an interface slightly more like what SDL uses.


    Was kind of a hassle to modify Doom to play well with double buffering
    though; initially it was a strobe-filled / flickering mess, with the
    status bar effectively having a seizure. Does still have the annoyance
    that when one noclips through a wall, then whatever garbage is left over
    is now prone to a strobe effect.

    However, using shared buffers and then having Doom draw into them, does
    reduce the amount of framebuffer copying needed for each screen update.


    As-is, will currently only work, though, in 320x200 hi-color mode (where
    biHeight==-200; a negative height indicates an origin in the
    top-left corner).


    However, the DIB drawing method does allow more flexibility here (the
    internal bitmap can be in a wider range of formats, and will be
    converted as needed).

    Granted, one can note that things like pixel format conversion and
    similar aren't free.



    Also recently encountered a video online where someone was running Doom
    on a 386, and, the framerates *sucked*... ( Like, mostly single-digit territory, and with somewhat longer load-times as well. )

    Can at least probably say, with reasonable confidence, that my BJX2 core
    is faster than a 386...

    Some other information implies that the speeds I am seeing are more
    on-par with a high-end 486 or maybe a low-end Pentium.

    ( Nevermind that Quake performance is still crap in my case... )

    ( Somehow, it seems like old computers were generally worse and less
    capable than my childhood self remembered. )




    Formats supported in DIB form at present:
    RGB555, RGB24, RGBA32, Indexed 1/2/4/8-bit, UTX2.

    Formats used by the display hardware:
    Color-Cell 8x8 as 4x 4x4x2bpp (2 endpoints per 4x4 cell);
    Color-Cell 8x8x1 (2 color endpoints).
    Also used for text-mode display.
    4x4x16bit RGB555
    4x4x8bit Indexed
    (New/Experimental) Linear RGB555 and Indexed 8-bit
    Framebuffer pixels now in a conventional linear raster ordering.
    Also, the framebuffer is now movable, allowing double-buffering.
    Framebuffer will require a 32 byte alignment though.
    And needs to be in a physically-mapped address range.


    Still don't have any "good" 256 color palettes:
    6*6*6 and 6*7*6 (216 and 252 color)
    Good for bright cartoony graphics, poor for much else.
    Generally loses any detail in things like shading.
    6*7*6 can't do grays effectively, only purple and green tints.
    16 shades of 16 colors
    Better "in general", obvious color distortion for cartoon images
    13 shades of 19 colors (*1)
    Slightly better than the previous
    Mostly cutting off "near black" for additional colors.
    Say: adding an Orange, Olive-Green, and Sky-Blue gradient.
    Don't need 48 colors of "almost black"...

    I don't know of any palette optimization algorithms that are fast enough
    to run in real-time on the BJX2 core (I suspect "in the old days",
    palette optimization was likely offline only).

    Granted, other palettes are possible, mostly just the difficulty of
    finding an organization that "looks good in the general case".

    *1:
    0z: Gray
    1z: Blue (High Sat)
    2z: Green (High Sat)
    3z: Cyan (High Sat)
    4z: Red (High Sat)
    5z: Magenta (High Sat)
    6z: Yellow (High Sat)
    7z: Pink (Off-White)
    8z: Beige (Off-White)
    9z: Blue (Low Sat)
    Az: Green (Low Sat)
    Bz: Cyan (Low Sat)
    Cz: Red (Low Sat)
    Dz: Magenta (Low Sat)
    Ez: Yellow (Low Sat)
    Fz: Sky Blue (Off-White)

    z0: Orange (Mid Sat)
    z1: Olive (Mid Sat)
    z2: Sky Blue (Mid Sat)

    00: Black
    01, 02: Very dark gray.
    10/11/12/20/21/22: Various other "nearly black" colors.
    Technically, the bottoms of the orange/olive/sky bars;
    But, these can effectively "merge" the other colors.

    In my fiddling, this was generally the "best performing" palette layout
    I could seem to find thus far.


    ....



    Current leaning is also that:
       32-bit primary instruction size;
       32/64/96 bit for variable-length instructions;
       Is "pretty good".

    In performance-oriented use cases, 16-bit encodings "aren't really
    worth it".
    In cases where you need a 32 or 64 bit value, being able to encode
    them or load them quickly into a register is ideal. Spending multiple
    instructions to glue a value together isn't ideal, nor is needing to
    load it from memory (this particularly sucks from the compiler POV).


    As for addressing modes:
       (Rb, Disp) : ~ 66-75%
       (Rb, Ri)   : ~ 25-33%
    Can address the vast majority of cases.

    Displacements are most effective when scaled by the size of the
    element type, as unaligned displacements are exceedingly rare. The
    vast majority of displacements are also positive.

    Not having a register-indexed mode is shooting oneself in the foot, as
    these are "not exactly rare".

    Most other possible addressing modes can be mostly ignored.
       Auto-increment becomes moot if one has superscalar or VLIW;
       (Rb, Ri, Disp) is only really applicable in niche cases
         Eg, array inside struct, etc.
       ...



    RISC-V did sort of shoot itself in the foot in several of these areas,
    albeit with some workarounds in "Bitmanip":
       SHnADD, can mimic a LEA, allowing array access in fewer ops.
       PACK, allows an inline 64-bit constant load in 5 instructions...
         LUI+ADD+LUI+ADD+PACK
       ...

    Still not ideal...

    An extra cycle for memory access is not ideal for a close second place
    addressing mode; nor are 64-bit constants rare enough that one
    necessarily wants to spend 5 or so clock cycles on them.

    But, still better than the situation where one does not have these
    instructions.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Mon Nov 13 16:10:20 2023
    MitchAlsup wrote:
    Chris M. Thomasson wrote:


    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.
    <
    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned container. Only aligned containers possess ATOMIC-smelling properties.

    This is so obviously correct that you should not have needed to mention
    it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
    updates is something that should only ever be done for testing purposes.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Mon Nov 13 14:44:15 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <PQ74N.100$ayBd.39@fx07.iad>, scott@slp53.sl.home (Scott
    Lurndal) wrote:

    jgd@cix.co.uk (John Dallman) writes:
    In article <FlS3N.25739$_Oab.3565@fx15.iad>, scott@slp53.sl.home
    (Scott Lurndal) wrote:
    jgd@cix.co.uk (John Dallman) writes:
    It might be easier with AArch64. Just set the A bit (bit 1) in
    SCTLR_EL1; it only effects code executing in usermode.

    There may even already be some ELF flag that will set it when the
    file is exec(2)'d.

    I'll take a look, but I doubt glibc on Aarch64 is built to be run
    with alignment trapping. Should it be EL0 for usermode?

    The EL1 in the register name describes the minimum exception level
    allowed to access the register. SCTLR_EL1 includes control bits
    for both EL1 and EL0.

    Aha. It's harder for ARM64: I'd have to be in supervisor mode to set that
    bit, and the stuff I work on is strictly application code.

    Unless the ELF flag trick is implemented. I haven't looked at the kernel
    with respect to that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Nov 13 11:46:47 2023
    That took up too much opcode space to allow 16-bit instructions.

    You might want to try and get fancy in your short instructions by
    "randomizing" the subset of registers they can access.

    E.g. allow both your short LD and ST instruction access 16 registers
    but not exactly the same 16.
    Or allow your arithmetic instructions to access only 8 registers for their input and output args but not exactly the same 8 for the two inputs
    and/or for the output.

    I suspect that if done well, it could give benefits similar to the skewed-associative caches. The other upside is that it makes register allocation *really* interesting, thus opening up opportunities to
    spend a few more years working on that subproblem :-)

    To up the ante, you could make the set of registers reachable from each instruction depend not just on the opcode but also on the instruction's address, so you can sometimes avoid a spill by swapping two
    instructions. This would allow the register allocation to interact in
    even more interesting ways with instruction scheduling.
    There could be a few more PhDs worth of research there.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Mon Nov 13 14:12:16 2023
    On 11/12/2023 4:09 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
    Errm, splitting up registers like this is likely to hurt far more than
    anything that 16-bit displacements are likely to gain.
    ...
    Much preferable for a compiler to have a flat space of 32 or 64
    registers. Having 16 sorta works, but does still add a bit to spill and
    fill.
    ...
    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    It works for the RISC-V C (compressed) extension. Some of these
    compressed instructions use registers 8-15 (others use all 32
    registers, but have other restrictions). But it works fine exactly
    because, if your register usage does not fit the limitations of the
    16-bit encoding, you just use the 32-bit version of the instruction.
    It seems that they designed the ABI such that registers 8-15 occur
    often in the code. Maybe the gcc maintainer also put some work into preferring these registers.


    Yeah. They can be used by a compiler, and can make a difference for code-density.

    Just, it is more a case of, if one has a tradeoff of:
    Fewer instructions but more bytes;
    More instructions but fewer bytes.
    Then the former is better for performance.

    Things like reusing registers more aggressively and using a smaller
    subset of the registers, are good for making 16-bit instructions usable,
    but are less good for performance.

    ...



    Though, granted, one doesn't want to try to reserve too many registers
    (on an ISA with plenty of registers), as one may find that
    saving/restoring them costs more than is gained by having them
    available for use.

    Though, the partial workaround for this (in my case) was dividing the
    registers up into sub-groups, and using heuristics to enable these
    groups based on an estimate of the register pressure.

    Say:
    R8 ..R14: Always available, prioritized for size optimization ("/Os");
    R24..R31: Enabled as needed for "/Os", always enabled for perf opt.
    R40..R47: Enabled with high register pressure.
    R56..R63: Enabled with very high register pressure.

    Note:
    BGBCC's command-line accepts both "/Os" and "-Os" style arguments.
    "/Os": Size optimize
    "/O1": Moderate speed (try to balance speed and size)
    "/O2": Prioritize speed.
    "/Z*": Mostly debug related options (like "-g" in GCC)
    "/f*": Optional feature flags.
    "/m*": Selects target arch/profile.
    "/Fe*": Specify output binary (like "-o" in GCC)
    Else, it will try to guess an output file name.
    Eg: "foo.c" -> "foo.exe"
    ...

    It does try to guess whether the '/' is part of an option or the start
    of a filename. If it sees more than one '/', or sees a '.' or similar,
    without encountering an '=', assume it is a filename.
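
    (A small C sketch of that guess, as described; the function name and exact
    rules are invented and may differ from what BGBCC actually does:)

        #include <stdio.h>

        /* Heuristic: an argument starting with '/' is a filename if it contains
           another '/' or a '.' before any '=' is seen; otherwise it is an option. */
        static int arg_is_filename(const char *s)
        {
            int slashes = 0;
            for (const char *p = s; *p; p++)
            {
                if (*p == '=')   return 0;   /* '=' seen first: treat as an option     */
                if (*p == '.')   return 1;   /* '.' with no '=' yet: looks like a file */
                if (*p == '/' && ++slashes > 1)
                    return 1;                /* more than one '/': looks like a path   */
            }
            return 0;
        }

        int main(void)
        {
            printf("%d %d %d\n",             /* expect: 0 1 0 */
                arg_is_filename("/Os"),
                arg_is_filename("/usr/src/foo.c"),
                arg_is_filename("/Fe=foo.exe"));
            return 0;
        }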


    It is almost, but not quite, based on a count of the in-use variables.

    It helps to also apply a scale factor for each variable based on how
    deeply nested in a loop it is (so that if one has a lot of variables in
    use inside a deeply nested loop, the register pressure estimate will be
    higher than if most are used outside of a loop).

    Though, this scale-factor is nowhere near as severe as with the register allocation priority (where the nesting level was effectively raised to
    an exponent). For pressure estimates, one can use a gentler scale, more
    like, say: "scale=sqrt(deepest_nest_level+1.0);".


    For dynamically allocated variables in leaf blocks (basic block does not contain a function call), it may make sense to allocate them in scratch registers.

    Scratch registers are similar:
    R0..R1: Not used as GPRs by compiler;
    R2..R3: Designated scratch, not used for reg alloc.
    R4..R7: Always available;
    R16..R17: Designated scratch, not used for reg alloc.
    R18..R23: Available when R24..R31 are enabled (always for perf opt);
    R32..R39, R48..R55: Available under high register pressure.
    Always available, if these registers exist, when optimizing for performance.


    In performance optimized code, in my case, the spread of the registers
    is generally too dispersed to really make any sort of small sub-setting particularly effective.


    OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
    instruction sets with their A32/T32 instruction set(s), designed their
    A64 instruction set to strictly use 32-bit instructions.


    I guess it can also be noted, that 64-bit ARM went all-in with a lot of
    the sorts of features that RISC-V avoided. For example, it still has
    some more complex addressing modes, etc.

    I guess also they approached constants a little differently:
    You can load a 16-bit value into 1 of 4 positions within a register,
    with one of: zero fill, one fill, or keeping the prior contents.

    This allows loading an arbitrary constant in between 1 and 4 instructions.
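
    (A C model of that scheme; this is the MOVZ/MOVK style of constant build,
    and the instruction count below is a simplification that ignores the
    one-fill, MOVN-style path:)

        #include <stdint.h>
        #include <stdio.h>

        static uint64_t movz(unsigned imm16, int pos)                /* 16-bit chunk, zero fill */
        { return (uint64_t)imm16 << (16 * pos); }

        static uint64_t movk(uint64_t old, unsigned imm16, int pos)  /* 16-bit chunk, keep rest */
        { return (old & ~((uint64_t)0xFFFF << (16 * pos))) | ((uint64_t)imm16 << (16 * pos)); }

        static int count_insns(uint64_t v)   /* nonzero 16-bit chunks, minimum of 1 */
        {
            int n = 0;
            for (int pos = 0; pos < 4; pos++)
                if ((v >> (16 * pos)) & 0xFFFF)
                    n++;
            return n ? n : 1;
        }

        int main(void)
        {
            uint64_t x = movz(0xDEF0, 0);          /* between 1 and 4 instructions total */
            x = movk(x, 0x9ABC, 1);
            x = movk(x, 0x5678, 2);
            x = movk(x, 0x1234, 3);
            printf("%016llx takes %d insns\n", (unsigned long long)x, count_insns(x));
            return 0;
        }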



    Though, I did realize that with RISC-V's Bitmanip extensions, it is
    possible to get a 64-bit constant load down to 5 instructions, which is
    better than RV64I needing 6 (and in both cases, needing 2 registers).


    In BJX2, with Jumbo, it is 3 instruction words and 1 clock cycle.
    Without Jumbo, it is 4 instructions (albeit less flexible than the
    mechanism in ARM).


    So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
    32-bit instructions, why make your task harder by also implementing
    short instructions? Of course, if that is your goal or you have fun
    with this, why not? But if you want to make progress, it seems to be something that can be skipped.


    In my case, I am left with an awkward split in my ISA:
    Baseline Mode, which has both 16 and 32-bit instructions (and bigger);
    XG2, which is 32-bit (and bigger).


    Some of my newer design variants had leaned towards 32-bit instructions
    and 64 registers, mostly because the higher register count does help
    performance (at least, performance per clock; not so sure it helps with
    LUTs or timing constraints though, *).

    *: Mostly because the 5-bit LUTRAMs work with 3 bits of data, but the
    6-bit LUTRAMs only have 2 bits of data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris M. Thomasson@21:1/5 to Quadibloc on Mon Nov 13 13:58:06 2023
    On 11/12/2023 6:44 PM, Quadibloc wrote:
    On Mon, 13 Nov 2023 00:16:24 +0000, MitchAlsup wrote:

    A poorly chosen starting point (dark alley)

    Back out of the dark alley, and start from first principles again.

    By the way, I think you mean a _blind_ alley.

    A dark alley is just a dangerous place, since robbers can attack you
    there without being seen.

    Expose the darkness to the light, before any adventures...? ;^)


    A _blind_ alley is one that had no exit, one that is a dead end. That
    seems to better fit the context of your remarks.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Stefan Monnier on Tue Nov 14 14:54:32 2023
    On Mon, 13 Nov 2023 11:46:47 -0500, Stefan Monnier wrote:

    You might want to try and get fancy in your short instructions by "randomizing" the subset of registers they can access.

    E.g. allow both your short LD and ST instruction access 16 registers but
    not exactly the same 16.
    Or allow your arithmetic instructions to access only 8 registers for
    their input and output args but not exactly the same 8 for the two
    inputs and/or for the output.

    I suspect that if done well, it could give benefits similar to the skewed-associative caches. The other upside is that it makes register allocation *really* interesting, thus opening up opportunities to spend
    a few more years working on that subproblem :-)

    I would like to be able to say that this idea was too bizarre even for
    me.

    However, one of the ideas I toyed with before settling on my current
    iteration of Concertina II was to

    - drop the aligned memory-reference instructions
    - somehow squeeze the 32-bit operate instructions into the space left
    over by the byte instructions in the family
    - thereby doubling the space available for 16-bit instructions.

    The instruction slots of the form 0-0- would be as before: two instructions where both source and destination are in the same group of eight registers.

    The instruction slots of the form 0-1- would contain two 16-bit
    instructions where the source and destination registers are each
    four bits long, allowing (as in the indexed memory-reference
    instructions) the use of the first four registers in each of the four
    groups of eight registers.

    Thus, one instruction type uses all the registers, and the other
    allows transfers between the banks of eight registers.

    So, sadly, I actually *did* contemplate going there. Fortunately, I
    thought better of it.

    To up the ante, you could make the set of registers reachable from each instruction depend not just on the opcode but also on the instruction's address, so you can sometimes avoid a spill by swapping two
    instructions. This would allow the register allocation to interact in
    even more interesting ways with instruction scheduling.
    There could be a few more PhDs worth of research there.

    That would definitely be one trick to allow access to more registers than
    the number of opcode bits allows.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Wed Nov 15 10:38:56 2023
    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix. Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    (prefix versus register to register move) for one where you need to use
    it, but this idea might, by allowing the prefix to specify multiple instructions, save more than one extra "instruction". The only downside
    is it requires an additional op code.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Wed Nov 15 19:02:00 2023
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.
    <
    < Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    <
    Which is why I use the term instruction-modifier.
    <
    (prefix versus register to register move) for one where you need to use
    it, but this idea might, by allowing the prefix to specify multiple instructions, save more than one extra "instruction". The only downside
    is it requires an additional op code.
    <
    But by having an instruction-modifier that can add bits to several
    succeeding instructions, you can avoid cluttering up ISA with things
    like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode
    enumeration space not consume it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Wed Nov 15 11:58:25 2023
    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix
    instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all shortened register specifiers.

    I am not sure what you are proposing here. Can you show an example?




    < <                                         Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    <
    Which is why I use the term instruction-modifier.

    Agreed.


    <
    (prefix versus register to register move) for one where you need to
    use it, but this idea might, by allowing the prefix to specify
    multiple instructions, save more than one extra "instruction".  The
    only downside is it requires an additional op code.
    <
    But by having an instruction-modifier that can add bits to several
    succeeding instructions, you can avoid cluttering up ISA with things
    like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode enumeration space not consume it.

    In the general case, I certainly agree. But here you need a different
    op-code than CARRY, as this has different semantics, and I think the new instruction modifier has no other use, hence it is an additional op code
    versus the original proposal of using essentially a register copy
    instruction, which already exists (i.e. a load with a zero displacement
    and the source register as the address modifier).




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Wed Nov 15 21:10:52 2023
    Stephen Fuld wrote:

    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the instructions.
    So an alternative is to break the requirement that all register
    specifier fields in the instruction be the same length.  So, for
    example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like >>>> CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the prefix
    instruction specify which register to use instead of the one specified
    in the reduced register specifier for whichever instructions in its
    shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.

    I am not sure what you are proposing here. Can you show an example?

    Let us postulate a MoreBits instruction-modifier with a 16-bit immediate field. Now each 16-bit instruction, which has access to only 8 registers,
    strips off 2-bits/specifier, so now all its register specifiers are 5-bits.
    The immediate supplies the bits and as bits are stripped off the Decoder
    shifts the field down by the consumed bits. When the last bit has been
    stripped off you would need another MB immediate to supply those bits. Since
    only 16-bit instructions are "limited" one MB should last about a basic
    block or extended basic block.

    Note I don't care how the bits are apportioned, formatted, consumed, ...
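
    (A small C sketch of that decode, with an invented apportionment: each 3-bit
    specifier takes its 2 missing high bits from the low end of the pool, and the
    pool shifts down as bits are consumed:)

        #include <stdio.h>

        struct morebits { unsigned pool; int bits_left; };  /* 16-bit immediate acting as a bit pool */

        static int widen_specifier(struct morebits *mb, unsigned spec3)
        {
            unsigned hi2 = mb->pool & 3;          /* take 2 bits from the pool...           */
            mb->pool >>= 2;                       /* ...and shift the remainder down        */
            mb->bits_left -= 2;                   /* at 0, another MoreBits would be needed */
            return (int)((hi2 << 3) | spec3);     /* full 5-bit register number             */
        }

        int main(void)
        {
            struct morebits mb = { 0xB1E6, 16 };  /* pool contents are arbitrary here          */
            int rd = widen_specifier(&mb, 5);     /* e.g. a 16-bit op with specifiers 5 and 2  */
            int rs = widen_specifier(&mb, 2);
            printf("rd=r%d rs=r%d, %d pool bits left\n", rd, rs, mb.bits_left);
            return 0;
        }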

    <
    <                                         Worst case, this is the same as
    my original proposal - one extra, not really executed, instruction
    <
    Which is why I use the term instruction-modifier.

    Agreed.


    <
    (prefix versus register to register move) for one where you need to
    use it, but this idea might, by allowing the prefix to specify
    multiple instructions, save more than one extra "instruction".  The
    only downside is it requires an additional op code.
    <
    But by having an instruction-modifier that can add bits to several
    succeeding instructions, you can avoid cluttering up ISA with things
    like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode
    enumeration space not consume it.

    In the general case, I certainly agree. But here you need a different op-code than CARRY, as this has different semantics, and I think the new instruction modifier has no other use, hence it is an additional op code versus the original proposal of using essentially a register copy instruction, which already exists (i.e. a load with a zero displacement
    and the source register as the address modifier).

    CARRY is your access to ALL extended precision calculations (saving 20+
    OpCodes when you consider a robust commercial ISA rather than an Academic
    ISA.) Carry accesses integer arithmetic, shifts, extracts, inserts, and
    exact floating point calculations larger than 64-bits including Kahan-Babuška summation. {{Not bad for 1 OpCode !!}}

    Similarly:: VEC-LOOP provides access to 1,000+ SIMD instructions and 400+
    Vector instructions at the cost of 2 units in the OpCode Space !! It also allows a future implementation to execute wider (or narrower) than SIMD
    with no change in the instruction sequence.

    MoreBits is effectively just like REX except it can span instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Mon Nov 20 09:31:11 2023
    On 11/15/2023 1:10 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the
    instructions. So an alternative is to break the requirement that
    all register specifier fields in the instruction be the same
    length.  So, for example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction like >>>>> CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the
    prefix instruction specify which register to use instead of the one
    specified in the reduced register specifier for whichever
    instructions in its shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.

    I am not sure what you are proposing here.  Can you show an example?

    Let us postulate an MoreBits instruction-modifier with a 16-bit immediate field. Now each 16-bit instruction, that has access to only 8 registers, strips off 2-bits/specifier, so now all its register specifiers are 5-bits. The immediate supplies the bits and as bits are stripped off the Decoder shifts the field down by the consumed bits. When the last bit has been stripped off you would need another MB im to supply those bits. Since
    only 16-bit instructions are "limited" one MB should last about a basic
    block or extended basic block.

    Note I don't care how the bits are apportioned, formatted, consumed, ...

    Oh, so you have changed the meaning of the "immediate bit map" from
    specifying which of the following instructions it applies to (e.g.
    CARRY) to the actual data. I like it!

    If using 16 bit instructions, and if you only have one small register
    field per instruction, I think it is better to make "MoreBits" a 16 bit instruction modifier itself, with say a five bit op code and an eleven
    bit immediate, which supplies the extra bit for the next 11
    instructions. More compact than a 32 bit instruction, and almost as
    "far reaching". If you need more than 11 bits, even if you add a second
    MB instruction modifier 11 instructions later, you are still no worse
    off than an instruction modifier plus a 16 bit immediate.

    Of course, if you need more than one extra bit per instruction, then
    more "drastic" measures, such as your proposal, are needed.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Stephen Fuld on Mon Nov 20 17:51:46 2023
    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/15/2023 11:02 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/11/2023 10:11 AM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 11/10/2023 10:24 AM, BGB wrote:



    Much better to have a big flat register space.

    Yes, but sometimes you just need "another bit" in the
    instructions. So an alternative is to break the requirement that >>>>>>> all register specifier fields in the instruction be the same
    length.  So, for example, allow
    <
    Another way to get a few more bits is to use a prefix-instruction
    like
    CARRY for those seldom needed bits.

    Good point. A combination of the two ideas could be to have the
    prefix instruction specify which register to use instead of the one
    specified in the reduced register specifier for whichever
    instructions in its shadow have the bit set in the prefix.
    <
    You could have the prefix instruction supply the missing bits of all
    shortened register specifiers.

    I am not sure what you are proposing here.  Can you show an example?

    Let us postulate an MoreBits instruction-modifier with a 16-bit immediate
    field. Now each 16-bit instruction, that has access to only 8 registers,
    strips off 2-bits/specifier, so now all its register specifiers are
    5-bits.
    The immediate supplies the bits and as bits are stripped off the Decoder
    shifts the field down by the consumed bits. When the last bit has been
    stripped off you would need another MB im to supply those bits. Since
    only 16-bit instructions are "limited" one MB should last about a
    basic block or extended basic block.

    Note I don't care how the bits are apportioned, formatted, consumed, ...

    Oh, so you have changed the meaning of the "immediate bit map" from specifying which of the following instructions it applies to (e.g.
    CARRY) to the actual data.  I like it!

    If using 16 bit instructions, and if you only have one small register
    field per instruction, I think it is better to make "MoreBits" a 16 bit instruction modifier itself, with say a five bit op code and an eleven
    bit immediate, which supplies the extra bit for the next 11
    instructions.  More compact than a 32 bit instruction, and almost as
    "far reaching".  If you need more than 11 bits, even if you add a second
    MB instruction modifier 11 instructions later, you are still no worse
    off than an instruction modifier plus a 16 bit immediate.

    Of course, if you need more than one extra bit per instruction, then
    more "drastic" measures, such as your proposal, are needed.




    Ironically, this is closer to how 32-bit ops were originally intended to
    work in BJX2, and how they worked in BJX1 (where most of the 32-bit ops
    were basically prefixes on the existing 16-bit SuperH ops).

    Say:
    ZnmZ //typical layout of a 16-bit op, R0..R15
    8Ceo-ZnmZ //Op gains an extra register field, and R16..R31.

    Then, in the original form of BJX2:
    ZZnm
    F0eo-ZZnm

    For some ops, the 3rd register (Ro) would instead operate as a 5-bit immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.



    When I later added the Imm9 encodings, the encoding of the other ops was changed to be more consistent with this:
    F0nm-ZeoZ
    F2nm-Zeii

    This was originally designed as a possible successor ISA, but it seemed "better" to back-fold it into my existing ISA (effectively replacing the original encoding scheme in the process).

    This encoding was relatively stable, until Jumbo prefixes were added and
    shook things up a little more (and the more recent shakeup with XG2,
    which has effectively fragmented the ISA into two sub-variants with
    neither being a "clear winner", *).

    *: The previous Baseline encoding is better for code density (due to
    still having 16-bit ops), XG2 is better for performance (due to more orthogonality, such as the ability to use every register from every instruction, and adding a bit to the Immed/Displacement fields, or 3 in
    the case of plain branches).



    Had considered possible options for "Make XG2's encoding less dog
    chewed", but the issue is not so simple as simply shifting the bits
    around (shuffling the bits would just make it dog-chewed in other ways).


    So, existing encoding, expressed in bits, is roughly:
    NMOP-ZwZZ-nnnn-mmmm ZZZZ-Qnmo-oooo-ZZZZ

    And the possible revised form:
    PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ


    However, what I have thus far would effectively amount to nearly a full
    reboot of the encoding (which would be a huge pile of effort), so less
    likely to be "worth it" in the name of a slightly less chewed encoding
    scheme (and, hell, RISC-V is going along OK with its immediate fields
    being effectively confetti).

    Though, another option could be closer to a straight reshuffle:
    NMOP-ZwZZ-nnnn-mmmm YYYY-Qnmo-oooo-XXXX
    NMIP-ZwZZ-nnnn-mmmm YYYY-Qnmi-iiii-iiii
    To:
    PwZZ-ZQnn-nnnn-YYYY-mmmm-mmoo-oooo-XXXX
    PwZZ-ZQnn-nnnn-YYYY-mmmm-mmii-iiii-iiii

    So, the existing ISA listing could be mapped over mostly as-is, with the
    main changes (besides the bit-reshuffle) being in the immediate field.

    However:
    DDDP-0w00-nnnn-mmmm 1100-dddd-dddd-dddd
    To:
    Pw00-0ddd-dddd-YYYY-dddd-dddd-dddd-dddd

    Is gonna need some new relocs, ...

    OTOH, it would allow making the F8 block's encoding consistent with the
    rest of the ISA.



    But, recently I am left feeling uncertain if any of this is anything
    more than moot...

    Did recently make a little bit of progress towards having a GUI in
    TestKern, in that I now have a console window with a shell "sorta" able
    to run inside this console.

    Has partly opened the "Pandora's box" though of needing to deal
    with multitasking, re-entrance, and the possible need to use
    mutex locking (as-is, it was "barely working" in that I had to carefully
    avoid re-entrance in a few areas to keep the kernel from exploding; as
    none of this stuff has mutexes).

    Well, and then having to fix-up issues like making the scheduler not try
    to schedule the syscall-handler task and then promptly causing the "OS"
    to explode (for now, these are special cased; I may need to come up with
    a general way of flagging some tasks as "do not schedule", since they
    will exist as special-cases to handle syscalls or specifically as the
    target of inter-process VTable calls, as is the case with TKGDI, where
    the call itself will schedule the task). Where, in this case, the
    mechanism for inter-task control flow will take a form resembling that
    of COM objects (it is likely that TKRA-GL may need to be reworked into
    this form as well, *2).


    Also looking like I will need to rework how the shell works.
    Effectively, now, rather than the CLI running directly in the kernel, it
    needs to be a userland (or "superuserland", *) task communicating with
    the kernel via syscalls. So, the shell can no longer directly invoke the PE/COFF loader, but will now need to use a "CreateProcess" call (and
    then probably sleep-loop until the created process terminates).
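
    (Roughly the shape that gives the shell, sketched in C; all the tk_* names
    and signatures here are made up, since the actual TestKern calls are not
    shown:)

        #include <stdio.h>

        extern int  tk_create_process(const char *path, char **argv);  /* hypothetical, returns pid */
        extern int  tk_process_alive(int pid);                         /* hypothetical */
        extern void tk_sleep_ms(int ms);                               /* hypothetical */

        /* Shell "run a program" path: ask the kernel to create the process,
           then sleep-loop until it terminates. */
        static int shell_run(const char *path, char **argv)
        {
            int pid = tk_create_process(path, argv);
            if (pid < 0)
            {
                printf("shell: failed to start %s\n", path);
                return -1;
            }
            while (tk_process_alive(pid))
                tk_sleep_ms(10);
            return 0;
        }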

    *: Where a task is being run more like a userland task, but still
    running in supervisor mode (the syscall handler task and TKGDI backend
    running in this mode).

    Where, say:
    Thread: Logical thread of execution within some existing process;
    Process: Distinct collection of 1 or more threads within a shared
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other thread-like entities (such as call and method handlers), may be either thread-like
    or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the only
    valid exit point for this task being where it transfers control back to
    the caller and awaits the next syscall to arrive; and it is not valid
    for this task to try to syscall back into itself).


    As-is, I am running a lot of tasks in userland, but for now there is effectively no real memory protection in TestKern; the plan is to
    try to resolve this. This is itself work; needing to gradually weed-out programs accessing privileged resources; and in some system-level APIs
    needing to distinguish "Local" from "Global" memory ("malloc" will give
    local memory, whereas "tkgGlobalAlloc" will give global memory; the idea
    being for now that global memory will be identity mapped and accessible
    across process boundaries).
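
    (A sketch of what that split looks like to calling code; tkgGlobalAlloc is
    the name used above, while everything else here, including the free and send
    calls, is invented:)

        #include <stdlib.h>
        #include <string.h>

        extern void *tkgGlobalAlloc(size_t sz);   /* identity-mapped, crosses process boundaries */
        extern void  tkgGlobalFree(void *p);      /* assumed counterpart, hypothetical           */
        extern int   fooSendBufferToServer(void *buf, size_t sz);   /* hypothetical              */

        static int send_message(const char *text)
        {
            size_t n      = strlen(text) + 1;
            char  *local  = malloc(n);            /* local heap: fine within this process only   */
            char  *shared = tkgGlobalAlloc(n);    /* global: safe to hand to another task        */
            if (!local || !shared)
            {
                free(local);
                if (shared) tkgGlobalFree(shared);
                return -1;
            }
            strcpy(local, text);                  /* scratch work stays in local memory          */
            memcpy(shared, local, n);             /* only the global buffer crosses the boundary */
            int rc = fooSendBufferToServer(shared, n);
            tkgGlobalFree(shared);
            free(local);
            return rc;
        }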

    Doesn't "yet" matter, but easier to try to address this now than later.



    *2: For TKRA-GL, it generally needs to work with physically mapped
    memory and MMIO to access the rasterizer module, which means the backend
    parts will likely need to run either in "superuserland" or in "kernel land".

    Likely rework is to try to separate the OpenGL API front-end from some
    backend machinery, which will be a more narrowly focused interface
    mostly dealing with things like:
    Uploading textures and similar;
    Drawing vertex arrays.
    All the things like glEnable/glDisable, matrix-stack manipulations, etc,
    will need to be kept in the front-end (making a context switch every
    time the program used glEnable or glColor4f or similar would be an
    impractical level of overhead).
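
    (One possible shape for that narrow back-end interface, sketched in the same
    COM-like vtable style used for TKGDI below; the interface itself is invented:)

        #include <stdint.h>

        typedef struct glbk_context_s glbk_context;

        typedef struct {
            int (*UploadTexture)(glbk_context *ctx, int texid, int w, int h,
                                 const void *rgba);
            int (*DrawVertexArrays)(glbk_context *ctx, int prim, int nverts,
                                    const float *xyz, const float *st,
                                    const uint8_t *rgba,
                                    const float *mvp,      /* matrices resolved by the front end    */
                                    uint32_t state_bits);  /* glEnable/etc state, flattened at draw */
        } glbk_vtable;

        struct glbk_context_s { const glbk_vtable *vt; };

        /* Front end accumulates glEnable/glColor4f/matrix state locally and only
           crosses the task boundary when it actually draws. */
        static int frontend_flush_draw(glbk_context *ctx, int prim, int n,
                                       const float *xyz, const float *st,
                                       const uint8_t *rgba,
                                       const float *mvp, uint32_t state_bits)
        {
            return ctx->vt->DrawVertexArrays(ctx, prim, n, xyz, st, rgba, mvp, state_bits);
        }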


    Though, in Windows, the division point seems to be a little higher
    (closer to the level of the OpenGL API itself). To mimic the Windows
    model, I would effectively need two division points:

    A front-end interface whose purpose is mostly to wrap over a bunch of "GetProcAddress" funk (with some way to plug in an interface to provide
    the GetProcAddress backend). This isn't asking too much more, since one
    needs to provide all the GetProcAddress cruft either way.

    A division interface between the frontend part which needs to run
    directly in the userland task, and the backend part which deals with the "actually making stuff happen" parts.

    One could design a lower-level API for this latter part, but
    (ironically) it would probably end up sort of resembling some sort of
    weird OpenGL/Direct3D hybrid...

    Though, could still do like TKGDI and provide a C wrapper over the
    internal VTable calls.
    HRESULT fooDooTheThing()
    {
        fooContext *ctx;
        ctx=fooGetCurrentContext();
        return(ctx->vt->DooTheThing(ctx));
    }
    ...


    A lot of this stuff gets kind of annoying sometimes though...

    Like, one can't just "do the thing", they end up needing a bunch of
    layers and boilerplate getting from "the place where the thing needs to
    be done" to "the place where the thing can be done" (but, I guess, the
    other alternative being to effectively not have an OS at all).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Tue Nov 21 22:12:18 2023
    BGB wrote:

    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:


    For some ops, the 3rd register (Ro) would instead operate as a 5-bit immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.

    Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit register specifier into a 5-bit immediate of either positive or negative integer
    value. This makes::

    1<<n
    ~0<<n
    container.bitfield = 7;

    single instructions.


    Where, say:
    Thread: Logical thread of execution within some existing process;
             has a register file and a stack.
    Process: Distinct collection of 1 or more threads within a shared
             has a memory map, a heap, and a vector of threads.
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other thread-like entities (such as call and method handlers), may be either thread-like
    or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.

    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the only valid exit point for this task being where it transfers control back to
    the caller and awaits the next syscall to arrive; and it is not valid
    for this task to try to syscall back into itself).

    In My 66000, every <effective> SysCall goes deeper into the privilege hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
    Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Tue Nov 21 22:47:26 2023
    On 2023-11-21 5:12 p.m., MitchAlsup wrote:
    BGB wrote:

    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:


    For some ops, the 3rd register (Ro) would instead operate as a 5-bit
    immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.

    Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit
    register
    specifier into a 5-bit immediate of either positive or negative integer value. This makes::

        1<<n
       ~0<<n
        container.bitfield = 7;

    single instructions.

    Q+ CPU allows immediates of any length to be used in place of source
    operand register values via postfix instructions. Virtually all
    instructions may use immediates instead of registers. There are also
    quick immediate form instructions that have the second source operand as
    an immediate constant encoded directly in the instruction as this is the
    most common use.

    The postfix immediate instructions come in four lengths: 23-bit, 39-bit,
    71-bit, and 135-bit. Currently float values make use of only 32 or 64 bits
    out of the 39 and 71-bit formats. I have been pondering having the float immediates left aligned with additional trailing bits. These bits are
    zero for now.

    Postfixes are treated as part of the current instruction by the CPU.


    Where, say:
    Thread: Logical thread of execution within some existing process;
             has a register file and a stack.
    Process: Distinct collection of 1 or more threads within a shared
             has a memory map a heap and a vector of threads.
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other
    thread-like entities (such as call and method handlers), may be either
    thread-like or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.

    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the
    only valid exit point for this task being where it transfers control
    back to the caller and awaits the next syscall to arrive; and it is
    not valid for this task to try to syscall back into itself).

    In My 66000, every <effective> SysCall goes deeper into the privilege hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV, Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    Does it follow the same way for hardware interrupts? I think RISCV goes
    to the deepest level first, machine level, then redirects to lower
    levels as needed. I was planning on Q+ operating the same way.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Tue Nov 21 21:36:30 2023
    On 11/21/2023 4:12 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/20/2023 11:31 AM, Stephen Fuld wrote:
    On 11/15/2023 1:10 PM, MitchAlsup wrote:


    For some ops, the 3rd register (Ro) would instead operate as a 5-bit
    immediate/displacement field. Which was initially a similar idea, with
    the 32-bit space mirroring the 16-bit space.

    Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit
    register
    specifier into a 5-bit immediate of either positive or negative integer value. This makes::

        1<<n
       ~0<<n
        container.bitfield = 7;

    single instructions.


    Originally, the pattern depended on the 16-bit operation, IIRC:
    (Rm), Rn => (Rm, Disp5), Rn
    (Rm, R0), Rn => (Rm, Ro), Rn
    ALU Ops:
    OP Rm, Rn => OP Rm, Ro, Rn
    OP Rm, R0, Rn => OP Rm, Imm5u, Rn

    Initially, BJX2 started out in a similar camp to BJX1, but when it
    became obvious that the 16-bit and 32-bit encodings effectively needed
    separate encoders, there was no real point keeping up the concept of
    32-bit ops being prefix-extended 16-bit ops.


    Then some other analysis/testing showed that for "general case
    tradeoffs", it was better to have an ISA with primarily 32-bit encodings
    with a 16-bit subset, than one with primarily 16-bit encodings with
    32-bit extended forms (though, by this point, I had already settled on
    the general encoding scheme).

    The main practical consequence of this realization was that the ISA did
    not need to be able to operate entirely within the limits of the 16-bit encoding space (but, did need to be able to operate without any of the
    16-bit encodings).


    After more development, I now have:
    Imm5u/Disp5u, some ops (Baseline)
    Imm6s/Disp6s (XG2)
    Imm9u: Typical ALU ops
    Imm10u (XG2)
    Imm9n: A few ALU ops
    Imm10n (XG2)
    Disp9u: LD/ST ops
    Disp10s (XG2)
    TBD if Disp10u+Disp6s would have been better.
    Since negative displacements are still pretty rare.
    Might have been better to have larger positive displacements.
    Imm10{u/n}: Various 2RI ops
    Imm11{u/n} {XG2}
    Disp11s / Disp12s (XG2), Branch-Compare-Zero
    Effectively uses an opcode bit as the sign bit.
    Imm16u/Imm16n: Some 2RI ops.
    Disp20s: BRA/BSR
    Disp23s (XG2)
    Imm24{u/n}: LDIZ/LDIN ("MOV Imm25s, R0")

    However, they are only available in specific combinations.
    Imm9u: ADD, ADDS.L, ADDU.L, AND, OR, XOR, SH{A/L}D{L/Q}, MULS, MULU
    Imm9n: ADD, ADDS.L, ADDU.L

    Which does mean, say:
    y=x&(~7);
    Needs either to load a constant into a register, or use a jumbo prefix.


    The Disp9u/Disp10s encoding exists on all basic Load/Store ops, however "special" ops (like XMOV.x) only have Disp5u/Disp6s encodings (not a
    huge loss though).

    With a Jumbo-Imm prefix, many of the Disp/Imm cases expand to 33 bits
    (except Disp5 which only goes to 29 bits).



    Where, say:
    Thread: Logical thread of execution within some existing process;
             has a register file and a stack.
    Process: Distinct collection of 1 or more threads within a shared
             has a memory map a heap and a vector of threads.
    address space and shared process identity (may have its own address
    space, though as-of-yet, TestKern uses a shared global address space);
    Task: Supergroup that includes Threads, Processes, and other
    thread-like entities (such as call and method handlers), may be either
    thread-like or process-like.

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.


    Yeah.

    As-is, I have several major interrupt handlers:

    Fault: Something has gone wrong, current handling is to stall the CPU
    until reset (and/or terminate the emulator). Could in principle do other
    things.

    IRQ: Deals with timer, may potentially be used for preemptive task
    scheduling (code is in place, but this is not currently enabled). Does
    not currently perform any other "complex" actions (and the "practical"
    use of IRQ's remains limited in my case, due in large part to the
    limitations of interrupt handling).

    TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
    action if a "page fault" style event occurs (or something needs to be
    paged in/paged out from the swapfile).

    SYSCALL: Mostly initiates task switches and similar, and little else.


    Unlike x86, the design of the interrupt mechanisms means it isn't
    practical to hang the whole OS off of an interrupt handler. The closest
    option is mostly to use the interrupt handlers to trigger context
    switches (which is, ironically, slightly less of an issue, as many of
    the "hard" parts of a context switch are already performed for sake of
    dealing with the "rather minimalist" interrupt mechanism).


    Basically, in this design, it isn't possible to enter a new interrupt
    without first returning from the prior interrupt (at least not without
    f*ing the CPU state). And, as-is, interrupts can only operate in
    physically addressed mode.

    They also need to manually save and restore all the registers, since
    unlike either SuperH or RISC-V, BJX2 does not have any banked registers
    (apart from SP/SSP, which switch places when entering/leaving an ISR).

    Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
    8086 real-mode, it doesn't implicitly push anything to the stack (nor
    have an "interrupt vector table").


    So, the interrupt handling is basically a computed branch, which was
    about the cheapest mechanism I could come up with at the time.

    Did create a little bit of a puzzle initially as to how to get the CPU
    state saved off and restored with no free registers. Though, there are a
    few CR's which capture the CPU state at the time the ISR happens (these registers getting overwritten every time a new interrupt occurs).


    So, say:
    Interrupt entry:
      Copy low bits of SR into high bits of EXSR;
      Copy PC into SPC.
      Copy fault address into TEA;
      Swap SP and SSP (*1);
      Set CPU flags to Supervisor+ISR mode;
        CPU Mode bits now copied from high bits of VBR.
      Computed branch relative to VBR.
        Offset depends on interrupt category.
    Interrupt return (RTE):
      Copy EXSR bits back into SR;
      Unswap SP/SSP (*1);
      Branch to SPC.
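
    (The same sequence, modeled schematically in C; field widths, flag bit
    positions, and the per-category offset are not given above, so they are
    purely placeholders here:)

        #include <stdint.h>

        #define SR_SUPERVISOR (1ull << 62)   /* bit positions invented */
        #define SR_ISR        (1ull << 61)

        typedef struct { uint64_t SR, EXSR, PC, SPC, TEA, SP, SSP, VBR; } cpu_regs;

        static void interrupt_entry(cpu_regs *c, uint64_t fault_addr, uint64_t cat_offset)
        {
            uint64_t t;
            c->EXSR = (c->EXSR & 0xFFFFFFFFull) | ((c->SR & 0xFFFFFFFFull) << 32);
                                              /* low SR bits saved in high EXSR bits (widths illustrative) */
            c->SPC  = c->PC;                  /* return address                          */
            c->TEA  = fault_addr;             /* faulting address, if any                */
            t = c->SP; c->SP = c->SSP; c->SSP = t;   /* swap SP and SSP                  */
            c->SR  |= SR_SUPERVISOR | SR_ISR; /* mode bits also come from high VBR bits (not modeled) */
            c->PC   = (c->VBR & ~0xFFull) + cat_offset;  /* computed branch, VBR 256B-aligned */
        }

        static void interrupt_return(cpu_regs *c)
        {
            uint64_t t;
            c->SR = (c->SR & ~0xFFFFFFFFull) | (c->EXSR >> 32);   /* restore saved SR bits */
            t = c->SP; c->SP = c->SSP; c->SSP = t;                /* unswap SP/SSP         */
            c->PC = c->SPC;                                       /* branch to SPC         */
        }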


    *1: At the time, couldn't figure a good way to shave more logic off the mechanism. Though, now, the most obvious candidate would be to
    eliminate the implicit SP/SSP swapping (this part is currently handled
    in the instruction decoder).

    So, instead, the ISR entry point would do something like:
        MOV   SP, SSP
        MOV   0xDE00, SP    //Designated ISR stack SRAM
        MOV.Q R0, (SP, 0)
        MOV.Q R1, (SP, 8)
        ... Now save off everything else ...

    But, didn't really think of it at the time.


    There is already the trick of requiring VBR to be aligned (currently 64B
    in practice; formally 256B), mostly so as to allow the "address
    computation" to be done via bit-slicing.

    Not sure if many CPUs have a cheaper mechanism here...



    Note that in my case, generally the interrupt handlers are written in C,
    with the compiler managing all the ISR prolog/epilog stuff (mostly saving/restoring pretty much the entire CPU state to the ISR stack).

    Generally, the ISR's also need to deal with having a comparably small
    stack (with 0.75K already used for the saved CPU state).

    Where:
    0000..7FFF: Boot ROM
    8000..BFFF: (Optional) Extended Boot ROM
    C000..DFFF: Boot/ISR SRAM
    E000..FFFF: (Optional) Extended SRAM

    Generally, much of the work of the context switch is pulled off using
    "memcpy" calls (with the compiler providing a special "__arch_regsave"
    variable giving the address of the location it has dumped the CPU
    registers into; which in turn covers most of the core state that needs
    to be saved/restored for a process context switch).
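
    (A sketch in C of that memcpy-based switch; __arch_regsave is named above,
    while the task structure, the 768-byte size, and the function itself are
    assumptions for illustration:)

        #include <string.h>
        #include <stdint.h>

        #define REGSAVE_BYTES 768                 /* "0.75K already used for the saved CPU state" */

        extern uint8_t *__arch_regsave;           /* provided by the compiler (exact type assumed) */

        typedef struct task_s {
            uint8_t saved_regs[REGSAVE_BYTES];
            /* ... other per-task state ... */
        } task;

        /* Called from inside the ISR, after the prolog has dumped the registers. */
        static void context_switch(task *from, task *to)
        {
            memcpy(from->saved_regs, __arch_regsave, REGSAVE_BYTES);  /* save outgoing task */
            memcpy(__arch_regsave, to->saved_regs, REGSAVE_BYTES);    /* load incoming task */
            /* the ISR epilog then reloads registers from __arch_regsave,
               so the RTE resumes execution in 'to' */
        }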

    Though, I guess one other possibility would be if the compiler-generated
    ISR code assumed TBR to always be valid (and then copied the registers
    to a fixed location relative to TBR instead of the ISR stack), which
    could in-theory allow for faster context switching (by eliminating the
    need for the memcpy calls), but would be a bit more brittle (if TBR is
    invalid, stuff is going to break pretty hard as soon as an interrupt
    happens).

    Would likely need special compiler attributes for this (would not make
    sense for interrupts which do not, or are unlikely to, perform a context switch).


    Though, will need to probably add more special case handling such that
    the Syscall task can not yield or try to itself make a syscall (the
    only valid exit point for this task being where it transfers control
    back to the caller and awaits the next syscall to arrive; and it is
    not valid for this task to try to syscall back into itself).

    In My 66000, every <effective> SysCall goes deeper into the privilege hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV, Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    No way to handle a syscall recursively in my case, partly because of how
    the task works:
    It gets started at a certain location, and switches off at the point
    where it would receive a syscall request.

    So, sort of like:
    ... //initial task setup
    TK_Task_SyscallReturnToUser(task);
    while(1)
    {
        TK_Task_SyscallGetArgs(&task, &sobj, &umsg, &rptr, &args);
        //handle the syscall
        TK_Task_SyscallReturnToUser(task);
    }
    Whenever ReturnToUser returns, it expects there to be a syscall request
    for it to handle. This call effectively transfers control back to the
    caller task, with the syscall task ready to receive a new request.

    SyscallGetArgs basically invokes "arcane magic" to fetch the parameters
    for the task that performed the syscall (the dispatch mechanism stashes
    the parameters in a designated location in the syscall handler's task
    context).


    However, if the Syscall task itself tries to invoke yield, or otherwise triggers a context switch, then it will not be at the correct location
    to handle a syscall if one were to arrive (at which point, the OS explodes).

    Or, if it tries to perform a syscall, then the syscall attempt will
    return immediately (since it effectively performs a context which back
    to itself).


    Granted, it is possible that the SYSCALL dispatcher could be made to
    dispatch among one of multiple SYSCALL tasks, which could then handle up
    to N levels of recursion.

    On a multi-core system, each core would also need its own syscall tasks
    (well, and/or they operate round-robin, and the syscall is directed at whichever task is in the correct state to handle a request).

    There is a little flexibility here, at least in as far as pretty much
    the whole mechanism is managed in software in this case (apart from the
    ISR mechanism itself).



    Note that for inter-task method-calls, a similar mechanism is used to
    normal syscalls, except:
    A range of special syscall numbers is used as a VTable index;
    The object's VTable implicitly encodes the PID of the task to
    dispatch the request to.

    So, instead of waiting for syscalls, it waits for method calls, and then dispatches them as needed (locally) when they arrive.

    On the receiver end, there is a mechanism to compose the VTable
    interface, where the VTable is effectively composed of methods whose
    sole purpose is to invoke a syscall, passing the argument list and
    similar off to a handler, with the syscall number based on the method's location within the VTable.

    Then, the SYSCALL ISR sees this, and then fetches the corresponding task
    to dispatch to, ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Nov 22 18:38:00 2023
    BGB wrote:

    On 11/21/2023 4:12 PM, MitchAlsup wrote:
    BGB wrote:

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to context
    switch back to the task that made the request, or to yield to another
    task, ...).

    We call these things:: dispatchers.


    Yeah.

    As-is, I have several major interrupt handlers:

    Fault: Something has gone wrong, current handling is to stall the CPU
    until reset (and/or terminate the emulator). Could in premise do other things.

    I call these checks:: a page fault is an unanticipated SysCall to the
    Guest OS page fault handler; whereas a check is something that should
    never happen but did (ECC repair fail): These trap to Real HV.

    IRQ: Deals with timer, may potentially be used for preemptive task
    scheduling (code is in place, but this is not currently enabled). Does
    not currently perform any other "complex" actions (and the "practical"
    use of IRQ's remains limited in my case, due in large part to the
    limitations of interrupt handling).

    Every My 66000 process has its own event table which combines exceptions, interrupts, SysCalls,... This means there is no table surgery when switching between Guest OS and Guest Hypervisor and Real Hypervisor.

    TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
    action if a "page fault" style event occurs (or something needs to be
    paged in/paged out from the swapfile).

    HW table walking.

    SYSCALL: Mostly initiates task switches and similar, and little else.

    Part of Event table.

    Unlike x86, the design of the interrupt mechanisms means it isn't
    practical to hang the whole OS off of an interrupt handler. The closest option is mostly to use the interrupt handlers to trigger context
    switches (which is, ironically, slightly less of an issue, as many of
    the "hard" parts of a context switch are already performed for sake of dealing with the "rather minimalist" interrupt mechanism).

    My 66000 can perform a context switch (user->user) in a single instruction.
    Old state goes to memory, new state comes from memory; by the time
    state has arrived, you are fetching instructions in the new context
    under the new context MMU tables and privileges and priorities.

    Basically, in this design, it isn't possible to enter a new interrupt
    without first returning from the prior interrupt (at least not without
    f*ing the CPU state). And, as-is, interrupts can only operate in
    physically addressed mode.

    They also need to manually save and restore all the registers, since
    unlike either SuperH or RISC-V, BJX2 does not have any banked registers (apart from SP/SSP, which switch places when entering/leaving an ISR).

    Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
    8086 real-mode, it doesn't implicitly push anything to the stack (nor
    have an "interrupt vector table").


    So, the interrupt handling is basically a computed branch; which was basically about the cheapest mechanism I could come up with at the time.

    Did create a little bit of a puzzle initially as to how to get the CPU
    state saved off and restored with no free registers. Though, there are a
    few CR's which capture the CPU state at the time the ISR happens (these registers getting overwritten every time a new interrupt occurs).

    Why not just treat the RF as a cache with a known address in physical memory. In MY 66000 that is what I do and then just push and pull 4 cache lines at a time.

    So, say:
    Interrupt entry:
    Copy low bits of SR into high bits of EXSR;
    Copy PC into SPC.
    Copy fault address into TEA;
    Swap SP and SSP (*1);
    Set CPU flags to Supervisor+ISR mode;
    CPU Mode bits now copied from high bits of VBR.
    Computed branch relative to VBR.
    Offset depends on interrupt category.
    Interrupt return (RTE):
    Copy EXSR bits back into SR;
    Unswap SP/SSP (*1);
    Branch to SPC.

    Interrupt Entry Point::
    // by this point all the old registers have been saved where they
    // are supposed to go, and the interrupt dispatcher registers are
    // already loaded up and ready to go, and the CPU is running at
    // whatever privilege level was specified.
    HR R1<-WHY
    LD IP,[IP,R1<<3,InterruptVectorTable] // Call through table
    RTI
    //
    InterruptHandler0:
    // do what is necessary
    // note this can all be written in C
    RET
    InterruptHandler1::



    *1: At the time, couldn't figure a good way to shave more logic off the
    mechanism. Though, now, the most obvious candidate would be to
    eliminate the implicit SP/SSP swapping (this part is currently handled
    in the instruction decoder).

    So, instead, the ISR entry point would do something like:
       MOV    SP, SSP
       MOV    0xDE00, SP  //Designated ISR stack SRAM
       MOV.Q  R0, (SP, 0)
       MOV.Q  R1, (SP, 8)
       ... Now save off everything else ...

    But, didn't really think of it at the time.


    There is already the trick of requiring VBR to be aligned (currently 64B
    in practice; formally 256B), mostly so as to allow the "address
    computation" to be done via bit-slicing.

    Not sure if many CPUs have a cheaper mechanism here...

    Treat the CPU state and the register state as cache lines and have
    HW shuffle them in and out. You can even start the 5 cache line reads
    before you start the CPU state writes; saving latency (which you cannot
    do using SW-only methods).

    Note that in my case, generally the interrupt handlers are written in C,
    with the compiler managing all the ISR prolog/epilog stuff (mostly saving/restoring pretty much the entire CPU state to the ISR stack).

    My 66000 compiler remains blissfully ignorant of ISR prologue and
    epilogue and it still works.

    Generally, the ISR's also need to deal with having a comparably small
    stack (with 0.75K already used for the saved CPU state).

    Where:
    0000..7FFF: Boot ROM
    8000..BFFF: (Optional) Extended Boot ROM
    C000..DFFF: Boot/ISR SRAM
    E000..FFFF: (Optional) Extended SRAM

    Generally, much of the work of the context switch is pulled off using "memcpy" calls (with the compiler providing a special "__arch_regsave" variable giving the address of the location it has dumped the CPU
    registers into; which in turn covers most of the core state that needs
    to be saved/restored for a process context switch).
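
    As a rough sketch (not the actual TestKern code) of what such a
    memcpy-based switch might look like, assuming the compiler-provided
    "__arch_regsave" pointer described above; the struct name and the size
    of the save area are made up for illustration:

      /* Hypothetical sketch of a memcpy-based context switch. The ISR
         prolog has already dumped the CPU registers at *__arch_regsave;
         field names and sizes here are illustrative only. */
      #include <string.h>

      extern unsigned char *__arch_regsave;   /* filled in by the ISR prolog */

      struct task_ctx {
          unsigned char regs[768];            /* ~0.75K of saved CPU state   */
      };

      void switch_task(struct task_ctx *cur, struct task_ctx *next)
      {
          /* save the interrupted task's registers out of the ISR save area */
          memcpy(cur->regs, __arch_regsave, sizeof(cur->regs));
          /* load the next task's registers into the ISR save area; the ISR
             epilog then reloads the CPU from it and RTEs into 'next' */
          memcpy(__arch_regsave, next->regs, sizeof(next->regs));
      }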

    Why not just make the HW push and pull cache lines.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Wed Nov 22 19:36:28 2023
    Robert Finch wrote:

    On 2023-11-21 5:12 p.m., MitchAlsup wrote:

    In My 66000, every <effective> SysCall goes deeper into the privilege
    hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
    Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    Does it follow the same way for hardware interrupts? I think RISCV goes
    to the deepest level first, machine level, then redirects to lower
    levels as needed. I was planning on Q+ operating the same way.

    It depends; there is the school of thought that you just deliver control to
    someone who can always deal with it (Machine level in RISC-V), and there
    is the other school of thought that some table should encode which level
    of the system control is delivered to. The former allows SW to control
    every step of the process, the latter gets rid of all the SW checking
    and simplifies the process of getting to and back from interrupt handlers
    (and their associated soft IRQs).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Nov 22 17:17:30 2023
    Why not just treat the RF as a cache with a known address in physical memory. In MY 66000 that is what I do and then just push and pull 4 cache lines at a time.

    Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
    it's not also a pun on the TI 9900.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Nov 22 23:58:19 2023
    Stefan Monnier wrote:

    Why not just treat the RF as a cache with a known address in physical memory.
    In MY 66000 that is what I do and then just push and pull 4 cache lines at a time.

    Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
    it's not also a pun on the TI 9900.

    In reverence to the CDC 6600, not derived from it.

    Exchange Jump on the CDC 6600 caused a context switch that took 16+10 processor cycles
    (after the scoreboard cleared.) And on the 6600, NOS was in the PPs and the CPUs
    were there to just crunch numbers.

    I have a hard real time version of My 66000 where the lower levels of the OS
    are in HW, and if you have fewer than 1024 threads running, you do not expend
    any (zero, 0, nada, zilch) cycles in the OS performing context switches or
    priority alterations. This system has the property that if an interrupt (or
    message) arrives to unblock a waiting thread that is of higher priority than
    any CPU in its affinity group of CPUs, then the lowest priority CPU in that
    group receives the higher priority thread (without an excursion through the
    OS (damaging cache state)).

    I have a Linux friendly version where context switch is a single instruction.
    When you write a context pointer, that entire context is now available to
    support whatever you want it to support. So, an unprivileged application can
    context switch to another unprivileged application by writing a single
    control register, leaving Guest OS, Guest HV and Real HV in their original
    configuration. Guest OS can context switch to a different Guest OS in a
    single instruction, and then the Guest OS receiving control needs to context
    switch to an application it wants to run--so 20-ish cycles to perform a
    Guest OS switch. (This now costs typical old architectures 10,000 cycles.)

    But nowhere does any thread receiving control have to execute any state or register saving or restoring... Just like Exchange Jump.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Wed Nov 22 21:50:30 2023
    On 11/22/2023 12:38 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/21/2023 4:12 PM, MitchAlsup wrote:
    BGB wrote:

    Where, say, the Syscall interrupt handler doesn't generally handle
    syscalls itself (since the ISRs will only have access to
    physically-mapped addresses), but effectively instead initiates a
    context switch to the task that can handle the request (or, to
    context switch back to the task that made the request, or to yield
    to another task, ...).

    We call these things:: dispatchers.


    Yeah.

    As-is, I have several major interrupt handlers:

    Fault: Something has gone wrong, current handling is to stall the CPU
    until reset (and/or terminate the emulator). Could in premise do other
    things.

    I call these checks:: a page fault is an unanticipated SysCall to the
    Guest OS page fault handler; whereas a check is something that should
    never happen but did (ECC repair fail): These trap to Real HV.


    A lot of things here are things that could be handled, but are not
    currently handled:
    Invalid instructions;
    Access to invalid memory regions;
    Access to memory in a way which violates access protections;
    A branch to an invalid address;
    Code used the BREAK instruction or similar;
    Etc.

    Generally at present, if any of these happens, it means that something
    has gone badly enough that I want to stall immediately and probably
    debug it.

    In a "real" OS, if this happens in userland, one would typically turn
    this into "SEGFAULT" or similar.


    For the emulator, if a BREAK occurs in ISR mode (or any other fault
    happens in ISR mode), it causes the emulator to stop execution, dump a backtrace and registers, and then terminate. Otherwise, exiting the
    emulator normally will dump a bunch of profiling information (this part
    is not done if the emulator terminates due to a fault).

    Stalling the Verilog core causes it to dump the state of the pipeline and
    some other things via "$display" (potentially relevant for debugging), and
    allows seeing the crash PC on the 7-segment display on the Nexys A7.


    IRQ: Deals with timer, may potentially be used for preemptive task
    scheduling (code is in place, but this is not currently enabled). Does
    not currently perform any other "complex" actions (and the "practical"
    use of IRQ's remains limited in my case, due in large part to the
    limitations of interrupt handling).

    Every My 66000 process has its own event table which combines exceptions,
    interrupts, SysCalls, ... This means there is no table surgery when switching
    between Guest OS and Guest Hypervisor and Real Hypervisor.


    In my case, the VBR register is global (and set up during boot).

    Any per-process event dispatching would need to be handled in software.


    I didn't go with an x86-style IDT or similar partly because this would
    have been significantly more expensive (in terms of Verilog code and
    LUTs) than the existing mechanism. The role of an x86-style IDT could be
    faked in software though.
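
    A software-faked IDT could be little more than a table of handler
    pointers (global, or per-task) that the common ISR indexes by event
    category. A minimal sketch, with all names hypothetical and not taken
    from the actual TestKern code:

      /* Minimal sketch of faking IDT-style dispatch in software: the single
         hardware entry point looks up a handler in a table. */
      enum { EV_RESET, EV_FAULT, EV_IRQ, EV_TLBMISS, EV_SYSCALL, EV_MAX };

      typedef void (*event_handler_t)(void);

      static event_handler_t event_table[EV_MAX];   /* could also be per-task */

      void isr_common(int event_category)
      {
          event_handler_t h = event_table[event_category];
          if (h)
              h();                        /* dispatch entirely in software  */
          /* else: fall through to some default (e.g. stall/debug) behavior */
      }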


    So, VBR is sort of like:
    (63:48): Encodes CPU state to use on ISR entry;
    (47: 6): Encodes the ISR entry point.
    In practice only (28:6) are "actually usable".
    ( 5: 0): Must be Zero

    Where, low-order bits are replaced with an entry offset:
    00: RESET
    08: FAULT
    10: IRQ
    18: TLBMISS
    20: SYSCALL
    28: Reserved

    The 8 bytes of space gives enough room to encode a relative or absolute
    branch to the actual entry point (while not being so big as to be
    needlessly wasteful).
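
    As an illustration of how cheap this dispatch is with an aligned VBR,
    forming the entry address is just bit-slicing and concatenation; roughly
    along these lines (a C sketch of the behavior, not the actual Verilog;
    the exact mask is an assumption based on the field layout above):

      /* Sketch of the computed-branch target formation: keep VBR bits
         (47:6), drop the mode bits (63:48) and alignment bits (5:0), and
         OR in the 8-byte-spaced entry offset. */
      #include <stdint.h>

      uint64_t isr_entry_address(uint64_t vbr, unsigned category)
      {
          uint64_t base = vbr & 0x0000FFFFFFFFFFC0ull;  /* bits 47:6 only   */
          return base | ((uint64_t)category << 3);      /* 8-byte spacing   */
      }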

    During CPU reset, VBR is cleared to 0, and then control is transferred
    to 0, which branches to the ROM's entry point.

    The use of a computed branch was preferable to a "vector table" as the
    vector table would have required some mechanism for the CPU to perform a
    memory load to get the address. Computed branch was easier, since no
    special memory load is needed, just branch there, and assume this lands
    on a branch instruction which takes control where it needs to go.


    TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
    action if a "page fault" style event occurs (or something needs to be
    paged in/paged out from the swapfile).

    HW table walking.


    Yeah, no page-table hardware in my case.


    Had on/off considered an "Inverted Page-Table" like in IA-64, but this
    still seemed to be annoyingly expensive vs the "Throw a TLB-Miss
    Exception" route. Even if I eliminated the TLB-Miss logic, would then
    need to have Page-Fault logic, which doesn't really save anything there
    either.

    There is a designated register though for the page-table: TTB.

    With the considered inverted-page-table using a separate VIPT register,
    the idea being that VIPT would point to a region of, say, 4096x4x128b
    TLBE's (~256K), effectively functioning as a RAM-backed L3 TLB. If this
    table lacked the requested TLBE, this would still result in a TLB Miss
    fault.

    Note that the idea was still that trying to use 96-bit virtual address
    mode would require two TLBE's, effectively halving associativity. This
    in turn requires plain modulo-addressing as hashing can create a "bad situation" where a 2-way TLB will get stuck in an infinite loop (but
    this infinite loop scenario is narrowly averted with modulo addressing).

    Granted, 4-way is still better as it seems to result in a comparably
    lower TLB miss rate.

    It is still possible though to XOR the TLBE's index with a bit-pattern
    derived from the ASID, to slightly reduce the cost of context switches
    in some cases (if multiple address spaces were being used).
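
    A sketch of the index computation being described (plain power-of-two
    modulo indexing, with an optional ASID-derived XOR); the set count, page
    shift, and hash pattern here are made up for illustration:

      /* Sketch of TLB set-index selection; sizes/bit positions illustrative. */
      #include <stdint.h>

      #define TLB_SETS    256u            /* hypothetical, power of two      */
      #define PAGE_SHIFT  14              /* 16K pages, for illustration     */

      unsigned tlb_index(uint64_t vaddr, uint16_t asid, int xor_with_asid)
      {
          unsigned idx = (unsigned)(vaddr >> PAGE_SHIFT) & (TLB_SETS - 1);
          if (xor_with_asid)
              idx ^= (asid ^ (asid >> 8)) & (TLB_SETS - 1);
          return idx;
      }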


    Note that the L1 I$ and D$ can get along reasonably well with an
    optional 32-entry 1-way "Micro-TLB".


    SYSCALL: Mostly initiates task switches and similar, and little else.

    Part of Event table.


    All software in my case.


    Unlike x86, the design of the interrupt mechanisms means it isn't
    practical to hang the whole OS off of an interrupt handler. The
    closest option is mostly to use the interrupt handlers to trigger
    context switches (which is, ironically, slightly less of an issue, as
    many of the "hard" parts of a context switch are already performed for
    sake of dealing with the "rather minimalist" interrupt mechanism).

    My 66000 can perform a context switch (user->user) in a single instruction.
    Old state goes to memory, new state comes from memory; by the time
    state has arrived, you are fetching instructions in the new context
    under the new context MMU tables and privileges and priorities.


    Yeah, but that is not exactly minimalist in terms of the hardware.

    Granted, burning around 1 kilocycle of overhead per syscall isn't ideal either...


    Eg:
    Save registers to ISR stack;
    Copy registers to User context;
    Copy handler-task registers to ISR stack;
    Reload registers from ISR stack;
    Handle the syscall;
    Save registers to ISR stack;
    Copy registers to Syscall context;
    Copy User registers to ISR stack;
    Reload registers from ISR stack.


    Does mean that one needs to be economical with syscalls (say, doing
    "printf" a whole line at a time, rather than individual characters, ...).

    And, did create incentive to allow getting the microsecond-clock value
    and hardware RNG values from CPUID rather than needing a syscall (say,
    don't want to burn 20us to check the microsecond counter, ...).
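
    For the "printf a whole line at a time" point above, a trivial
    line-buffering layer is enough to amortize the syscall cost; a sketch,
    where "__sys_write" is a hypothetical stand-in for whatever the actual
    syscall wrapper is:

      /* Only cross into the kernel once per line (or full buffer), rather
         than once per character. */
      #include <stddef.h>

      extern void __sys_write(const char *buf, size_t len);  /* hypothetical */

      static char   out_buf[256];
      static size_t out_len;

      void buf_putc(char c)
      {
          out_buf[out_len++] = c;
          if (c == '\n' || out_len == sizeof(out_buf)) {
              __sys_write(out_buf, out_len);   /* one syscall per line/chunk */
              out_len = 0;
          }
      }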


    If the "memcpy's" could be eliminated, this could roughly halve the cost
    of doing a syscall.


    One other option would be to do like RISC-V's privileged spec and have
    multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    Worth the cost? Dunno.


    Not too much different to modern Windows, where slow syscalls are still
    fairly common (and despite the slowness of the mechanism, it seems like
    BJX2 syscalls still manage to be around an order of magnitude faster than Windows syscalls in terms of clock-cycle cost...).


    Well, and the seeming absurdity of WaitForSingleObject() on a mutex
    generally taking upwards of 1 million clock-cycles IIRC in past
    experiments (when the mutex isn't already locked; and, if it is
    locked... yeah...).

    You could lock a mutex... or you could render an entire frame in Doom,
    then checksum the frame image, and use the checksum as a hash key. In a
    roughly similar time-scale.


    Luckily, at least, the CriticalSection objects were not absurdly slow...




    Basically, in this design, it isn't possible to enter a new interrupt
    without first returning from the prior interrupt (at least not without
    f*ing the CPU state). And, as-is, interrupts can only operate in
    physically addressed mode.

    They also need to manually save and restore all the registers, since
    unlike either SuperH or RISC-V, BJX2 does not have any banked
    registers (apart from SP/SSP, which switch places when
    entering/leaving an ISR).

    Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
    8086 real-mode, it doesn't implicitly push anything to the stack (nor
    have an "interrupt vector table").


    So, the interrupt handling is basically a computed branch; which was
    basically about the cheapest mechanism I could come up with at the time.

    Did create a little bit of a puzzle initially as to how to get the CPU
    state saved off and restored with no free registers. Though, there are
    a few CR's which capture the CPU state at the time the ISR happens
    (these registers getting overwritten every time a new interrupt occurs).

    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache lines
    at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.


    Though, it could make sense if there were a mechanism for a context switch
    to dump the whole register file to Block-RAM, and some mechanism to access
    this RAM via an MMIO interface.

    Pros/cons, seems like each possibility would also come with drawbacks:
    As-is: Slowness due to needing to save/reload everything;
    RISC-V: Expensive regfile, only works for limited cases;
    MMIO Backed + RV-like: Faster U<->S, but slower task switching.
    RAM Backed: Cache coherence becomes a critical feature.



    The RISC-V like approach makes sense if one assumes:
    There is a user process;
    There is a kernel running under it;
    We want to call from the user process into the kernel.

    Doesn't make so much sense, say, for:
    User Process A calls a VTable entry which calls into User Process B;
    Service A uses a VTable to call into the VFS;
    ...

    Say, where one is making use of horizontal context switches for control
    flow between logical tasks. Which would still remain fairly expensive
    under a RISC-V like model.

    One could have enough register banks for N logical tasks, but supporting
    4 or 8 copies of the register file is going to cost more than 2 or 3.


    Granted, possibly, handling system calls via using a mechanism along the
    lines of a horizontal context switch, is a bit unusual...

    But, ironically, this sort of ended up seeming like the most
    straightforward approach in my case.


    So, say:
       Interrupt entry:
         Copy low bits of SR into high bits of EXSR;
         Copy PC into SPC.
         Copy fault address into TEA;
         Swap SP and SSP (*1);
         Set CPU flags to Supervisor+ISR mode;
           CPU Mode bits now copied from high bits of VBR.
         Computed branch relative to VBR.
           Offset depends on interrupt category.
       Interrupt return (RTE):
         Copy EXSR bits back into SR;
         Unswap SP/SSP (*1);
         Branch to SPC.

        Interrupt Entry Point::
          // by this point all the old registers have been saved where they
          // are supposed to go, and the interrupt dispatcher registers are
          // already loaded up and ready to go, and the CPU is running at
          // whatever privilege level was specified.
          HR   R1<-WHY
          LD   IP,[IP,R1<<3,InterruptVectorTable] // Call through table
          RTI
    //
    InterruptHandler0:
          // do what is necessary
          // note this can all be written in C
          RET
    InterruptHandler1::


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
    Branch from VBR-table to ISR entry point;
    Get R0 and R1 saved onto the stack;
    Get some of the CRs saved off (we need R0 and R1 free here);
    Get the rest of the GPRs saved onto the stack;
    Call into the main part of the ISR handler (using normal C ABI);
    Restore most of the GPRs;
    Restore most of the CRs;
    Restore R0 and R1;
    Do an RTE.


    If I were to make the ISR mechanism assume that TBR was valid:
    Branch from VBR-table to ISR entry point;
    Get R0/R1/R8/R9 saved onto the stack;
    Load the address of the register-save area from the current TBR;
    Save CRs and GPRs to register save area
    Copy over the values saved onto the stack.
    Call into the main part of the ISR handler (using normal C ABI);
    Restore everything from the potentially new TBR;
    ...

    Pros:
    Could speed up syscalls and task switches;
    No hardware-level changes needed.

    Cons:
    Now the compiler would be hard-coded for TestKern's TBR layout (this
    stuff would need to be baked into the ABI, *).


    *: This structure being comparable to the TEB in Windows (and also holds
    the location to find things like TLS variables and similar).

    It differs slightly from the Windows TEB though:
    The main part is Read-Only in Userland;
    Holds a pointer to a Kernel-Only part;
    This part holds the saved registers.
    Holds another pointer to a User Modifiable part
    This part holds the TLS variables and some execution-state stuff.
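
    A rough C sketch of the split just described; the struct and field names
    here are hypothetical, not the actual layout:

      /* Hypothetical sketch of the TBR-referenced task block: a read-only
         main part pointing at a kernel-only register-save area and a
         user-writable part holding TLS. */
      struct task_kern_part {              /* kernel-only part               */
          unsigned long long regsave[96];  /* ~0.75K of saved registers      */
      };

      struct task_user_part {              /* user-modifiable part           */
          void *tls_base;                  /* TLS variables, exec state, ... */
      };

      struct task_info {                   /* main part, read-only in userland */
          struct task_kern_part *kern;
          struct task_user_part *user;
      };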


    Likely, in C land, might look something like:
    __interrupt __declspec(isr_regsave_tbr) void __isr_syscall(void)
    {
    ...
    }

    With the "__declspec(isr_regsave_tbr)" signaling to BGBCC that it should
    save registers directly into the TBR's register-save area rather than
    onto the ISR stack.

    Should be workable at least under the assumption that no one is going to
    try to invoke a syscall without a valid TBR.




    *1: At the time, couldn't figure a good way to shave more logic off
    the mechanism. Though, now, the most obvious candidate would be to
    eliminate the implicit SP/SSP swapping (this part is currently handled
    in the instruction decoder).

    So, instead, the ISR entry point would do something like:
       MOV    SP, SSP
       MOV    0xDE00, SP  //Designated ISR stack SRAM
       MOV.Q  R0, (SP, 0)
       MOV.Q  R1, (SP, 8)
       ... Now save off everything else ...

    But, didn't really think of it at the time.


    There is already the trick of requiring VBR to be aligned (currently
    64B in practice; formally 256B), mostly so as to allow the "address
    computation" to be done via bit-slicing.

    Not sure if many CPUs have a cheaper mechanism here...

    Treat the CPU state and the register state as cache lines and have
    HW shuffle them in and out. You can even start the 5 cache line reads
    before you start the CPU state writes; saving latency (which you cannot
    do using SW-only methods).


    I meant hardware-side cost.

    But, yeah, software-side could be a fair bit faster...


    Note that in my case, generally the interrupt handlers are written in
    C, with the compiler managing all the ISR prolog/epilog stuff (mostly
    saving/restoring pretty much the entire CPU state to the ISR stack).

    My 66000 compiler remains blissfully ignorant of ISR prologue and
    epilogue and it still works.

    Generally, the ISR's also need to deal with having a comparably small
    stack (with 0.75K already used for the saved CPU state).

    Where:
       0000..7FFF: Boot ROM
       8000..BFFF: (Optional) Extended Boot ROM
       C000..DFFF: Boot/ISR SRAM
       E000..FFFF: (Optional) Extended SRAM

    Generally, much of the work of the context switch is pulled off using
    "memcpy" calls (with the compiler providing a special "__arch_regsave"
    variable giving the address of the location it has dumped the CPU
    registers into; which in turn covers most of the core state that needs
    to be saved/restored for a process context switch).

    Why not just make the HW push and pull cache lines.


    My current prediction is that the mechanism for doing this would make
    the register file significantly more expensive, along with making for
    more serious problems related to memory coherence if the CPU tries to
    touch any of this (unlike the RAM-backed VRAM, I can't hand-wave this,
    if things don't go perfectly, stuff is gonna explode).


    Granted, going "true multicore" is likely to require addressing the
    cache coherence issues somehow (likely needing to manually invoke cache
    flushes to deal with multithreaded code isn't really going to fly).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Nov 23 16:53:04 2023
    BGB wrote:

    On 11/22/2023 12:38 PM, MitchAlsup wrote:
    BGB wrote:

    Yeah, but that is not exactly minimalist in terms of the hardware.

    Granted, burning around 1 kilocycle of overhead per syscall isn't ideal either...


    Eg:
    Save registers to ISR stack;
    Copy registers to User context;
    Copy handler-task registers to ISR stack;
    Reload registers from ISR stack;
    Handle the syscall;
    Save registers to ISR stack;
    Copy registers to Syscall context;
    Copy User registers to ISR stack;
    Reload registers from ISR stack.


    Does mean that one needs to be economical with syscalls (say, doing
    "printf" a whole line at a time, rather than individual characters, ...).

    Not at all--I have reduced SysCalls to just a bit slower than an actual CALL,
    say around 10 cycles. Use them as often as you like.

    And, did create incentive to allow getting the microsecond-clock value
    and hardware RNG values from CPUID rather than needing a syscall (say,
    don't want to burn 20us to check the microsecond counter, ...).


    If the "memcpy's" could be eliminated, this could roughly halve the cost
    of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.

    One other option would be to do like RISC-V's privileged spec and have multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache lines; There is a 5th cache line that contains all the other PSW stuff.

    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are still fairly common (and despite the slowness of the mechanism, it seems like
    BJX2 syscalls still manage to be around an order of magnitude faster than Windows syscalls in terms of clock-cycle cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache lines
    at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
    of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.
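
    Spelling out the arithmetic (assuming 64-bit registers and 64-byte lines,
    which matches the "block of 4 cache lines" figure): 32 regs x 8 B = 256 B
    = 4 lines per file, plus the one extra line of PSW state, so 5 lines of
    traffic per switch; four banked copies would instead mean 4 x 4 = 16 lines
    of state resident in the core.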

    Though, it could make sense if there were a mechanism for a context switch
    to dump the whole register file to Block-RAM, and some mechanism to access
    this RAM via an MMIO interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.

    Pros/cons, seems like each possibility would also come with drawbacks:
    As-is: Slowness due to needing to save/reload everything;
    RISC-V: Expensive regfile, only works for limited cases;
    MMIO Backed + RV-like: Faster U<->S, but slower task switching.
    RAM Backed: Cache coherence becomes a critical feature.



    The RISC-V like approach makes sense if one assumes:
    There is a user process;
    There is a kernel running under it;
    We want to call from the user process into the kernel.

    So if you are running under a Real OS you don't need 2 sets of RFs in my
    model.

    Doesn't make so much sense, say, for:
    User Process A calls a VTable entry which calls into User Process B;
    Service A uses a VTable to call into the VFS;
    ...

    Say, where one is making use of horizontal context switches for control
    flow between logical tasks. Which would still remain fairly expensive
    under a RISC-V like model.

    Yes, but PTHREADing can be done without privilege and in a single instruction.

    One could have enough register banks for N logical tasks, but supporting
    4 or 8 copies of the register file is going to cost more than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
    Branch from VBR-table to ISR entry point;
    Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??

    Get some of the CRs saved off (we need R0 and R1 free here);
    Get the rest of the GPRs saved onto the stack;
    Call into the main part of the ISR handler (using normal C ABI);
    Restore most of the GPRs;
    Restore most of the CRs;
    Restore R0 and R1;
    Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
    Branch from VBR-table to ISR entry point;
    Call into the main part of the ISR handler (using normal C ABI);
    Do an RTE.

    See what it saves ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Thu Nov 23 19:17:14 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-21 5:12 p.m., MitchAlsup wrote:

    In My 66000, every <effective> SysCall goes deeper into the privilege
    hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV, Guest HV SysCalls real HV. No data structures need maintenance during
    these transitions of the hierarchy.

    Does it follow the same way for hardware interrupts? I think RISCV goes
    to the deepest level first, machine level, then redirects to lower
    levels as needed. I was planning on Q+ operating the same way.

    It depends; there is the school of thought that you just deliver control to
    someone who can always deal with it (Machine level in RISC-V), and there
    is the other school of thought that some table should encode which level
    of the system control is delivered to. The former allows SW to control
    every step of the process, the latter gets rid of all the SW checking
    and simplifies the process of getting to and back from interrupt handlers
    (and their associated soft IRQs).

    ARMv8 allows the interrupt and fast interrupt (IRQ, FIQ) signals to be delivered to the EL1 (operating system) ring unless system registers at
    higher (more privileged) exception levels trap the signal. EL3 (firmware) level is the most privileged level and generally 'owns' the FIQ signal,
    while the IRQ signal is owned by EL1 (bare metal OS) or EL2 (hypervisor).

    The destination exception level of each signal is controlled by
    bits in system registers (SCR_EL3 to direct them to EL3, HCR_EL2 to
    direct them to EL2).

    Interrupts can be assigned to one of two groups - group 0 which is
    always delivered as an FIQ and group 1 which is delivered as an IRQ.

    Group zero interrupts are considered "secure" interrupts and only
    secure accesses can modify the configuration of such interrupts.

    Group one interrupts can be either non-secure or secure depending on
    the security state of the target exception level (secure or non-secure).

    The higher priority half of the interrupt priority (8 bits) is considered
    a secure range, the rest non-secure, thus secure interrupts will always have higher priority than non-secure interrupts.
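
    In other words, with an 8-bit priority field (0 numerically highest), the
    split described above amounts to something like the following check;
    illustrative only, not an actual GIC register interface:

      /* Upper-priority half 0x00..0x7F = secure range, 0x80..0xFF = non-secure. */
      #include <stdbool.h>
      #include <stdint.h>

      static bool prio_is_secure_range(uint8_t prio)
      {
          return prio < 0x80;   /* lower value = higher priority */
      }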

    There is no software "checking" required.

    Exception return (i.e. context switch) loads the PSR from SPSR_ELx and
    the PC from ELR_ELx[*] and that's the entirety of the software visible state handled by the hardware. Each exception level has its own page table
    root registers (TTBR0_ELx, TTBR1_ELx for each half of the VA space), so
    there is nothing for software to reload. Hardware manages the TLB entries which are tagged with both security state and exception level.

    [*] Both are system registers (flops, not ram)

    [**] The secure flag (!SCR_EL3[NS]) acts like an 'invisible'
    address bit at bit N (where N is the number of bits of supported
    physical address). This provides two completely distinct N-bit
    address spaces - one secure and one non-secure with SCR_EL3[NS]
    controlling which space is used by accesses. NS only applies
    to EL 0 - 2, EL3 is always considered secure. N is typically 48,
    but can be up to 52 in the current versions of the architecture.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Thu Nov 23 21:08:45 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:


    I have a Linux friendly version where context switch is a single instruction.


    The Burroughs B3500 had a single such instruction, called
    Branch Reinstate (BRE).

    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV};
    each privilege level has its own {IP, RF, Root Pointer, CSP, Exception
    {Enabled, Raised}}, and a few more things, contained in 5 contiguous cache
    lines.

    The 4 privilege levels each have a pointer to those 5 cache lines. By
    writing the control register (HR instruction) one can change the control
    point for each level (of course you have to have appropriate permission)--
    but I decided that a user should have the ability to context switch to
    another user without needing OS intervention--thus pthreads do not need
    an excursion through the Guest OS to switch threads under the same memory
    map {but do when crossing processes}.

    Thus, all 4 privileges are always resident in the privilege hierarchy
    at the cost of 4 DoubleWord registers instead of at the cost of 4 RFs.
    With these levels all resident simultaneously, no table surgery is needed
    to switch levels {Root pointers, MTRR,...} and no RF save/restore is
    needed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Thu Nov 23 20:46:38 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Stefan Monnier wrote:


    I have a Linux friendly version where context switch is a single instruction.


    The Burroughs B3500 had a single such instruction, called
    Branch Reinstate (BRE).

    The task context (base register, limit register, accumulator, comparison
    and overflow flags) were stored in small region at absolute address 60
    and BRE would restore that state (and interrupts would save it).
    Index registers were mapped to base-relative addresses 8, 16 and 24
    (8 digits each).

    The V-Series did a complete revamp of the processor architecture to
    support larger memory sizes (both per task and systemwide) and
    SMP. A segmentation scheme was adopted (for backward compatibility)
    and seven additional base-limit pairs were added to support direct
    access to 8 segments at any time (called an environment). There
    could be up to 1,000,000 environments per task, each with up to
    8 active memory areas (and 92 inactive memory areas accessible to
    three special instructions for data movement and comparison).

    The instruction was renamed Branch Reinstate Virtual (BRV) and would
    read the task table entry and load all the relevent state, including
    loading the active environment table into the processor base-limit
    registers. BRV accessed a table in memory, indexed by task number,
    that stored all the state of the task (200 digits worth).

    At the same time, we added SMP support including an inter-cpu
    communication instruction (my invention) similar to the
    mechanism adopted a few years later when Intel added SMP
    support for P5.

    We also added hardware mutex and condition variable instructions;
    the "LOK" instruction would atomically acquire the mutex, if
    available, or interrupt to a microkernel scheduler if unavailable.
    "UNLK" would interrupt if a higher priority task was waiting
    for the lock. There were CAUS and WAIT instructions that
    offered capabilities similar to posix condition variables.

    Each defined lock had a canonical lock level (a 4 digit
    number) and the hardware would fail a lock request where
    the new lock canonical lock number is less than the current
    lock owned by the task (if any). Unlock enforced the
    reverse. This prevented any A-B deadlock situations from
    occurring, although with many locks in a large subsystem (e.g
    the MCP OS) it was tricky sometimes to assign lock numbers.
    This also implicitly encouraged programmers to minimize
    the critical section and avoid nested locking where possible.
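
    A sketch of the canonical-lock-level rule being described (just the
    acquire-side check, ignoring the interrupt-to-microkernel path; names are
    hypothetical, and the real B3500/V-Series check was in hardware):

      /* A task may only acquire locks in non-decreasing canonical-level
         order, which rules out A-B deadlock cycles. */
      #include <stdbool.h>

      struct task {
          int cur_lock_level;   /* level of the lock currently owned, -1 if none */
      };

      bool lock_acquire_allowed(const struct task *t, int new_lock_level)
      {
          /* fail the request if the new lock's canonical number is less
             than that of the lock the task already owns (if any) */
          return new_lock_level >= t->cur_lock_level;
      }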

    The microkernel only handled scheduling and interrupts, all
    MCP code ran in the context of either the task making the
    request, or in an 'independent runner' (a kernel thread)
    dispatched from the microkernel. I/O interrupts were dispatched
    to two different independent runners, one for normal interrupts
    and one for real-time interrupts. Real-time interrupts were
    used for document sorters (e.g. MICR reader/sorters processing checks/cheques/utility bills, etc) in order to be able to
    select the destination pocket for each document in the
    time interval from the read station to the pocket-select
    station (at 2500 documents per minute - 42 per second,
    one document every 24 milliseconds). We supported ten
    active sorters per host. Even had one host installed
    on an L-1011 with reader/sorters that processed
    checks on coast-to-coast overnight flights.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to MitchAlsup on Thu Nov 23 17:13:03 2023
    On 11/23/23 4:08 PM, MitchAlsup wrote:
    [snip]
    The 4 privilege levels, each, have a pointer to those 5 cache
    lines. By writing the control register (HR instruction) one
    can change the control point for each level (of course you
    have to have appropriate permission-- but I decided that a
    user should have the ability to context switch to another
    user without needing OS intervention--thus pthreads do not
    need an excursion through the Guest OS to switch threads
    under the same memory map {but do when crossing processes}.

    My 66000 also has Port Holes, which seem to offer some
    cross-protection-domain access.

    While not significantly helpful, I also wonder if privilege
    reducing operations could be lower cost by not involving the
    OS. This would require the OS to store the allowed privilege
    elsewhere, but this might be done anyway. It would also have
    little use (I suspect) and still require OS involvement to
    restore privilege. There might be some cases where privilege
    is only needed in an initialization stage, but that seems
    likely to be rare.

    Writing to the accessed and dirty bits of a PTE would also
    seem to be something that could, in theory, be allowed to a
    user-level process. Clearing the dirty bit could be dangerous
    if stale data was from another protection domain. Clearing
    the accessed bit would seem to only "strongly hint" that the
    page be victimized earlier; setting the dirty bit would not
    be different than a "silent store" [not useful it seems since
    a load/store instruction pair could accomplish the same] and
    setting the accessed bit would seem the same as performing a
    non-caching load to any location in the page acting as a
    "keep me" hint [probably not useful]. Even with this little
    thought, allowing these PTE changes seems not worthwhile.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Thu Nov 23 15:53:47 2023
    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:

    On 11/22/2023 12:38 PM, MitchAlsup wrote:
    BGB wrote:

    Yeah, but that is not exactly minimalist in terms of the hardware.

    Granted, burning around 1 kilocycle of overhead per syscall isn't
    ideal either...


    Eg:
       Save registers to ISR stack;
       Copy registers to User context;
       Copy handler-task registers to ISR stack;
       Reload registers from ISR stack;
       Handle the syscall;
       Save registers to ISR stack;
       Copy registers to Syscall context;
       Copy User registers to ISR stack;
       Reload registers from ISR stack.


    Does mean that one needs to be economical with syscalls (say, doing
    "printf" a whole line at a time, rather than individual characters, ...).

    Not at all--I have reduced SysCalls to just a bit slower than an actual CALL, say around 10 cycles. Use them as often as you like.


    OK.

    Well, they aren't very fast in my case, in any case.


    And, did create incentive to allow getting the microsecond-clock value
    and hardware RNG values from CPUID rather than needing a syscall (say,
    don't want to burn 20us to check the microsecond counter, ...).


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.
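
    For a rough sense of scale (assuming those figures): ~300 MB/s at 50 MHz
    is about 6 bytes per cycle, so if the copies total around 1.5 kB (the
    figure estimated later for the thread-switch case), the memcpy work alone
    is on the order of 250 cycles, before any of the other prolog/epilog work.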


    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and have
    multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache lines; There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, where there are User/Supervisor/Machine sets, with the mode affecting which of these is visible.

    Obvious drawback in my case is that this would effectively increase the
    number of internal GPRs from 64 to 192 (and, at that point, may as well
    go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial solution
    that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
    SH-4 banked out R0..R7 when entering an interrupt;
    The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of
    interrupt. Almost like they designed their interrupt mechanism around a particular chunk of ASM code or something. In my case, I kept a similar
    idea, but just used a fixed 8-byte spacing, with the idea of these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
    have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch, but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    I think at one point, I had considered having tasks have both User and Supervisor state (with two stacks and two copies of all the registers),
    but ended up not going this way (and instead giving the syscalls their own designated task context; which also saves on per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude" may
    have actually been a little conservative; seems like Windows syscalls
    may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take millions
    of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
    of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.
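
    As a rough worked figure (assuming 64 GPRs and a dozen-odd CRs at 8 bytes
    each): that is around 600 bytes of state, so even at the best case of 128
    bits (16 bytes) per cycle, a hardware save or restore would still be on
    the order of 40 cycles each way, ignoring any cache misses.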


    One bit of trickery would be, "what if" the Boot SRAM region were inside
    the L1 cache rather than out on the ringbus?...

    But, then one would have the cost of keeping 8K of SRAM close to the CPU
    core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any case...).

    Though keeping it tied to a specific CPU core (and effectively processor
    local) would avoid the ugly "what if" scenario of two CPU cores trying
    to service an interrupt at the same time and potentially stepping on
    each others' stacks. The main tradeoff vs putting the stacks in DRAM is
    mostly that DRAM may have (comparably more expensive) L2 misses.


    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code from
    the ISR stack or similar.


    Though, it could make sense if there were a mechanism for a context switch
    to dump the whole register file to Block-RAM, and some mechanism to access
    this RAM via an MMIO interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this route (baking parts of it into the C ABI is at least "slightly" less evil).

    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.


    I guess I can probably safely rule out MMIO under the basis that context switching via moving registers via MMIO would be slower than the current mechanism (of using a series of Load/Store instructions).


    Pros/cons, seems like each possibility would also come with drawbacks:
       As-is: Slowness due to needing to save/reload everything;
       RISC-V: Expensive regfile, only works for limited cases;
       MMIO Backed + RV-like: Faster U<->S, but slower task switching.
       RAM Backed: Cache coherence becomes a critical feature.



    The RISC-V like approach makes sense if one assumes:
       There is a user process;
       There is a kernel running under it;
       We want to call from the user process into the kernel.

    So if you are running under a Real OS you don't need 2 sets of RFs in my model.


    OK.


    Whether or not my "OS" is "Real" is still a bit debatable.
    From what I can tell, it is sort of loosely in Win 3.x territory (at best).


    As-in, can have multiple tasks and task switching, memory protection is
    rather lacking, and still using cooperative scheduling (preemptive has
    been experimented with, but at the moment is prone to cause stuff to
    explode; I will need to "sort stuff out a bit more" and add things like
    mutex locks around various things before this point).


    Main obvious difference is:
      while(cond)
      {
        thrd_yield();
        cond=some_check();
      }
    Is OK, but:
      while(cond)
        cond=some_check();

    May potentially lock up the OS if it gets stuck in an infinite loop.


    In my current "GUI experiments", its stability is an almost comedic
    level of badness (to what extent things work at all).


    But, then again, Win3.x in DOSBox is not exactly "rock solid" either, so
    even as primitive as it is, it seems "almost within reach". Like, "It
    may work, it may cause the video driver to corrupt itself (leading to a
    screen of indecipherable garbage or similar), or the Windows install
    might just decide to corrupt its files badly enough that one has to
    reinstall it to make it work again, ...".


    Though, ironically, I am still left making some uses of 16 color BMP
    images and CRAM and similar. Though, slightly atypical, in that I am
    using CRAM as a still image format, and hacked things so that both
    formats can support transparency.

    Say: 16-color BMP: The "High Intensity Magenta" color can be used as a transparent color if needed. For 8-bit CRAM, a 256-color palette is
    used, with one of the colors (0x80 in this case) being used as a
    transparent color.

    Note that "actual Windows" can't load these CRAM BMP's (but, also can't
    load a few of the "should work" formats either; like 2-bpp images or the
    older BITMAPCOREHEADER format).

    Then again, one could argue, maybe it doesn't make much sense for modern programs to be able to load formats that haven't seen much use since the
    days of CGA and Windows 1.x ?...



    Doesn't make so much sense, say, for:
       User Process A calls a VTable entry which calls into User Process B;
       Service A uses a VTable to call into the VFS;
       ...

    Say, where one is making use of horizontal context switches for
    control flow between logical tasks. Which would still remain fairly
    expensive under a RISC-V like model.

    Yes, but PTHREADing can be done without privilege and in a single instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler
    changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the moment,
    I am using a fairly simplistic round-robin scheduling strategy, so the scheduler mostly starts at a given PID, and looks for the next PID that
    holds a valid/running task (wrapping back to PID 1 if it hits the end,
    and stopping the search if it gets back to the original PID).


    The high-level threading model wasn't based on pthreads in my case, but
    rather C11 threads (and had implemented a lot of the "threads.h" stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads were
    a better fit.


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering the registers in the instruction decoder).

    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
    SP is mapped into R15 in the GPR space;
    SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.


    If I were eliminate this implicit register-swap mechanism, then the ISR
    entry would likely need to reload a constant address each time. Though,
    this change would also break binary compatibility with my existing code.

    But, in theory, eliminating the register swap could allow demoting SP to
    being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode is expensive.


    Though, some of my more recent design ideas would have gone over to an
    ordering slightly more like RISC-V, say:
    R0: ZR or PC (ALU or MEM)
    R1: LR or TBR (ALU or MEM)
    R2: SP
    R3: GP (GBR)
    R4 -R15: Scratch
    R16-R31: Callee Save
    R32-R47: Scratch
    R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    Though, if one assumes R4..R63 are GPRs, this would allow both this ISA
    and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC to
    be able to follow RISC-V's C ABI rules would be a non-trivial level of
    effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
    ALU or similar: ZR and LR (Zero and Link Register)
    Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a Read-Only register (allowing userland to modify TBR would effectively also allow
    userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder based
    on CPU mode isn't entirely free in terms of LUT cost or timing latency
    (even if it only applies to a subset of the register space).

    Note that for RV decoding:
    X0..X31 -> R0 ..R31 (more or less)
    F0..F31 -> R32..R63
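
    So the decode-side mapping is basically just an offset applied to the FPR
    numbers; illustrative sketch only:

      /* Map RISC-V architectural register numbers onto the single 64-entry
         internal register file described above. */
      static inline int rv_to_internal_reg(int rnum, int is_fpr)
      {
          return (is_fpr ? 32 : 0) + (rnum & 31);  /* X0..X31 -> R0..R31,
                                                       F0..F31 -> R32..R63 */
      }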

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some deviation
    in some areas (will it care that the high 32 bits of a Binary32 register
    don't hold NaN? Will it care about the extra funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost, and clock-cycle savings?...


    As-is, I can't come up with much that is both:
    Fairly cheap to implement in hardware;
    Would save a lot of clock-cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).


    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Say:
    BJX2: 64 GPRs, ~ 14 CRs in use.
    Some of the CRs defined (like the SMT set) don't currently exist.
    TEAH is specific to Addr96 mode;
    VIPT doesn't currently exist
    Will only exist if/when inverted page tables are added.
    STTB exists but isn't currently being used
    Was intended for supervisor-mode page tables;
    But, N/A if Supervisor Mode is reached via a task switch...

    RISC-V: 3x ( 32 GPRs + 32 FPRs), 3x a bunch of CSRs.
    So, theoretically, 192 registers, plus a bunch more CSRs.
    Nevermind that the 'V' extension would add more registers.
    Would we also need 3 copies of all the Vector registers, ... ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Nov 23 23:30:50 2023
    BGB wrote:

    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.

    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and have
    multiple copies of the register file (and likely instructions for
    accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache lines; There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, there are User/Supervisor/Machine sets, with the mode affecting which of these is visible.

    Obvious drawback in my case is that this would effectively increase the number of internal GPRs from 64 to 192 (and, at that point, may as well
    go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    Decode is not the problem, sensing 1:256 is a big problem, in practice
    even SRAMs only have 32-pairs of cells on a bit line using exotic timed
    sense amps.
    {{Decode is almost NEVER the logic delay problem:: ½ is situation recognition, the other ½ is fan-out buffering--driving the lines into the decoder is more gates of delay than determining if a given select line should be asserted.}}

    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial solution
    that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
    SH-4 banked out R0..R7 when entering an interrupt;
    The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of interrupt. Almost like they designed their interrupt mechanism around a particular chunk of ASM code or something. In my case, I kept a similar
    idea, but just used a fixed 8-byte spacing, with the idea of these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
    have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch,

    but why ?? the probability that control returns from a given IST to its
    softIRQ is less than ½ in a loaded system.

    but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    I think at one point, I had considered having tasks have both User and Supervisor state (with two stacks and two copies of all the registers),
    but ended up not going this way (and instead giving the syscalls their own designated task context, which also saves on per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude" may
    have actually been a little conservative; seems like Windows syscalls
    may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take millions
    of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in physical
    memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
    of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    6R6W RFs are as big as one can practically build. You can get as much
    Read BW by duplication, but you only have "so much" Write BW (even when
    you know each write is to a different register).

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.

    6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.

    One bit of trickery would be, "what if" the Boot SRAM region were inside
    the L1 cache rather than out on the ringbus?...

    2 things::
    a) By giving threadstate an address you gain the ability to load the
    initial RF image from ROM as the CPU comes out of reset--it comes out
    with a complete RF, a complete thread.header, mapping tables, privilege
    and priority.
    b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
    state (no underlying DRAM address available) so you have ~1MB to play around with until you find DRAM, configure, initialize, and put it in the free pool.)
    So, here, you HAVE "enough" storage to program BOOT activities in a HLL
    (of your choice).

    But, then one would have the cost of keeping 8K of SRAM close to the CPU
    core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any case...).

    Is the Icache and Dcache not close enough ?? If not then add L2 !!

    Though keeping it tied to a specific CPU core (and effectively processor local) would avoid the ugly "what if" scenario of two CPU cores trying
    to service an interrupt at the same time and potentially stepping on
    each others' stacks. The main tradeoff vs putting the stacks in DRAM is mostly that DRAM may have (comparably more expensive) L2 misses.

    The interrupt (re)mapping table takes care of this prior to the CPU being bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
    table associated with the "Originating" thread. (IO/-MMU). That interrupt
    is logged into the table and if enabled its priority is used to determine
    which set of CPUs should be bothered, the affinity mask of the "Originating" thread is used to qualify which CPU from the priority set, and one of these
    is selected. The selected CPU is tapped on the shoulder, and sends a get-Interrupt request to the Interrupt table logic which sends back the priority and number of a pending interrupt. If the CPU is still at lower priority
    than the returning interrupt, the CPU <at this point> stops running code
    from the old thread and begins running code on the new thread.
    {{During the sending of the interrupt to the CPU and the receipt of the claim-Interrupt message, that interrupt will not get handed to any other
    CPU}} So, the CPU continues to run instructions while the CPUs contend
    for and claim unique interrupts. There are 512 unique interrupts at each of
    64 priority levels, and each process can have its own Interrupt Table.
    These tables need no maintenance except when interrupts are created and destroyed.}}

    HV, Guest HV, Guest OS each have their own unique interrupt tables;
    Although it could be arranged such that all could use the same table.

    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code from
    the ISR stack or similar.


    Though, could make sense if one has a mechanism where a context switch
    could have a mechanism to dump the whole register file to Block-RAM,
    and some sort of mechanism to access this RAM via an MMIO interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this route (baking parts of it into the C ABI is at least "slightly" less evil).

    My mechanism is taking that struct task.....s (at least the part HW
    needs to understand) and associating each one into a table that points
    at DRAM. Now, when you want this thread to run, you load up the pointer
    set the e-bit (enabled) and write it into the current header at its
    privilege level. Poof--all 5 cache lines of state from the currently
    running thread goes back to where it permanent home in DRAM is, and
    the new thread fetches 5 cache lines of state of the new thread.
    a) you can start the reads before you start the writes
    b) you can start the writes anytime you have outbound access to "the bus"
    c) the writes can be no later than the ½ cycle before the reads get written. Which is a lot faster than you can do in SW with LDs and STs.
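    To make the "4 cache lines of registers plus a 5th of PSW state"
    picture concrete, here is a rough C mock-up, assuming 64-byte lines and
    32 64-bit registers; the field names are invented for illustration and
    are not taken from any MY 66000 document:

        #include <stdint.h>

        struct thread_state {
            uint64_t gpr[32];      /* 32 x 8 bytes = 256 bytes = 4 cache lines */
            uint64_t ip;           /* 5th line: PSW-like control state         */
            uint64_t psw;
            uint64_t root_ptr;     /* e.g. a mapping-table pointer             */
            uint64_t pad[5];       /* fill the 5th line out to 64 bytes        */
        };
        /* 5 x 64 = 320 bytes of state per thread */
        _Static_assert(sizeof(struct thread_state) == 5 * 64, "5 cache lines");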

    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.

    I config-space mapped all my CRs, so you get an unlimited number of them.

    I guess I can probably safely rule out MMIO, on the basis that context switching by moving registers via MMIO would be slower than the current mechanism (of using a series of Load/Store instructions).
    .................
    Yes, but PTHREADing can be done without privilege and in a single
    instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    In my case it is about MemoryLatency+5 cycles.

    Yes, thread switch is a 1-way function--which is the reason you can
    allow a user to preempt himself and allow a compatriot to run in his
    place.....

    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    My Real Time version of MY 66000 does 10-ish cycle context switch
    (as seen at the CPU) but here a hunk of HW has gathered up those 5 cache
    lines and sent them to the targeted CPU and all the CPU has to do is push
    out the old state (5 cache lines). So the data was heading towards the CPU before the CPU even knew it wanted that data !!

    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the moment,
    I am using a fairly simplistic round-robin scheduling strategy, so the scheduler mostly starts at a given PID, and looks for the next PID that
    holds a valid/running task (wrapping back to PID 1 if it hits the end,
    and stopping the search if it gets back to the original PID).
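    A minimal sketch of that round-robin search (assuming a
    task_is_runnable() check that stands in for whatever the real scheduler
    tests, and with PID 0 reserved):

        extern int task_is_runnable(int pid);   /* hypothetical predicate */

        int find_next_pid(int cur_pid, int max_pid)
        {
            int pid = cur_pid;
            do {
                pid = (pid >= max_pid) ? 1 : pid + 1;  /* wrap back to PID 1 */
                if (task_is_runnable(pid))
                    return pid;
            } while (pid != cur_pid);
            return cur_pid;   /* nothing else runnable, keep the current task */
        }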


    The high-level threading model wasn't based on pthreads in my case, but rather C11 threads (and had implemented a lot of the "threads.h" stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads were
    a better fit.
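    For reference, the C11 <threads.h> interface mentioned here is small
    enough that a pthread-style wrapper is mostly a thin shim over it; a
    minimal usage example:

        #include <threads.h>
        #include <stdio.h>

        static int worker(void *arg)
        {
            printf("hello from thread %d\n", *(int *)arg);
            return 0;
        }

        int main(void)
        {
            thrd_t t;
            int id = 1, res;
            if (thrd_create(&t, worker, &id) != thrd_success)
                return 1;
            thrd_join(t, &res);    /* join and collect the return value */
            return res;
        }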


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering the registers in the instruction decoder).

    So, in effect, you actually have 33 registers with only 32 visible at
    any instant. I am just so glad not to have gone down that rabbit hole
    this time......

    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
    SP is mapped into R15 in the GPR space;
    SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.
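    Conceptually (not actual decoder logic), the SP/SSP swap amounts to
    something like the following; the SSP index used here is an illustrative
    slot in the CR space, not the real one:

        #define REG_SP   15      /* architectural R15 (SP in the GPR space)     */
        #define REG_SSP  0x6F    /* made-up physical index somewhere in CR space */

        static inline int phys_reg(int arch_reg, int in_isr)
        {
            /* While in an ISR, references to SP are redirected to SSP. */
            if (arch_reg == REG_SP && in_isr)
                return REG_SSP;
            return arch_reg;
        }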


    If I were to eliminate this implicit register-swap mechanism, then the ISR
    entry would likely need to reload a constant address each time. Though,
    this change would also break binary compatibility with my existing code.

    But, in theory, eliminating the register swap could allow demoting SP to being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode is expensive.


    Though, some of my more recent design ideas would have gone over to an ordering slightly more like RISC-V, say:
    R0: ZR or PC (ALU or MEM)
    R1: LR or TBR (ALU or MEM)
    R2: SP
    R3: GP (GBR)
    R4 -R15: Scratch
    R16-R31: Callee Save
    R32-R47: Scratch
    R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    R0:: GPR, Return Address, proxy for IP, proxy for 0
    R1..R9 Arguments and results passed in registers
    R10..R15 Temporary Registers (scratch)
    R16..R29 Callee Save
    R30 FP when in use, Callee Save
    R31 SP

    Though, if one assumes R4..R63 are GPRs, this would allow both this ISA
    and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC to
    be able to follow RISC-V's C ABI rules would be a non-trivial level of effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
    ALU or similar: ZR and LR (Zero and Link Register)
    Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a Read-Only register (allowing userland to modify TBR would effectively also allow userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder based
    on CPU mode isn't entirely free in terms of LUT cost or timing latency
    (even if it only applies to a subset of the register space).

    Note that for RV decoding:
    X0..X31 -> R0 ..R31 (more or less)
    F0..F31 -> R32..R63

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some deviation
    in some areas (will it care that the high 32 bits of a Binary32 register don't hold NaN? Will it care about the extra funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost,

    the HW cost has already been purchased by the state machine that
    writes out 5-cache lines and waits for 5-cache lines to arrive.

    and clock-cycle savings?...
    The reads can arrive before you start the writes, you can go so far
    as to organize your pipeline so the read data being written pushes
    out the write data that needs to return to memory, making the timing
    brain dead easy to achieve.


    As-is, I can't come up with much that is both:
    Fairly cheap to implement in hardware;
    Would save a lot of clock cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).

    Yet, you seem to be buying insurance as if you might need to head in that direction.

    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Thu Nov 23 21:36:41 2023
    On 2023-11-23 6:30 p.m., MitchAlsup wrote:
    BGB wrote:

    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.

    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and
    have multiple copies of the register file (and likely instructions
    for accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache
    lines;
    There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, there are
    User/Supervisor/Machine sets, with the mode affecting which of these
    is visible.

    Obvious drawback in my case is that this would effectively increase
    the number of internal GPRs from 64 to 192 (and, at that point, may as
    well go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    Decode is not the problem, sensing 1:256 is a big problem, in practice
    even SRAMs only have 32-pairs of cells on a bit line using exotic timed
    sense amps.
    {{Decode is almost NEVER the logic delay problem:: ½ is situation recognition,
    the other ½ is fan-out buffering--driving the lines into the decoder is
    more
    gates of delay than determining if a given select line should be
    asserted.}}

    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial
    solution that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
       SH-4 banked out R0..R7 when entering an interrupt;
       The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of
    interrupt. Almost like they designed their interrupt mechanism around
    a particular chunk of ASM code or something. In my case, I kept a
    similar idea, but just used a fixed 8-byte spacing, with the idea of
    these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a
    dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
    which would have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch,

    but why ?? the probability that control returns from a given IST to its softIRQ is less than ½ in a loaded system.

                                                    but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    I think at one point, I had considered having tasks have both User and
    Supervisor state (with two stacks and two copies of all the
    registers), but ended up not going this way (and instead giving the
    syscalls their own designated task context, which also saves on
    per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle
    cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude"
    may have actually been a little conservative; seems like Windows
    syscalls may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take
    millions of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in
    physical memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    6R6W RFs are as big as one can practically build. You can get as much
    Read BW by duplication, but you only have "so much" Write BW (even when
    you know each write is to a different register).

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.

    6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.

    One bit of trickery would be, "what if" the Boot SRAM region were
    inside the L1 cache rather than out on the ringbus?...

    2 things::
    a) By giving threadstate an address you gain the ability to load the
    initial RF image from ROM as the CPU comes out of reset--it comes out
    with a complete RF, a complete thread.header, mapping tables, privilege
    and priority.
    b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
    state (no underlying DRAM address available) so you have ~1MB to play around with until you find DRAM, configure, initialize, and put it in the free pool.)
    So, here, you HAVE "enough" storage to program BOOT activities in a HLL
    (of your choice).

    But, then one would have the cost of keeping 8K of SRAM close to the
    CPU core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any
    case...).

    Is the Icache and Dcache not close enough ?? If not then add L2 !!

    Though keeping it tied to a specific CPU core (and effectively
    processor local) would avoid the ugly "what if" scenario of two CPU
    cores trying to service an interrupt at the same time and potentially
    stepping on each others' stacks. The main tradeoff vs putting the
    stacks in DRAM is mostly that DRAM may have (comparably more
    expensive) L2 misses.

    The interrupt (re)mapping table takes care of this prior to the CPU being bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
    table associated with the "Originating" thread. (IO/-MMU). That interrupt
    is logged into the table and if enabled its priority is used to determine which set of CPUs should be bothered, the affinity mask of the "Originating"
    thread is used to qualify which CPU from the priority set, and one of these is selected. The selected CPU is tapped on the shoulder, and sends a get-Interrupt request to the Interrupt table logic which sends back the priority
    and number of a pending interrupt. If the CPU is still at lower priority
    than the returning interrupt, the CPU <at this point> stops running code
    from the old thread and begins running code on the new thread.
    {{During the sending of the interrupt to the CPU and the receipt of the claim-Interrupt message, that interrupt will not get handed to any other CPU}} So, the CPU continues to run instructions while the CPUs contend
    for and claim unique interrupts. There are 512 unique interrupts at each of
    64 priority levels, and each process can have its own Interrupt Table.
    These tables need no maintenance except when interrupts are created and destroyed.}}

    HV, Guest HV, Guest OS each have their own unique interrupt tables;
    Although it could be arranged such that all could use the same table.

    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code
    from the ISR stack or similar.


    Though, could make sense if one has a mechanism where a context
    switch could have a mechanism to dump the whole register file to
    Block-RAM, and some sort of mechanism to access this RAM via an MMIO
    interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this
    route (baking parts of it into the C ABI is at least "slightly" less
    evil).

    My mechanism is taking that struct task.....s (at least the part HW
    needs to understand) and associating each one into a table that points
    at DRAM. Now, when you want this thread to run, you load up the pointer
    set the e-bit (enabled) and write it into the current header at its
    privilege level. Poof--all 5 cache lines of state from the currently
    running thread goes back to where its permanent home in DRAM is, and
    the new thread fetches 5 cache lines of state of the new thread.
    a) you can start the reads before you start the writes
    b) you can start the writes anytime you have outbound access to "the bus"
    c) the writes can be no later than the ½ cycle before the reads get written. Which is a lot faster than you can do in SW with LDs and STs.

    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.

    I config-space mapped all my CRs, so you get an unlimited number of them.

    I guess I can probably safely rule out MMIO, on the basis that
    context switching by moving registers via MMIO would be slower than
    the current mechanism (of using a series of Load/Store instructions).
    .................
    Yes, but PTHREADing can be done without privilege and in a single
    instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    In my case it is about MemoryLatency+5 cycles.
    Yes, thread switch is a 1-way function--which is the reason you can
    allow a user to preempt himself and allow a compatriot to run in his place.....

    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    My Real Time version of MY 66000 does 10-ish cycle context switch
    (as seen at the CPU) but here a hunk of HW has gathered up those 5 cache lines and sent them to the targeted CPU and all the CPU has to do is push
    out the old state (5 cache lines). So the data was heading towards the
    CPU before the CPU even knew it wanted that data !!

    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler
    changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the
    moment, I am using a fairly simplistic round-robin scheduling
    strategy, so the scheduler mostly starts at a given PID, and looks for
    the next PID that holds a valid/running task (wrapping back to PID 1
    if it hits the end, and stopping the search if it gets back to the
    original PID).


    The high-level threading model wasn't based on pthreads in my case,
    but rather C11 threads (and had implemented a lot of the "threads.h"
    stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads
    were a better fit.


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering
    the registers in the instruction decoder).

    So, in effect, you actually have 33 registers with only 32 visible at
    any instant. I am just so glad not to have gone down that rabbit hole
    this time......

    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
       SP is mapped into R15 in the GPR space;
       SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.


    If I were to eliminate this implicit register-swap mechanism, then the
    ISR entry would likely need to reload a constant address each time.
    Though, this change would also break binary compatibility with my
    existing code.

    But, in theory, eliminating the register swap could allow demoting SP
    to being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode is expensive.


    Though, some of my more recent design ideas would have gone over to an
    ordering slightly more like RISC-V, say:
       R0: ZR or PC  (ALU or MEM)
       R1: LR or TBR (ALU or MEM)
       R2: SP
       R3: GP (GBR)
       R4 -R15: Scratch
       R16-R31: Callee Save
       R32-R47: Scratch
       R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    R0::     GPR, Return Address, proxy for IP, proxy for 0
    R1..R9   Arguments and results passed in registers
    R10..R15 Temporary Registers (scratch)
    R16..R29 Callee Save
    R30      FP when in use, Callee Save
    R31      SP

    Though, if one assumes R4..R63 are GPRs, this would allow both this
    ISA and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC
    to be able to follow RISC-V's C ABI rules would be a non-trivial level
    of effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
       ALU or similar: ZR and LR (Zero and Link Register)
       Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a
    Read-Only register (allowing userland to modify TBR would effectively
    also allow userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder
    based on CPU mode isn't entirely free in terms of LUT cost or timing
    latency (even if it only applies to a subset of the register space).

    Note that for RV decoding:
       X0..X31 -> R0 ..R31 (more or less)
       F0..F31 -> R32..R63

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some
    deviation in some areas (will it care that the high 32 bits of a
    Binary32 register don't hold NaN? Will it care about the extra
    funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost,

    the HW cost has already been purchased by the state machine that writes
    out 5-cache lines and waits for 5-cache lines to arrive.

    and clock-cycle savings?...
    The reads can arrive before you start the writes, you can go so far as
    to organize your pipeline so the read data being written pushes
    out the write data that needs to return to memory, making the timing
    brain dead easy to achieve.


    As-is, I can't come up with much that is both:
       Fairly cheap to implement in hardware;
    Would save a lot of clock cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).

    Yet, you seem to be buying insurance as if you might need to head in that direction.

    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    My 68000 CPU core had a couple of task switching instructions added to
    it. I made a dedicated task switch RAM wide enough to load or store all
    the 68k registers in a single clock. Total task switch time was about
    four clocks IIRC. The interrupt vector table was set up to be able to automatically task switch on interrupt. The RAM had storage for up to
    512 tasks, but it was dedicated inside the CPU core rather than storing
    task information in the memory system.

    Q+ has a 64 register file, so it would take eight or nine cache lines to
    store the context. Q+ register file is 4w18r ATM. Getting from the
    register file to or from a cache line is a challenge. To access groups
    of eight registers at once would mean adding or using eight register
    file ports. The register file has only four write ports so only ½ of a
    cache line could be written to the file in a clock cycle. It is
    appealing to handle multiple registers per clock. Read/write ports are dedicated to specific function units, so making use of them for task
    switching may involve additional logic. I called the CSR to store the
    task state address the TS CSR.

    As I understand it normally RISCV does not use multiple register files,
    it has only a single file. There may be implementations out there that
    do make use of multiple files, but I think the standard is set up to get
    by with a single file.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Fri Nov 24 03:11:17 2023
    Robert Finch wrote:

    On 2023-11-23 6:30 p.m., MitchAlsup wrote:
    BGB wrote:



    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    My 68000 CPU core had a couple of task switching instructions added to
    it. I made a dedicated task switch RAM wide enough to load or store all
    the 68k registers in a single clock. Total task switch time was about
    four clocks IIRC. The interrupt vector table was set up to be able to automatically task switch on interrupt. The RAM had storage for up to
    512 tasks, but it was dedicated inside the CPU core rather than storing
    task information in the memory system.

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Q+ has a 64 register file, so it would take eight or nine cache lines to store the context. Q+ register file is 4w18r ATM. Getting from the
    register file to or from a cache line is a challenge. To access groups
    of eight registers at once would mean adding or using eight register
    file ports. The register file has only four write ports so only ½ of a
    cache line could be written to the file in a clock cycle. It is
    appealing to handle multiple registers per clock. Read/write ports are dedicated to specific function units, so making use of them for task switching may involve additional logic. I called the CSR to store the
    task state address the TS CSR.

    4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
    Yet you chose 18. Why ?

    This is above and beyond the "typical" operand consumption of a RISC ISA.
    Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
    12R allowing 1 FU to consume 3-registers and 1 FU having only 1-operand
    (or forwarding). What are you using the other 5-operands for ??

    As I understand it normally RISCV does not use multiple register files,

    RISC-V has a 32 entry GPR and a 32 entry FPR.

    it has only a single file. There may be implementations out there that
    do make use of multiple files, but I think the standard is setup to get
    by with a single file.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Thu Nov 23 23:37:54 2023
    On 2023-11-23 10:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-23 6:30 p.m., MitchAlsup wrote:
    BGB wrote:



    Whereas that special State is only a dozen registers <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    My 68000 CPU core had a couple of task switching instructions added to
    it. I made a dedicated task switch RAM wide enough to load or store
    all the 68k registers in a single clock. Total task switch time was
    about four clocks IIRC. The interrupt vector table was set up to be
    able to automatically task switch on interrupt. The RAM had storage
    for up to 512 tasks, but it was dedicated inside the CPU core rather
    than storing task information in the memory system.

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Q+ has a 64 register file, so it would take eight or nine cache lines
    to store the context. Q+ register file is 4w18r ATM. Getting from the
    register file to or from a cache line is a challenge. To access groups
    of eight registers at once would mean adding or using eight register
    file ports. The register file has only four write ports so only ½ of a
    cache line could be written to the file in a clock cycle. It is
    appealing to handle multiple registers per clock. Read/write ports are
    dedicated to specific function units, so making use of them for task
    switching may involve additional logic. I called the CSR to store the
    task state address the TS CSR.

    4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
    Yet you chose 18. Why ?
    This is above and beyond the "typical" operand consumption of a RISC ISA. Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
    12R allowing 1 FU to consume 3-registers and 1 FU having only 1-operand
    (or forwarding). What are you using the other 5-operands for ??

    As I understand it normally RISCV does not use multiple register files,

    RISC-V has a 32 entry GPR and a 32 entry FPR.

    it has only a single file. There may be implementations out there that
    do make use of multiple files, but I think the standard is setup to
    get by with a single file.

    I have 4w1r replicated 18 times. That is enough read ports to supply
    three operands each to six functional units. All six functional units
    may be scheduled at the same time. I have thought of trying to use fewer
    read ports by prioritizing the ports as it is unlikely that all ports
    would be needed at the same time. The current design is simple, but not resource efficient. Six function units are ALU0, ALU1, FPU, FCU, LOAD,
    STORE. The FCU really only needs two source operands.

    There is no forwarding in the design (yet). I have read this cost about
    10% in performance. I think this may be made up for by a smaller design
    that can operate at a higher fmax. I have found in the past that
    forwarding muxes appear on the critical timing path. I have seen another
    design eliminating forwarding. It made the difference between operating
    at 50 MHz or 60 MHz+. 20% gain in fmax. I think this may be an aspect of
    an FPGA implementation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 24 00:44:04 2023
    On 11/23/2023 5:30 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/23/2023 10:53 AM, MitchAlsup wrote:
    BGB wrote:


    If the "memcpy's" could be eliminated, this could roughly halve the
    cost of doing a syscall.

    I have MM (memory move) as a 3-operand instruction.


    None in my case...

    But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
    Still might be better to not do a memcpy in these cases.

    Say, if the ISR handler could "merely" reassign the TBR register to
    switch from one task to another to perform the context switch (still
    ignoring all the loads/stores hidden in the prolog and epilog).


    One other option would be to do like RISC-V's privileged spec and
    have multiple copies of the register file (and likely instructions
    for accessing these alternate register files).

    There is one CPU register file, and every running thread has an address
    where that file comes from and goes to--just like a block of 4 cache
    lines;
    There is a 5th cache line that contains all the other PSW stuff.


    No direct equivalent.


    I was thinking sort of like the RISC-V Privileged spec, there are
    User/Supervisor/Machine sets, with the mode affecting which of these
    is visible.

    Obvious drawback in my case is that this would effectively increase
    the number of internal GPRs from 64 to 192 (and, at that point, may as
    well go to 4 copies and have 256).

    If this were handled in the decoder, this would mean roughly a 9-bit
    register selector field (vs the current 7 bits).

    Decode is not the problem, sensing 1:256 is a big problem, in practice
    even SRAMs only have 32-pairs of cells on a bit line using exotic timed
    sense amps.
    {{Decode is almost NEVER the logic delay problem:: ½ is situation recognition,
    the other ½ is fan-out buffering--driving the lines into the decoder is
    more
    gates of delay than determining if a given select line should be
    asserted.}}


    I had noted that there is a noticeable LUT cost difference between 32
    and 64 GPRs, which seems to be somewhat bigger than the difference
    expected from going from 5b/3b LUTRAMs to 6b/2b LUTRAMs.

    Like, adding a bit to the internal register ID fields (6b to 7b)
    propagated cost across the whole pipeline.


    The alternative would be to handle the register banking in the register
    file, using the CPU mode to select between the possible register banks.

    However, if still using LUTRAMs, the increase in register file size
    would likely increase the number of LUTs by roughly 5x.


    A theoretical estimate for the core number of LUTRAMs and "array support
    LUTs":
    32 GPRs: 396
    64 GPRs: 576
    256 GPRs: 2880

    This is ignoring the LUTs going into things like register forwarding, etc.

    Based on past experience, I suspect the actual cost difference to be a
    bit larger (given, say, the difference between a 32 GPR and 64 GPR configuration is notably larger than 180 LUTs).



    The increase in the number of CRs could be less, since only a few of
    them actually need duplication.


    But, don't want to go this way, and it would only be a partial
    solution that also does not map up well to my current implementation.



    Not sure how an OS on SH-4 would have managed all this, but I suspect
    their interrupt model would have had similar limitations to mine.

    Major differences:
       SH-4 banked out R0..R7 when entering an interrupt;
       The VBR relative entry-point offsets were a bit, ad-hoc.

    There were some fairly arbitrary displacements based on the type of
    interrupt. Almost like they designed their interrupt mechanism around
    a particular chunk of ASM code or something. In my case, I kept a
    similar idea, but just used a fixed 8-byte spacing, with the idea of
    these spots branching to the actual entry point.

    Though, one other difference is in my case I ended up adding a
    dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
    which would have gone to the FAULT handler instead.


    It is in-theory possible to jump from Interrupt Mode to normal
    Supervisor Mode without a full context switch,

    but why ?? the probability that control returns from a given IST to its softIRQ is less than ½ in a loaded system.


    One might want to jump to save the cost of 2 context switches, but the
    hair this would involve didn't seem worth it.

    It would also result in a few other issues:
    System calls would not be interruptible;
    System calls could not reschedule the caller.
    Effectively, this would hinder things like "usleep()" or "yield()".

    Seemed better to go the route I did.


                                                    but the specifics of
    doing so would get a bit more hairy and arcane (which is sort of why I
    just sorta ended up using a context switch).

    Not sure what Linux on SH-4 had done, didn't really investigate this
    part of the code all that much at the time.


    In theory, the ISR handlers could be made to mimic the x86 TSS
    mechanism, but this wouldn't gain much.

    Stay away from anything you see in x86 except in using it a moniker
    to avoid.


    Yeah, not really losing much by not having a TSS...
    But, Intel probably thought it was a good idea...


    I think at one point, I had considered having tasks have both User and
    Supervisor state (with two stacks and two copies of all the
    registers), but ended up not going this way (and instead giving the
    syscalls their own designated task context, which also saves on
    per-task memory overhead).


    Worth the cost? Dunno.

    In my opinion--Absolutely worth it.

    Not too much different to modern Windows, where slow syscalls are
    still fairly common (and despite the slowness of the mechanism, it
    seems like BJX2 syscalls still manage to be around an order of
    magnitude faster than Windows syscalls in terms of clock-cycle
    cost...).

    Now, just get it down to a cache missing {L1, L2} instruction fetch.


    Looked into it a little more, realized that "an order of magnitude"
    may have actually been a little conservative; seems like Windows
    syscalls may be more in the area of 50-100k cycles.

    Why exactly? Dunno.


    This is still ignoring some of the "slow cases" which may take
    millions of clock cycles.

    It also seems like fast-ish syscalls may be more of a Linux thing.



    Why not just treat the RF as a cache with a known address in
    physical memory.
    In MY 66000 that is what I do and then just push and pull 4 cache
    lines at a
    time.


    Possible, but poses its own share of problems...

    Not sure how this could be implemented cost-effectively, or for that
    matter, more cheaply than a RISC-V style mode-banked register-file.

    1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead of having 4 cache lines of state and 1 doubleword of address, you need
    16 cache lines of state.


    OK.


    Having only 1 set of registers is good...

    Issue is the mechanism for how to get all the contents in/out of the
    register file, in a way that is both cost effective, and faster than
    using a series of Load/Store instructions would have otherwise been.

    6R6W RFs are as big as one can practically build. You can get as much
    Read BW by duplication, but you only have "so much" Write BW (even when
    you know each write is to a different register).

    Short of a pipeline redesign, it is unlikely to exceed a best case of
    around 128 bits per clock cycle, with (in practice) there typically
    being other penalties due to things like L1 misses and similar.

    6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.


    If I were to use all the ports currently available, this would be:
    384 bits out, 192 bits in, per cycle.

    MOV.X can move 128-bits per cycle.

    My L1 cache can currently deal with 128 bits, but going bigger would
    pose issues. Biggest I could theoretically go at present would be
    256-bits with a mandatory 256-bit alignment.

    Anything beyond this would likely require significantly redesigning the
    L1 cache, and possibly also needing to modify the ringbus to do bigger transfers (a 256-bit front-end interface buys little if the operation is dominated by L1 misses).


    And, as-is, a fair chunk of the cost is L1 misses, and bigger transfers
    won't fix this.

    128-bits per cycle works, I can do this from software, ...


    But, yeah, I can still get a theoretical 50% reduction by
    saving/restoring registers directly into the TBR register-save area,
    rather than using "memcpy()" to do so...


    One bit of trickery would be, "what if" the Boot SRAM region were
    inside the L1 cache rather than out on the ringbus?...

    2 things::
    a) By giving threadstate an address you gain the ability to load the
    initial RF image from ROM as the CPU comes out of reset--it comes out
    with a complete RF, a complete thread.header, mapping tables, privilege
    and priority.
    b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
    state (no underlying DRAM address available) so you have ~1MB to play around with until you find DRAM, configure, initialize, and put it in the free pool.)
    So, here, you HAVE "enough" storage to program BOOT activities in a HLL
    (of your choice).


    My Boot ROM is already mostly written in C...

    Well, apart from the sanity checks, which are written in ASM.

    I did cut some corners to save space, for example, the Boot ROM's FAT
    driver is Read-Only and lacks support for LFNs, ... Mostly since for
    finding "bootload.sys" in the root directory, I don't need anything
    beyond 8.3 filenames, ...
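    Since FAT stores the short name as 11 upper-case, space-padded bytes,
    the lookup for "bootload.sys" really only needs a memcmp; a minimal
    sketch (not the actual Boot ROM code):

        #include <string.h>

        /* Match a raw 11-byte 8.3 directory-entry name against BOOTLOAD.SYS. */
        static int is_bootload_sys(const unsigned char *dirent_name)
        {
            return memcmp(dirent_name, "BOOTLOADSYS", 11) == 0;
        }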


    But, then one would have the cost of keeping 8K of SRAM close to the
    CPU core that is mostly only ever used during interrupt handling (but,
    probably still cheaper than making the register file 3x bigger, in any
    case...).

    Is the Icache and Dcache not close enough ?? If not then add L2 !!


    Accessing the Boot SRAM is around the same latency as accessing the L2
    cache, but has the advantage that it can never have an L2 miss.


    Though keeping it tied to a specific CPU core (and effectively
    processor local) would avoid the ugly "what if" scenario of two CPU
    cores trying to service an interrupt at the same time and potentially
    stepping on each others' stacks. The main tradeoff vs putting the
    stacks in DRAM is mostly that DRAM may have (comparably more
    expensive) L2 misses.


    Realized a simpler solution to the above issue (without needing to significantly redesign stuff):
    Make the SRAM area bigger and then subdivide it for each CPU core.

    Say:
    Core 1 gets a stack at 0x0000DF00, Core 2 at 0x0000FF00
    Or:
    Core 1 at 0x0000CF80
    Core 2 at 0x0000DF80
    Core 3 at 0x0000EF80
    Core 4 at 0x0000FF80

    And then hope none of the ISR's overflows their assigned stack space.
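    With the example addresses above, the per-core carve-up is just an
    offset computation; a sketch, using the figures from the text (4 KB of
    SRAM per core, cores numbered from 0 here):

        #include <stdint.h>

        #define SRAM_STACK_TOP0  0x0000CF80u   /* core 0 ("Core 1" above)  */
        #define SRAM_STACK_STEP  0x00001000u   /* 4 KB of SRAM per core    */

        static inline uint32_t isr_stack_top(int core_id)
        {
            /* core 0 -> 0xCF80, core 1 -> 0xDF80, core 2 -> 0xEF80, ... */
            return SRAM_STACK_TOP0 + (uint32_t)core_id * SRAM_STACK_STEP;
        }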


    The interrupt (re)mapping table takes care of this prior to the CPU being bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
    table associated with the "Originating" thread. (IO/-MMU). That interrupt
    is logged into the table and if enabled its priority is used to determine which set of CPUs should be bothered, the affinity mask of the "Originating"
    thread is used to qualify which CPU from the priority set, and one of these is selected. The selected CPU is tapped on the shoulder, and sends a get-Interrupt request to the Interrupt table logic which sends back the priority
    and number of a pending interrupt. If the CPU is still at lower priority
    than the returning interrupt, the CPU <at this point> stops running code
    from the old thread and begins running code on the new thread.
    {{During the sending of the interrupt to the CPU and the receipt of the claim-Interrupt message, that interrupt will not get handed to any other CPU}} So, the CPU continues to run instructions while the CPUs contend
    for and claim unique interrupts. There are 512 unique interrupts at each of
    64 priority levels, and each process can have its own Interrupt Table.
    These tables need no maintenance except when interrupts are created and destroyed.}}

    HV, Guest HV, and Guest OS each have their own unique interrupt tables,
    although it could be arranged such that all use the same table.


    Hmm, my boot-time state is more like:
    SR: Initialized to Supervisor mode and BJX2 Baseline;
    PC: Set to 0;
    MMCR: Set to 0;
    VBR: Set to 0;
    Most everything else: Potentially random garbage.

    When the RESET signal is asserted, some logic on the RingBus also
    effectively flushes anything on the bus to 0, since on the FPGA, it
    tends to start up in a state where the bus is filled with garbage (all
    the FF's and LUTRAMs tend to start up containing garbage, but curiously
    all of the BRAM's seem to be cleared to 0).

    Interestingly, my Verilog simulations artificially inject some amount of
    random garbage for testing purposes (otherwise, Verilator seems to start
    up with everything cleared to 0, unlike the real FPGA).


    Would add a potential "wonk" factor though, if this SRAM region were
    only visible for D$ access, but inaccessible from the I$. But, I guess
    one can argue, there isn't really a valid reason to try to run code
    from the ISR stack or similar.


    Though, could make sense if one has a mechanism where a context
    switch could have a mechanism to dump the whole register file to
    Block-RAM, and some sort of mechanism to access this RAM via an MMIO
    interface.

    Just put it in DRAM at SW controlled (via TLB) addresses.


    Possibly.

    It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
    stuff could be baked into hardware... But, I don't want to go this
    route (baking parts of it into the C ABI is at least "slightly" less
    evil).

    My mechanism is taking that struct task.....s (at least the part HW
    needs to understand) and associating each one with a table entry that
    points at DRAM. Now, when you want this thread to run, you load up the
    pointer, set the e-bit (enabled), and write it into the current header at
    its privilege level. Poof--all 5 cache lines of state from the currently
    running thread go back to where their permanent home in DRAM is, and
    5 cache lines of state for the new thread are fetched.
    a) you can start the reads before you start the writes
    b) you can start the writes anytime you have outbound access to "the bus"
    c) the writes can be no later than the ½ cycle before the reads get
    written. This is a lot faster than you can do in SW with LDs and STs.


    I ended up simplifying this problem slightly:
    A previously reserved pointer at offset 0x0020 was repurposed as a
    designated pointer to the register-save area.

    This effectively turns most of both the TKPE_TaskInfo_s and
    TKPE_TaskInfoKern_s structures into "don't care" as far as the ISR
    prolog/epilog is concerned.

    Or, IOW: "(TBR, 0x20) holds a 64-bit pointer to the register save area".
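
    In C terms, a minimal sketch of what the ISR prolog relies on (only the
    0x20 offset is from the above; the other field names are made up):

      #include <stdint.h>

      struct tkpe_taskinfo_hdr {
          uint64_t  reserved0[4];     /* 0x00..0x1F: don't-care to the ISR       */
          uint64_t *reg_save_area;    /* 0x20: pointer to the register save area */
          /* ... rest of TKPE_TaskInfo_s, also don't-care here ... */
      };

      static inline uint64_t *isr_reg_save(const struct tkpe_taskinfo_hdr *tbr)
      {
          return tbr->reg_save_area;  /* "(TBR, 0x20)" */
      }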


    Also possible could be to add another CR for "Dump context registers
    here", this adds the costs of another CR though.

    I config-space mapped all my CRs, so you get an unlimited number of them.


    I have encoding space for 64 in theory (in XG2), 32 in Baseline.

    In practice, it is a little more limited, as the register ID space is
    also used for SPRs and special internal-use registers (like ZR, IMM, ...).

    Say:
    00..3F: GPRs
    40..5F: SPRs and special-use
    60..7F: CRs.
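
    Or, as a trivial C sketch of that split (the helper name is made up):

      enum regclass { REG_GPR, REG_SPR, REG_CR };

      static inline enum regclass classify_reg(unsigned id)  /* id = 0x00..0x7F */
      {
          if (id < 0x40) return REG_GPR;   /* 00..3F: GPRs                 */
          if (id < 0x60) return REG_SPR;   /* 40..5F: SPRs and special-use */
          return REG_CR;                   /* 60..7F: CRs                  */
      }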

    The bigger issue, though, is that (unlike GPRs) the CRs are implemented
    using FFs rather than LUTRAM, and thus every CR is relatively expensive.


    A bunch of potential CR assignments were burnt on the SMT feature, which
    never materialized (and probably won't; mostly as I realized that the
    original considered strategy for trying to implement SMT would have
    ended up likely being more expensive than having two logical processor
    cores).

    About the only resources that would really make sense to share SMT style
    ATM would likely be the FPU and SIMD unit. It would likely also make
    more sense to have two mostly-independent pipelines, rather than a
    single extra-wide pipeline with logically co-issued threads.

    But, if I revisit the idea, it would likely end up looking more like a
    pair of semi-conjoined cores behaving as-if they were two independent cores.

    But, not yet reclaimed the register numbers.


    I guess I can probably safely rule out MMIO on the basis that
    context switching via moving registers via MMIO would be slower than
    the current mechanism (of using a series of Load/Store instructions).
    .................
    Yes, but PTHREADing can be done without privilege and in a single
    instruction.


    OK.

    Luckily, a thread-switch only needs to go 1-way, reducing it to around
    500 cycles as-is in my case.

    In my case it is about MemoryLatency+5 cycles.
    Yes, thread switch is a 1-way function--which is the reason you can
    allow a user to preempt himself and allow a compatriot to run in his place.....


    OK.


    Theoretical minimum would be around 150-200 cycles, with most of the
    savings based on eliminating around 1.5kB worth of "memcpy()"...

    My Real Time version of My 66000 does a 10-ish cycle context switch
    (as seen at the CPU), but here a hunk of HW has gathered up those 5 cache
    lines and sent them to the targeted CPU, and all the CPU has to do is push
    out the old state (5 cache lines). So the data was heading towards the
    CPU before the CPU even knew it wanted that data !!


    Hmm.

    In my case, it seems more like stuff is going to get caught up in a
    series of L1 misses.

    Though, rapid-fire syscalls would at least have the advantage that their
    data is more likely to already be in-cache.


    This need not involve an ISA change, could in theory be done by making
    the SYSCALL ISR mandate that TBR be valid (and the associated compiler
    changes, likely the main issue here).



    Well, nevermind any cost of locating the next thread, but at the
    moment, I am using a fairly simplistic round-robin scheduling
    strategy, so the scheduler mostly starts at a given PID, and looks for
    the next PID that holds a valid/running task (wrapping back to PID 1
    if it hits the end, and stopping the search if it gets back to the
    original PID).
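
    A minimal C sketch of that search (the PID limit and the "runnable" test
    are hypothetical placeholders):

      #define MAX_PID 256                      /* hypothetical upper bound */

      extern int task_is_runnable(int pid);    /* hypothetical predicate   */

      static int sched_next_pid(int cur_pid)
      {
          int pid = cur_pid;
          do {
              pid++;
              if (pid >= MAX_PID)
                  pid = 1;                     /* wrap back to PID 1 */
              if (task_is_runnable(pid))
                  return pid;
          } while (pid != cur_pid);
          return cur_pid;                      /* nothing else runnable */
      }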


    The high-level threading model wasn't based on pthreads in my case,
    but rather C11 threads (and had implemented a lot of the "threads.h"
    stuff).

    One could potentially mimic pthreads on top of C11 threads though.

    At the moment, I forgot why I decided to go with C11 threads over
    pthreads, but IIRC I think I had felt at the time like C11 threads
    were a better fit.


    One could have enough register banks for N logical tasks, but
    supporting 4 or 8 copies of the register file is going to cost more
    than 2 or 3.


    Above, I was describing what the hardware was doing.

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Get R0 and R1 saved onto the stack;

    Where did you get the address of this stack ??


    SP and SSP swap places on interrupt entry (currently by renumbering
    the registers in the instruction decoder).

    So, in effect, you actually have 33 registers with only 32 visible at
    any instant. I am just so glad not to have gone down that rabbit hole
    this time......


    More like 65 with 64 visible at any given time.

    Early on, R0 and R1 had also swapped places with doppelganger
    counterparts, in a similar way, but I eliminated this early on.


    SSP is initialized early on to the SRAM stack, so when an interrupt
    happens, the 'SP' register automatically becomes the SRAM stack.

    Essentially, both SP and SSP are SPRs, but:
       SP is mapped into R15 in the GPR space;
       SSP is mapped into the CR space.

    So, when executing an ISR, it is effectively using SSP as its SP.
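
    In C terms, the decoder-side renaming amounts to something like this
    (the internal index value for SSP is made up):

      enum { REGIX_SSP = 64 };   /* hypothetical internal storage index for SSP */

      /* In ISR mode, architectural R15 selects SSP's storage instead of SP's. */
      static inline unsigned map_r15(unsigned regid, int in_isr)
      {
          if (regid == 15 && in_isr)
              return REGIX_SSP;
          return regid;
      }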


    If I were eliminate this implicit register-swap mechanism, then the
    ISR entry would likely need to reload a constant address each time.
    Though, this change would also break binary compatibility with my
    existing code.

    But, in theory, eliminating the register swap could allow demoting SP
    to being a normal GPR.

    Also, things like renumbering parts of the register space based on CPU
    mode are expensive.


    Though, some of my more recent design ideas would have gone over to an
    ordering slightly more like RISC-V, say:
       R0: ZR or PC  (ALU or MEM)
       R1: LR or TBR (ALU or MEM)
       R2: SP
       R3: GP (GBR)
       R4 -R15: Scratch
       R16-R31: Callee Save
       R32-R47: Scratch
       R48-R63: Callee Save

    Would likely not adopt RISC-V's C ABI though.

    R0::     GPR, Return Address, proxy for IP, proxy for 0
    R1..R9   Arguments and results passed in registers
    R10..R15 Temporary Registers (scratch)
    R16..R29 Callee Save
    R30      FP when in use, Callee Save
    R31      SP



    As-is, it is more like:
    R0: DLR or PC
    R1: DHR or GBR
    R2/R3: Scratch / Return
    R4..R7: Scratch / Arg0..Arg3
    R8..R14: Callee Save
    R15: SP
    R16..R19: Scratch
    R20..R23: Scratch / Arg4..Arg7
    R24..R31: Callee Save
    R32..R35: Scratch
    R36..R39: Scratch / Arg8..Arg11 (Opt)
    R40..R47: Callee Save
    R48..R51: Scratch
    R52..R55: Scratch / Arg12..Arg15 (Opt)
    R56..R63: Callee Save

    Which was effectively taking the general pattern for R0..R15, and then essentially repeating it 4 times.


    With RISC-V using partial remapping:
    X0: ZR
    X1: LR
    X2: SP
    X3: GBR
    X4: TBR (Read Only in Usermode)
    X5: DHR
    X6..X13: R6..R13
    X14: R2
    X15: R3
    X16..X31: R16..R31


    XG2RV uses RISC-V's register space, with a slightly tweaked version of
    XG2's encoding scheme.

    Initial plan was for XG2RV to use RISC-V's ABI, which could in theory
    allow thunk-free cross-ISA function calls, but... Getting BGBCC to
    support RISC-V's ABI would be a pain, and otherwise there is no real
    plausible way at the moment to link XG2 code and RISC-V code into a
    single binary, rendering the whole idea "kinda moot".


    Though, if one assumes R4..R63 are GPRs, this would allow both this
    ISA and RISC-V to still use the same register numbering.

    This is already fairly close to the register numbering scheme used in
    XG2RV, though the assumption was that XG2RV would have used RV's ABI,
    but this was stalled out mostly due to compiler issues (getting BGBCC
    to be able to follow RISC-V's C ABI rules would be a non-trivial level
    of effort; but is rendered moot if one still needs to use call thunking).


    The interpretation for R0 and R1 would depend on how they are used:
       ALU or similar: ZR and LR (Zero and Link Register)
       Load/Store Base: PC and TBR.

    Idea being that in userland, TBR effectively still exists as a
    Read-Only register (allowing userland to modify TBR would effectively
    also allow userland to wreck the OS).


    Thing is mostly that needing to renumber registers in the decoder
    based on CPU mode isn't entirely free in terms of LUT cost or timing
    latency (even if it only applies to a subset of the register space).

    Note that for RV decoding:
       X0..X31 -> R0 ..R31 (more or less)
       F0..F31 -> R32..R63

    But, RV's FPU instructions don't match up exactly 1:1, and some cases
    would have semantic differences.

    Though, it seems like most RV code could likely tolerate some
    deviation in some areas (will it care that the high 32 bits of a
    Binary32 register don't hold NaN? Will it care about the extra
    funkiness going on in LR? ...).


       Get some of the CRs saved off (we need R0 and R1 free here);
       Get the rest of the GPRs saved onto the stack;
       Call into the main part of the ISR handler (using normal C ABI);
       Restore most of the GPRs;
       Restore most of the CRs;
       Restore R0 and R1;
       Do an RTE.

    If HW does register file save/restore the above looks like::

    The software side is basically more like:
       Branch from VBR-table to ISR entry point;
       Call into the main part of the ISR handler (using normal C ABI);
       Do an RTE.

    See what it saves ??

    This is fewer instructions.

    But, hardware cost,

    the HW cost has already been purchased by the state machine that writes
    out 5 cache lines and waits for 5 cache lines to arrive.


    I don't have anything like this either...

    Miss handling is more like:
    L1 Cache sees that request has missed;
    Signal a pipeline stall;
    Throw requests onto the ringbus;
    Wait for responses to arrive;
    Execution continues when "all is good".

    State is mostly controlled with state flags, say:
    Has A sent a Store request;
    Has A sent a Load request;
    Has A gotten a response for a Store request;
    Has A gotten a response for a Load request;
    Has B sent a Store request;
    Has B sent a Load request;
    Has B gotten a response for a Store request;
    Has B gotten a response for a Load request;
    ...
    With an if/else tree dealing with the various cases.

    The ordering of the if/else tree will determine which order requests are
    sent, say:
    Store A
    Store B
    Load A
    Load B

    And checks to keep the pipeline stalled if a request has been sent but
    the corresponding response has not yet arrived.
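
    As a rough C model of that flag-driven logic (the real logic is Verilog;
    the hook names and the two-way A/B split here are illustrative):

      /* One call per "cycle"; returns nonzero while the pipeline should stall. */
      struct miss_state {
          unsigned need_st_a:1, need_ld_a:1, need_st_b:1, need_ld_b:1;
          unsigned sent_st_a:1, sent_ld_a:1, sent_st_b:1, sent_ld_b:1;
          unsigned done_st_a:1, done_ld_a:1, done_st_b:1, done_ld_b:1;
      };

      extern void bus_send_store(int way);   /* hypothetical ringbus hooks */
      extern void bus_send_load(int way);

      static int miss_step(struct miss_state *s)
      {
          /* if/else ordering fixes the request order: Store A, Store B, Load A, Load B */
          if (s->need_st_a && !s->sent_st_a)      { bus_send_store(0); s->sent_st_a = 1; }
          else if (s->need_st_b && !s->sent_st_b) { bus_send_store(1); s->sent_st_b = 1; }
          else if (s->need_ld_a && !s->sent_ld_a) { bus_send_load(0);  s->sent_ld_a = 1; }
          else if (s->need_ld_b && !s->sent_ld_b) { bus_send_load(1);  s->sent_ld_b = 1; }

          /* stay stalled while any sent request still lacks a response */
          return (s->sent_st_a && !s->done_st_a) || (s->sent_ld_a && !s->done_ld_a) ||
                 (s->sent_st_b && !s->done_st_b) || (s->sent_ld_b && !s->done_ld_b);
      }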


    and clock-cycle savings?...
    The reads can arrive before you start the writes, you can go so far as
    to organize your pipeline so the read data being written pushes
    out the write data that needs to return to memory, making the timing
    brain-dead easy to achieve.


    Wait, so, like pipelining requests/responses to external RAM?...

    In my case, all L1<->L2 communication is effectively synchronous and
    there isn't really any overlap between separate memory accesses (a given
    access will need to finish before any new memory access can begin).

    Getting too fancy here could raise issues, as the bus design introduces
    a certain level of "chaos" (responses will often not arrive in the same
    order the requests were sent, at the whim of L2 hit/miss).


    I have noted that I can have a RAM-backed framebuffer and a rasterizer
    module without significantly affecting memory bandwidth for the main CPU
    core (the different entities using the bus being mostly invisible to
    each other).


    Well, except when trying to switch the display module into 800x600 72Hz 256-color or hi-color mode or similar (then the screen turns to garbage
    and memory performance seemingly goes to crap).

    Seemingly, about the highest it can manage relatively OK is 640x480 60Hz 256-color.

    Previously, 800x600 worked OK at 36Hz, but this is rather non-standard.

    The 256-color mode works OK-ish, but:
    Still don't have a particularly good "generic" 256-color palette.

    Say, examples of the palette can be seen here:
    https://twitter.com/cr88192/status/1727574073566257232


    Also the process of remapping everything from RGB555 to 256-color seems
    to be kinda slow.

    With a sort of "no great option":
    15-bit lookup tables are too big, and will result in excessive L1 cache
    misses.

    Bit-twiddling RGB555 into GBR533 involves a lot of bit twiddling and gives
    slightly worse-looking results, but has faster start-up times (it is a lot
    faster to regenerate an 11-bit lookup table than a 15-bit lookup table).

    Where, say, RGB555 to GBR533 is, say:
    ((v&0x3FC)<<1)|((v>>12)&7)
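
    As a C sketch (assuming RGB555 with R in bits 14:10, G in 9:5, B in 4:0,
    and a hypothetical 2048-entry palette-index table regenerated at startup):

      #include <stdint.h>

      static uint8_t pal_lut_gbr533[2048];     /* 2^11 entries */

      static inline uint8_t rgb555_to_pal(uint16_t v)
      {
          /* Keep G[4:0] and B[4:2], then append R[4:2]: an 11-bit GBR533 index. */
          uint16_t idx = (uint16_t)(((v & 0x3FC) << 1) | ((v >> 12) & 7));
          return pal_lut_gbr533[idx];
      }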


    Generally, things related to redrawing the windows and refreshing the
    screen tend to dominate CPU usage in this case.

    As can be noted, both Doom and Hexen are stuck at "rather crap" framerates
    of around 8 fps.

    Palette used for the Doom image:
    16 shades of 16 colors;
    0z: Grayscale
    1z-6z: High Saturation
    9z-Ez: Low Saturation
    7z/8z/Fz: Off-White
    Newer palette:
    13 shades of 18 colors + RGBI colors;
    0z: Grayscale
    1z-6z: High Saturation
    9z-Ez: Low Saturation
    7z/8z/Fz: Off-White
    Vertically:
    z0: RGBI
    z1: Orange / Amber
    z2: Sky Blue

    In a prior variant, there was an olive-green axis, but I left it out of
    this one. The RGBI colors were because otherwise these colors could not
    be recreated faithfully (eg: for console text or 16-color bitmaps).
    Orange and Sky Blue slightly improve color-fidelity, and the loss of
    various "almost but not quite black" colors was not a huge loss.

    However, 0x70 and 0x80 were redundant, so 0x80 was repurposed as a
    transparent color. F0 and 0F are "not quite" redundant:
    F0 is 7FFF, 0F is 7BDE.



    As-is, I can't come up with much that is both:
       Fairly cheap to implement in hardware;
       Able to save a lot of clock cycles over software-based options.

    As noted, the former is also why I had thus far mostly rejected the
    RISC-V strategy (*).

    Yet, you seem to be buying insurance as if you might need to head in that direction.


    Yeah, but I don't really want to go there either...


    *: Ironically, despite RISC-V having fewer GPRs, to implement the
    Privileged spec, RISC-V would still end up needing a somewhat bigger
    register file... Nevermind what exactly is going on with CSRs...

    Whereas that special State is only a dozen register <with state>
    in My 66000--the rest being either memory resident or memory mapped.

    OK.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul A. Clayton@21:1/5 to MitchAlsup on Fri Nov 24 08:19:34 2023
    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Yes, x86 messes up even these aspects. VLE does not have to be
    byte granular or use multiple prefixes in variable order. Hardware
    context save/restore does not have to be limited to extended
    state. A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation, even
    with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, split GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit compute
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect you could write a multi-volume treatise on x86 about hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to Paul A. Clayton on Fri Nov 24 09:43:23 2023
    On 2023-11-24 8:19 a.m., Paul A. Clayton wrote:
    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Yes, x86 messes up even these aspects. VLE does not have to be
    byte granular or use multiple prefixes in variable order. Hardware
    context save/restore does not have to be limited to extended
    state. A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation, even
    with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, spit GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit computer
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect you could write a multi-volume treatise on x86 about hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    There is a lot of value in having a unique architecture. The x86 has had
    a lot of things bolted on to it. It has adapted over time. Being able to
    see how things have changed is valuable. I suspect just about any
    architecture adapted over a 40- or 50-year period would look not so
    appealing. I happen to like the segmented approach, not necessarily
    because it is a good way to do things, but it was certainly interesting
    and challenging. An interesting, challenging, and somewhat mysterious architecture may be more appealing than the best organized, most
    performant, energy efficient one. There is a trade-off between ‘the
    best’ and the ‘human factor’. I can imagine that there might be treaties limiting computer performance somewhere. Just how fast of a CPU is legal?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Nov 24 13:41:26 2023
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Fri Nov 24 18:24:00 2023
    Paul A. Clayton wrote:

    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    It is these things which allow my architecture to only need 70%
    of the instructions RISC-V needs.

    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Having 1 flat set of registers (any register can hold any result or
    operand) is a My 66000 requirement. The only things I took from x86-64 are
    the [base+index<<scale+displacement] memory addressing model and
    the 2-level MMU, and even here I used the I/O MMU version rather than
    the processor version.

    Yes, x86 messes up even these aspects.
    VLE does not have to be byte granular or use multiple prefixes in variable order.
    VLE does not need prefixes of any kind.
    Hardware context save/restore does not have to be limited to extended state.
    HW S/R is most useful when it deals with ALL the state.
    A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation,
    One can synthesize SIMD and Vector saving 90% of the OpCode space
    even with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, spit GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit computer
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect this came out of already having to implement HW for
    the IC (Insert Character) instruction from System/360 time.

    I suspect you could write a multi-volume treatise on x86 about hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    It would be more beneficial to the world just to build an architecture
    without any of those flaws--just to show them how it's done.

    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    Captain Obvious would be proud.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Fri Nov 24 22:21:53 2023
    Stefan Monnier wrote:

    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    This seems to mimic RISC-V set of levels but done/named differently.

    The Guest OS and Guest HV levels are done in such a way that you can have
    a stack of Guest OSs of any depth and a stack of Guest HVs of any depth;
    although the HW only supports 4 levels, HW with SW intervention supports
    any number of levels.

    In particular: Guest OS manages faults from Application, Guest HV manages faults from Guest OS {Which makes it possible to recover from page faults
    in the "sticky" places of interrupt and exception handling}, Real HV
    manages faults from Guest HV.



    Application accesses only 63 bits of virtual address space. If the
    application makes an access with the HOB of the virtual address set, the
    access takes a fault.

    Guest OS can reach down into Application by accessing with the HOB clear (0)
    or access its own VAS with the HOB set (1).

    Guest HV can reach down into Guest OS by accessing with the HOB set (1)
    or access its own VAS with the HOB clear(0).

    Real HV can reach down into Guest HV by accessing with the HOB clear (0)
    or access its own VAS with the HOB set (1).
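
    A minimal C sketch of that HOB rule as described (the encoding and names
    are illustrative only):

      #include <stdint.h>

      enum level { APP, GUEST_OS, GUEST_HV, REAL_HV };

      /* 1 = access reaches down one level, 0 = accessor's own space, -1 = fault. */
      static inline int hob_target(enum level lvl, uint64_t va)
      {
          int hob = (int)(va >> 63);
          if (lvl == APP)
              return hob ? -1 : 0;        /* application faults when HOB is set */
          /* Guest OS reaches down on HOB=0, Guest HV on HOB=1, Real HV on HOB=0 */
          int down_hob = (lvl == GUEST_HV) ? 1 : 0;
          return (hob == down_hob) ? 1 : 0;
      }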


    Assuming we are running with a HV::
    Application accesses use 2-level paging through Application Mapping Tables.
    Guest OS accesses to Application use 2-level paging through Application
    Mapping Tables; accesses to its own Guest OS space use 2-level paging
    through Guest OS Tables.
    Guest HV accesses to Guest OS use 2-level paging through Guest OS Tables,
    and accesses to Guest HV use 1-level paging through Guest HV Tables.
    Real HV accesses to Guest HV use 1-level paging through Guest HV Tables,
    and accesses to Real HV use 1-level paging through Real HV Tables.

    --------------

    When a 2-level mapping creates an UnCacheable, MMI/O, ROM, or config space
    access, this intermediate address space determines the memory order. So,
    Guest OS can make a process's address space sequentially consistent by
    making all the PTEs use MMI/O space accesses. The second level of
    translation will, then, translate that access back to <say> cacheable DRAM
    to be performed. Likewise, should the second level of translation produce
    an access other than cacheable DRAM, memory order is determined by the
    stronger method of both translations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Sat Nov 25 00:01:00 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differentiate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Fri Nov 24 20:57:03 2023
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV} >>
    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    Not entirely sure how multi-level virtualization works with page-tables,
    but works "somehow".


    Then again, it is possible that doing everything in software could lead
    to people working in inner levels being jealous of those working in the
    outer levels for being closer to the hardware (and thus presumably
    having lower performance overheads).

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Fri Nov 24 20:49:57 2023
    On 11/24/2023 12:24 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/23/23 6:30 PM, MitchAlsup wrote:
    [snip]
    Stay away from anything you see in x86 except in using it a moniker
    to avoid.

    Even a stopped (12-hour) clock is right twice a day.

    I hope you are not going to remove from My 66000 variable length
    instruction encoding, hardware handling of (some for x86, XSAVE/
    XRSTR) context saving and restoring, or even Memory Move.

    It is these things which allows my architecture to only need 70%
    of the instructions RISC-V needs.


    In some of my tests, the total number of executed instructions tends to
    be less than RISC-V as well.


    Best I can tell, the main things that save instructions are mostly:
    Register-indexed load/store (~ 30% of Ld/St);
    MOV.X (~ 12% of Ld/St);
    Jumbo prefixes (~ 6%).


    Though, apparently, someone posted something recently showing RV64 and
    ARM64 to be much closer than expected, which is curious. The main
    instructions that seem to have "the most bang for the buck" are ones
    that ARM64 has equivalents of.

    More testing may be needed.



    In other news:
    Did add the compiler support to eliminate the "memcpy()" step from the
    task-switching (by having the prolog/epilog save/restore registers
    directly from the task context).

    Should roughly halve syscall overhead, along with shaving 480 bytes off
    the stack frame (some of the GPRs and CRs still need to be shuffled via
    the stack, so it only saves 480 bytes rather than 640).


    One could go even further and claim avoiding anything seen in x86
    means not having registers (a storage region with simple, compact
    addressing that an implementation will optimize as the common case
    for operands — the Mill's Belt counts as "registers" in this sense
    and even something like a transport-trigger architecture would
    likely have storage for values with temporal locality coarser than
    immediate use but frequent enough to justify simpler and more
    compact addressing).

    Having 1 set of flat (any register can do any result or operand) is
    a My 66000 requirement, The only things I took from x86-64 is
    the [base+index<<scale+displacement] memory addressing model, and
    the 2-level MMU, even here I used the I/O MMU version rather than the processor version.


    Yeah. Flat registers are good.

    Internally, [base+index*scale] or [base+disp*scale] can do "most of it"...


    Having [base+index*scale+disp] could do a little more, but seems to be
    somewhat rarer. I had experimented with such an encoding, but it didn't
    seem like it saw enough use-cases to justify the cost of its existence.

    Granted, it might be more useful if it could be encoded like:
    JumboImm+JumboOp+LdSt
    With a 33 bit displacement (rather than a 9/11 bit displacement), as
    this could potentially allow using it to address global arrays.

    IOW, potentially allowing:
    FEdd-dddd-FFw0-0Vdd-F0nm-0eoZ MOV.x (Rm, Ro*Sc, Disp33s), Rn
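
    For example, the kind of access that encoding would aim at (hypothetical C,
    just to show the shape of the address computation):

      static int big_table[1 << 20];   /* hypothetical global array */

      int fetch(int i)
      {
          /* conceptually one instruction: base + index*4 + 33-bit displacement */
          return big_table[i];
      }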


    Yes, x86 messes up even these aspects. VLE does not have to be byte
    granular or use multiple prefixes in variable order.
    VLE does not need prefixes of any kind.
    Hardware context save/restore does not have to be limited to extended
    state.
    HW S/R is most useful when it deals with ALL the state.
    A memory move instruction does not *need* to have a variant
    for each possible/likely chunk size or be implemented as
    substantially less performant than a software implementation,
    One can synthesize SIMD and Vector saving 90% of the OpCode space
    even with compile-time known size and alignment. Registers do not have
    to be limited to 8 or be accessed in sub-units.

    (Sub-unit access has some attraction to me for more efficiently
    using a limited storage space while still trying to keep access
    simple by limiting variability of shifting and complexity of
    partial write ordering, but less efficient storage use can easily
    be better than complexity of accessing the fastest and most
    commonly accessed storage. More recent ISAs have implemented
    partial register accesses. IBM ZArch, S/360 descendant, spit GPRs
    into high and low halves to increase the number of values
    available in the nominally 16 GPRs. AArch64 has 32-bit computer
    operations motivated, I think, for power saving, which do not
    increase the number of values and so avoids the shift and partial-
    write problems.)

    I suspect this came out of already having to implement HW for IC (insert Character) instruction from System 360 time.

    Seems sane.


    I suspect you could write a multi-volume treatise on x86 about
    hardware-software interface design and management (including the
    social and economic considerations of project/product management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    It would be more beneficial to the world just to build an architecture without any of those flaws--just to show them how its done.


    People can probably debate what is ideal.


    There seem to be people around who see RISC-V as the model of perfection.

    I disagree: some things seem to be corner-cutting in areas where
    doing so is a foot-gun, other areas are needlessly expensive, and
    some things in the reaches of "extensions land" are just kinda absurd.

    In some ways, it is (as I see it) better to define some things and leave
    them as optional, rather than define little, and leave everyone else to
    make an incoherent mess of things.


    Then again, there are likely disagreements as to what sorts of features
    seem meaningful, wasteful, or a needless extravagance.


    Granted, it does seem like x86 probably needs to be retired at some point...


    [Yet once more stating what is obvious, especially to one skilled
    in the art.]

    Captain Obvious would be proud.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sat Nov 25 16:55:30 2023
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV} >>>
    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3. >>
    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).

    Both intel and AMD use a block of memory to record guest state
    and have instructions to enter and leave VM mode (e.g. vmenter);
    ARM stores guest state in system registers - less overhead
    when switching from guest to host or guest to guest.


    Not entirely sure how multi-level virtualization works with page-tables,
    but works "somehow".

    But not well, nor performant.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sat Nov 25 12:31:36 2023
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3. >>>
    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode
    (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Granted, there is still the annoyance that the OS's tend to deal with page-tables, and one needs to translate to inverted page tables, which typically have a finite associativity (such as 4 or 8 way).

    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).


    Granted, full soft TLB isn't ideal for performance either (in general),
    my workaround was mostly making the TLB big enough that the average-case
    miss rate is kept fairly low (well, and for now, putting the whole OS in
    one big address space).

    But, multiple address spaces is sort of the whole point of VMs, so...

    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).


    Both intel and AMD use a block of memory to record guest state
    and have instructions to enter and leave VM mode (e.g. vmenter);
    ARM stores guest state in system registers - less overhead
    when switching from guest to host or guest to guest.


    OK.



    Not entirely sure how multi-level virtualization works with page-tables,
    but works "somehow".

    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Nov 25 19:27:04 2023
    BGB wrote:

    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode >>> (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Think of it like this:: Privilege inversion::

    If HV is performing table walks on behalf of Guest OS, HV is having to
    rummage through Guest OS tables and then rummage through HV's own tables.
    Here, having HV rummage through Guest OS tables is more than a hassle:
    nothing in HV should directly touch anything in Guest OS unless Guest
    OS grants access (and not implicitly, as it is herein).

    What you REALLY want is for Guest OS to manage its own tables and HV to
    manage its own tables. Thereby, no particular piece of SW is capable of
    operating at the lowest privilege of {Guest OS, HV}; it can be one or the
    other.

    The above holds for any kind of tables, nested, inverted, nested inverted,
    ..

    Granted, there is still the annoyance that the OS's tend to deal with page-tables, and one needs to translate to inverted page tables, which typically have a finite associativity (such as 4 or 8 way).

    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).


    Granted, full soft TLB isn't ideal for performance either (in general),
    my workaround was mostly making the TLB big enough that the average-case
    miss rate is kept fairly low (well, and for now, putting the whole OS in
    one big address space).

    But, multiple address spaces is sort of the whole point of VMs, so...

    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).


    Both intel and AMD use a block of memory to record guest state
    and have instructions to enter and leave VM mode (e.g. vmenter);
    ARM stores guest state in system registers - less overhead
    when switching from guest to host or guest to guest.


    OK.



    Not entirely sure how multi-level virtualization works with page-tables, >>> but works "somehow".

    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sat Nov 25 19:28:50 2023
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differeniate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode >>> (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Let's look at a hardware example of a nested page table walk,
    using the AMD nested page table feature as a guide. The AMD
    version uses the same PTE format as the non-virtualized page
    tables (which reduces the amount of kernel code required to
    manage the page tables) unlike Intel's EPT.

    Assuming 4k-byte pages in both the primary and nested page tables,
    a page table walk must make 22 memory accesses to satisfy a
    VA to PA translation, versus only four in a non-virtualized
    table walk. This can be reduced to 11 if you have the luxury
    of using 1GB mappings in the nested page table.

    Performing all those accesses in a kernel fault handler would
    consume a great deal more time than a hardware table walker will (particularly if the hardware table walkers can cache the intermediate results
    of the higher-level blocks in the walk in the walk hardware).

    The downsides of IPT pretty much preclude their use in most
    modern operating systems where shared memory between processes
    is common (explicitly -or- implicitly (such as VDSO on linux));
    some of the goals listed as benefits for IPT (e.g. easier whole
    process swapping) are made irrelevant by modern operating
    systems that don't do that. There's a rather incoherent
    description of IPT at geeksforgeeks - I'd not recommend it
    as a useful resource.


    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).

    If you're taking an interrupt to resolve guest TLB misses,
    performance is clearly not a high priority.



    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. Thats an advantage of the exception level scheme, where
    each level has its own set of control registers.

    However, a shortcoming of the initial implementation was that if the
    hypervisor was type II, the hypervisor needed to have a special
    privileged guest to run standard user-mode code[*]. So they
    added (in V8.1) the virtual host extensions (VHE), which allowed
    the hypervisor exception level (EL2) to directly dispatch
    user-mode code to EL0 (with the normal traps from usermode
    to the OS directed to the hypervisor instead of a guest OS). This
    lets the hypervisor (e.g. KVM) act both as a hypervisor and
    a guest OS without the context switches required to support a
    privileged guest.

    [*] And also to provide VFIO support for non-SRIOV hardware devices.


    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    Before AMD added NPT (Nested Page Tables), the hypervisor needed to
    be able to recognize and trap any accesses from the guest OS to
    its own page tables and update the real page tables accordingly.
    To do that, they had several options:
    1) Paravirtualization (i.e. all guest page table ops call the
    hypervisor rather than changing the page tables directly);
    Xen did this.
    2) Write-protecting the page tables and trapping any writes in
    the hypervisor. Difficult to do since the page tables in
    common OS are not allocated contiguously and they are updated
    using normal loads and stores (the HV does know them, however,
    as it can trap writes to CR3 and from there can write-protect
    the entire table in the real page tables).
    3) Binary patch the guest operating system. This was the approach used
    by VMware before AMD introduced NPT.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sat Nov 25 22:10:45 2023
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things
    like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).



    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).

    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.

    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise. Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.
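
    Illustratively, a tagged entry and its hit test might look something like
    this (field names are placeholders):

      #include <stdbool.h>
      #include <stdint.h>

      struct tlb_entry {
          uint64_t vpn, ppn;      /* virtual / physical page numbers */
          uint16_t asid, vmid;
          bool     global;        /* set via the page table: skip ASID/VMID match */
          bool     valid;
      };

      static bool tlb_hit(const struct tlb_entry *e, uint64_t vpn,
                          uint16_t asid, uint16_t vmid)
      {
          return e->valid && e->vpn == vpn &&
                 (e->global || (e->asid == asid && e->vmid == vmid));
      }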


    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
    full TLB flush on context-switch would suck pretty bad).

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtual address space is shared by all processes - there's no reason
    that those entries need to be flushed on context-switch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sat Nov 25 15:39:53 2023
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differentiate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of
    multi-level translation wonk in the top-level miss handler (and likely
    still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Let's look at a hardware example of a nested page table walk,
    using the AMD nested page table feature as a guide. The AMD
    version uses the same PTE format as the non-virtualized page
    tables (which reduces the amount of kernel code required to
    manage the page tables) unlike Intel's EPT.

    Assuming 4k-byte pages in both the primary and nested page tables,
    a page table walk must make 22 memory accesses to satisfy a
    VA to PA translation, versus only four in a non-virtualized
    table walk. This can be reduced to 11 if you have the luxury
    of using 1GB mappings in the nested page table.
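
    A hedged sketch of why the count multiplies (not AMD's actual walker;
    4-level tables assumed, no walk caching): every guest-physical address
    produced during the guest's walk, including the guest's own table
    pointers, must itself be pushed through the host's nested tables
    before it can be read:

      #include <stdint.h>

      #define LEVELS 4

      extern uint64_t read_phys(uint64_t pa);        /* one real memory access */

      static uint64_t index_at(uint64_t a, int lvl)  /* 9 VA bits per level    */
      {
          return (a >> (12 + 9 * (LEVELS - 1 - lvl))) & 0x1FF;
      }

      /* Ordinary walk of the host (nested) tables: LEVELS accesses.          */
      static uint64_t host_walk(uint64_t ncr3, uint64_t gpa)
      {
          uint64_t table = ncr3;
          for (int lvl = 0; lvl < LEVELS; lvl++)
              table = read_phys(table + 8 * index_at(gpa, lvl)) & ~0xFFFull;
          return table | (gpa & 0xFFF);
      }

      uint64_t nested_translate(uint64_t gcr3, uint64_t ncr3, uint64_t gva)
      {
          uint64_t gtable = gcr3;
          for (int lvl = 0; lvl < LEVELS; lvl++) {
              /* the guest's table pointer is guest-physical: nested walk first */
              uint64_t htable = host_walk(ncr3, gtable);
              uint64_t gpte   = read_phys(htable + 8 * index_at(gva, lvl));
              gtable = gpte & ~0xFFFull;
          }
          /* finally translate the guest-physical address of the data itself   */
          return host_walk(ncr3, gtable | (gva & 0xFFF));
      }

    Counting the accesses in the loops shows how the total lands in the
    twenties rather than four, and why caching the intermediate host-walk
    results pays off so well.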

    Performing all those accesses in a kernel fault handler would
    consume a great deal more time than a hardware table walker will (particularly
    if the hardware table walkers can cache the intermediate results
    of the higher-level blocks in the walk in the walk hardware).


    OK.

    The downsides of IPT pretty much preclude their use in most
    modern operating systems where shared memory between processes
    is common (explicitly -or- implicitly (such as VDSO on linux));
    some of the goals listed as benefits for IPT (e.g. easier whole
    process swapping) are made irrelevant by modern operating
    systems that don't do that. There's a rather incoherent
    description of IPT at geeksforgeeks - I'd not recommend it
    as a useful resource.


    I was thinking of an IPT where one basically keeps stuff from all of the currently running processes in a shared IPT, mostly treating it like a big memory-backed form of the TLB.

    Though, sharing is a concern:
    If you hash entries based on ASID, then there are fewer collisions, but
    no sharing;
    Sharing requires addressing to effectively be plain modulo within the
    areas that can be shared.


    Initially, I had assumed non-hashed modulo indexing, but this does mean
    a potentially higher collision rate if different ASIDs have different
    pages in the same overlapping address ranges.

    Something like 8-way associativity would be "better" here at reducing
    this issue, but more expensive to deal with in hardware than 4-way.
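
    A hedged sketch of that "big memory-backed TLB" flavor of IPT (sizes
    and field widths are illustrative, not the actual BJX2 format); plain
    modulo indexing keeps entries shareable across ASIDs, at the price of
    collisions when different ASIDs map the same address range:

      #include <stdint.h>
      #include <stddef.h>

      #define IPT_SETS 4096               /* total entries = IPT_SETS * IPT_WAYS */
      #define IPT_WAYS 4

      struct ipt_entry {
          uint64_t vpn;                   /* virtual page number                  */
          uint64_t pfn;                   /* physical frame number + permissions  */
          uint16_t asid;                  /* address-space tag                    */
          uint8_t  valid;
      };

      static struct ipt_entry ipt[IPT_SETS][IPT_WAYS];

      /* Plain modulo indexing: shareable across ASIDs; hashing in the ASID
       * would reduce collisions but would also prevent sharing entries.     */
      static size_t ipt_index(uint64_t vpn) { return vpn % IPT_SETS; }

      /* Returns the matching entry, or NULL on a miss (which would fall
       * back to the real page-table walk, or to the guest when nested).     */
      struct ipt_entry *ipt_lookup(uint16_t asid, uint64_t vpn)
      {
          struct ipt_entry *set = ipt[ipt_index(vpn)];
          for (int w = 0; w < IPT_WAYS; w++)
              if (set[w].valid && set[w].vpn == vpn && set[w].asid == asid)
                  return &set[w];
          return NULL;
      }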



    Would mean that multi-level interrupt handling would still be needed
    whenever the page isn't in the guest's TLB or VIPT (short of breaking
    abstraction and faking the use of hardware page walking for the guest OS's).

    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things
    like handling the timer interrupt, etc...

    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
    full TLB flush on context-switch would suck pretty bad).



    But, doing traditional "every process gets its own address space" takes
    a hit here (no good option other than to limit the task-switching
    frequency, but this may become obvious to the user if the task switching
    is too slow).

    So, for something like a 50MHz core, this might mean, say, allowing a
    process to run for up to 250ms before the preemptive task-switch
    mechanism kicks in. But, 250ms is slow enough to become obvious to a
    user (or, at least, much more so than, say, 100ms).


    Though, probably still better than a purely cooperative scheduler, where
    a process failing to call "thrd_yeild()" effectively locks up the whole
    rest of the system (in my GUI experiments, this effect results in, say,
    Doom effectively locking up the whole GUI until it running the game
    proper, where it then starts calling "thrd_yeild()").

    Though, might make sense to consider also being able to forcibly yield
    threads on system calls and/or in some other C library calls.

    Though, in these cases, will likely still need to start adding mutex
    locks in some areas.




    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. That's an advantage of the exception level scheme, where
    each level has its own set of control registers.

    However, a shortcoming of the initial implementation was that if the
    hypervisor was type II, the hypervisor needed to have a special
    privileged guest to run standard user-mode code[*]. So they
    added (in V8.1) the virtual host extensions (VHE) which allowed
    the hypervisor exception level (EL2) to directly dispatch
    user-mode code to EL0 (with the normal traps from usermode
    to the OS directed to the hypervisor instead of a guest OS). This
    let the hypervisor (e.g. KVM) act as both a hypervisor and
    a guest OS without the context switches required to support a
    privileged guest.

    [*] And also to provide VFIO support for non-SRIOV hardware devices.


    OK.


    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how
    virtualization worked on x86...

    Before AMD added NPT (Nested Page Tables), the hypervisor needed to
    be able to recognize and trap any accesses from the guest OS to
    its own page tables and update the real page tables accordingly.
    To do that, they had several options:
    1) Paravirtualization (i.e. all guest page table ops call the
    hypervisor rather than changing the page tables directly);
    Xen did this.
    2) Write-protecting the page tables and trapping any writes in
    the hypervisor. Difficult to do since the page tables in
    common OS are not allocated contiguously and they are updated
    using normal loads and stores (the HV does know them, however,
    as it can trap writes to CR3 and from there can write-protect
    the entire table in the real page tables).
    3) Binary patch the guest operating system. This was the approach used
    by VMware before AMD introduced NPT.


    OK.

    FWIW:
    One feature of my VUGID/ACLID scheme, is that it is possible to have
    memory be Read/Write to one task, and Read-Only to another task (with a
    trap if they try to write to it), without needing to use separate
    mappings (and thus, both tasks can share the same TLBE's; but will get different access depending on who accesses it).

    Though, I don't expect this scheme would see much adoption in mainline
    OS's, nor likely much adoption in targets based around hardware
    page-table walkers...





    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 01:50:39 2023
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 10:55 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/24/2023 6:01 PM, Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

    Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

    Would require priority decoders to differentiate rather
    than simple gates, probably.

    Although I wonder at the missing firmware privilege level, a la SMM or EL3.

    ARM added support for nested hypervisors without adding a
    new exception level. Although interesting, there isn't much
    evidence of it being used in production. Yet anyway.


    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    With pretty much anything that isn't "bare metal" being put in User Mode (potentially using emulation traps as needed).

    Something like a Soft-TLB or Inverted-Page-Table does not need any
    special hardware to support nested translation (whereas hardware
    page-walking would require dedicated support).

    It's been tried. And performance sucked big-time. The reason
    that AMD added back support for the DS limit register in AMD64
    was to support xen (and vmware) before Pacifica (the AMD project
    that became Secure Virtual Machine (SVM) known now as AMD-V).


    OK.

    I wouldn't expect nested inverted-page-table translation to be *that*
    much slower than normal inverted page tables. Though, would add a bit of multi-level translation wonk in the top-level miss handler (and likely still better than multi-level soft-TLB, where a miss in the outer TLB
    level means needing to propagate the interrupt inwards and then
    emulating it the whole way up).

    Let's look at a hardware example of a nested page table walk,
    using the AMD nested page table feature as a guide. The AMD
    version uses the same PTE format as the non-virtualized page
    tables (which reduces the amount of kernel code required to
    manage the page tables) unlike Intel's EPT.

    Assuming 4k-byte pages in both the primary and nested page tables,
    a page table walk must make 22 memory accesses to satisfy a
    VA to PA translation, versus only four in a non-virtualized
    table walk. This can be reduced to 11 if you have the luxury
    of using 1GB mappings in the nested page table.

    20 of those 22 accesses are subject to caching of various flavors.

    Performing all those accesses in a kernel fault handler would
    consume a great deal more time than a hardware table walker will (particularly
    if the hardware table walkers can cache the intermediate results
    of the higher-level blocks in the walk in the walk hardware).

    The downsides of IPT pretty much preclude their use in most
    modern operating systems where shared memory between processes
    is common (explicitly -or- implicitly (such as VDSO on linux));
    some of the goals listed as benefits for IPT (e.g. easier whole
    process swapping) are made irrelevant by modern operating
    systems that don't do that. There's a rather incoherent
    description of IPT at geeksforgeeks - I'd not recommend it
    as a useful resource.

    If you want to run any form of *nix you must design the center of
    control at/in the CPU[s].....for better or worse.

    Would mean that multi-level interrupt handling would still be needed whenever the page isn't in the guest's TLB or VIPT (short of breaking abstraction and faking the use of hardware page walking for the guest OS's).

    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.



    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. That's an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    However, a shortcoming of the initial implementation was that if the
    hypervisor was type II, the hypervisor needed to have a special
    privileged guest to run standard user-mode code[*]. So they
    added (in V8.1) the virtual host extensions (VHE) which allowed
    the hypervisor exception level (EL2) to directly dispatch
    user-mode code to EL0 (with the normal traps from usermode
    to the OS directed to the hypervisor instead of a guest OS). This
    let the hypervisor (e.g. KVM) act as both a hypervisor and
    a guest OS without the context switches required to support a
    privileged guest.

    [*] And also to provide VFIO support for non-SRIOV hardware devices.


    But not well, nor performant.


    As far as I know, the whole "nested page tables" was the core of how virtualization worked on x86...

    Before AMD added NPT (Nested Page Tables), the hypervisor needed to
    be able to recognize and trap any accesses from the guest OS to
    its own page tables and update the real page tables accordingly.
    To do that, they had several options:
    1) Paravirtualization (i.e. all guest page table ops call the
    hypervisor rather than changing the page tables directly);
    Xen did this.
    2) Write-protecting the page tables and trapping any writes in
    the hypervisor. Difficult to do since the page tables in
    common OS are not allocated contiguously and they are updated
    using normal loads and stores (the HV does know them, however,
    as it can trap writes to CR3 and from there can write-protect
    the entire table in the real page tables).
    3) Binary patch the guest operating system. This was the approach used
    by VMware before AMD introduced NPT.

    Nested Page Tables are the best solution (Fewest SW instructions of
    overhead and total cycles of latency) we currently know of.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 26 16:01:30 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    Seems like one might need a mechanism to remap the VM from real CR's to
    a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. That's an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    <snip>

    Nested Page Tables are the best solution (Fewest SW instructions of
    overhead and total cycles of latency) we currently know of.

    Indeed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 19:28:13 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    Seems like one might need a mechanism to remap the VM from real CR's to a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. Thats an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    Nice to know someone already blazed the trail.

    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    My 66000 has a HR (Header Register) instruction to access one
    register at a time, but a MM (memory to memory move) instruction
    can be used to swap the entire core-stack {HV-level context switch.}
    MM to a MMI/O space is guaranteed to be ATOMIC across the entire
    transfer.

    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    <snip>

    Nested Page Tables are the best solution (Fewest SW instructions of overhead and total cycles of latency) we currently know of.

    Indeed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sun Nov 26 15:17:06 2023
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things
    like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    Much of the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    At least, excluding something like using B-Tree based page tables...

    It could be made faster, but would likely require doing the TLB miss
    handler in ASM and only saving/restoring the minimum number of registers
    (well, at least until we detect that there will be a page-fault, which
    would still require falling back to a "more comprehensive" handler).


    Any L1 miss penalties from the page-walk itself would likely also apply
    to a hardware page-walker.



    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).


    I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
    With a 16K page size (about 16MB of total coverage), this is basically enough to keep roughly something
    the size of the working set of Doom entirely in the TLB.


    In my past experiments, 16K seemed to be the local optimum for the
    programs tested:
    4K and 8K resulted in higher miss rates;
    32K and 64K resulted in more "internal fragmentation" without much
    reduction in miss rate.


    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.


    This is why my TLB is 4-way set-associative.

    An 8-way TLB would be a lot more expensive, and a fully-associative TLB
    (of nearly any non-trivial size) would be effectively implausible.


    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise. Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.


    Yeah.

    I guess a factor here is mostly defining rules to both allow for and
    control the scope of global pages.

    In my case:
    The TTB register defines an ASID in the high order bits;
    The TLBE also has an ASID;
    The ASID is split into two parts (6 and 10 bits).
    In the ASID, 0 designates global pages,
    but these are broken into "groups",
    so typically a global page is only shared within a given group.

    I am thinking the 6.10 split may have given too many bits to the group,
    and 4.12 or 2.14 might have been better.

    As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
    would not (but would see global pages in ASID 0400).

    So, say, in the current scheme:
    ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
    global address space.


    Where, say, if during a TLB Miss, if a page is marked global, it can be
    put into one of these ASIDs rather than the main ASID of the current
    process (if not in an ASID range which disallows global pages).

    The size of the group will have an effect on miss rate in cases where
    there are a lot of active PIDs though.
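
    A hedged sketch of that match rule (the 6/10 split and the examples
    follow the description above; everything else is illustrative):

      #include <stdint.h>
      #include <stdbool.h>

      #define ASID_GROUP_SHIFT 10
      #define ASID_ID_MASK     ((1u << ASID_GROUP_SHIFT) - 1)  /* low 10 bits */

      /* Does a TLB entry tagged 'tlbe_asid' hit for the current 'cur_asid'? */
      static bool asid_matches(uint16_t tlbe_asid, uint16_t cur_asid)
      {
          if (tlbe_asid == cur_asid)
              return true;                   /* exact per-process match        */

          /* ID 0 within a group marks that group's global pages, visible
           * to every ASID in the same group.                                 */
          if ((tlbe_asid & ASID_ID_MASK) == 0 &&
              (tlbe_asid >> ASID_GROUP_SHIFT) == (cur_asid >> ASID_GROUP_SHIFT))
              return true;

          return false;
      }

      /* Per the examples above: asid_matches(0x0000, 0x03DE) is true,
       * asid_matches(0x0000, 0x045F) is false, and
       * asid_matches(0x0400, 0x045F) is true.                                */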



    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a
    full TLB flush on context-switch would suck pretty bad).

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half of the virtual address space is shared by all processes - there's no reason that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global
    pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 26 22:38:05 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    Seems like one might need a mechanism to remap the VM from real CR's to a partially emulated set of CR's (VCR's ?...).

    ARM does this by adding a layer above the OS ring that can trap
    accesses to certain control registers used by the OS to the
    hypervisor for resolution. But for the most part, the guest just
    uses the same control registers as if it were running bare metal with
    no trapping - they're just loaded by the hypervisor before the guest
    is dispatched and saved by the hypervisor when scheduling a new
    guest. Thats an advantage of the exception level scheme, where
    each level has its own set of control registers.

    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    Nice to know someone already blazed the trail.

    Note that a handful of system registers, when accessed
    using the MRS/MSR instructions are self-synchronizing
    with-respect to other state. This, architecturally,
    does _not_ hold when accessed via MMIO.


    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    My 66000 has a HR (Header Register) instruction to access one
    register at a time, but a MM (memory to memory move) instruction
    can be used to swap the entire core-stack {HV-level context switch.}
    MM to a MMI/O space is guaranteed to be ATOMIC across the entire
    transfer.

    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    <snip>

    Nested Page Tables are the best solution (Fewest SW instructions of overhead and total cycles of latency) we currently know of.

    Indeed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Nov 26 22:41:37 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:




    The preferred way to access a cores own system registers is
    via the MSR/MRS instructions.

    My 66000 has a HR (Header Register) instruction to access one
    register at a time, but a MM (memory to memory move) instruction
    can be used to swap the entire core-stack {HV-level context switch.}
    MM to a MMI/O space is guaranteed to be ATOMIC across the entire
    transfer.

    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    Yes, we do that (useful on chips that can also be a PCIe endpoint).

    Even AMD does that with the memory controllers, SMI, I2C/I3C
    etc. appearing as PCI endpoints.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sun Nov 26 22:46:55 2023
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    You'll need to provide more than an assertion for that.


    Much of the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    Then you've a poorly written handler. Note that a hardware table
    walker doesn't need to save any registers.

    <snip>

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtual address space is shared by all processes - there's no reason

    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global pages.

    Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
    last decade.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 23:10:36 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    My 66000 memory maps control registers {CPU, LLC, NorthBridge,
    device, ...} into MMI/O space. A CPU, with access permission,
    can read or write another CPU's control registers--used sparingly
    to get out of trouble. Mainly this is used to allow a CPU to
    read or write device control registers.

    ARM supports access to CPU system registers via MMIO;
    primarily for debug purposes. System Registers may be accessed
    either via MMIO accesses from a running core, subject to
    permission controls, or via JTAG interface(s).

    Nice to know someone already blazed the trail.

    Note that a handful of system registers, when accessed
    using the MRS/MSR instructions are self-synchronizing
    with-respect to other state. This, architecturally,
    does _not_ hold when accessed via MMIO.

    My 66000 architecture specification indicates that when a CPU control
    register is written, the CPU performs as if it saved all
    current state, allowed the write to transpire, and then
    reloaded all the state.

    The "as if" qualifier allows an implementation to take less cycles
    when it recognizes certain situations.

    But, this is one of those things that falls out "for free"* when the
    HW knows how to perform context switches as if thread-state were in
    memory. {{(*) nothing is ever free, but if you have HW context
    switches there are a lot of other things that can be made "as if"}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 23:13:07 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    But it is not just system registers, but all storage within a
    CPU/core, the L2 control status registers, the HostBridge
    control and status registers,...EVEN the register Registers
    are available--remotely.

    Yes, we do that (useful on chips that can also be a PCIe endpoint).

    Even AMD does that with the memory controllers, SMI, I2C/I3C
    etc. appearing as PCI endpoints.


    I look at it like this: you are going to need the ability to reach
    into the innermost areas of the chip and look at what is going on.
    The easiest means to get here, today, is via PCIe--JTAG is not that
    useful when there are 1T bits you might want to look at.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sun Nov 26 23:29:35 2023
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things >>>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    You'll need to provide more than an assertion for that.

    Service a TLB miss with an L2 TLB is about 6 cycles on my 1-wide machine. Walking the page tables may be as few as 1 access or as many as 24 to
    L2 cache (adding in whatever cache miss latency transpires). With reasonable Table Walk Caching, we may average 30-cycles {Hardware table walk} So,
    at one end we have 6-cycles and at the other we have 24 serially dependent
    L2 misses:: but averaging around 30-cycles.

    Service a timer interrupt:: 10-cycles waiting for thread-state to arrive,
    Cache miss waiting for instructions for the ISR dispatcher, 3 instructions to transfer control to the ISR handler. Another cache miss waiting for instructions.
    At this point the handler needs to tell the timer it has been serviced, and optionally to send it a count of the next time it should go off. Schedule
    a DPC/softIRQ, unwind the handler/dispatcher stack, and return from dispatcher only to end up at DPC/softIRQ.

    I can't see this taking less than 100 cycles.......and vastly more if SW is burdened with doing the save and restore after finding registers to use
    while shuffling data to some stack.


    Much of the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    Then you've a poorly written handler. Note that a hardware table
    walker doesn't need to save any registers.

    Neither does the My 66000 ISR dispatcher. By the time control arrives,
    old thread state has been returned (at least conceptually) to memory
    and the CPU has its new thread state loaded {including IP, Root Pointer, ISRasid, ISR SP, ISR FP if desired, and pointers to things the ISR may
    want quick access to when it receives control}--all reentrantly.

    So, I contend it is not the writing of the ISR handler, it is the architecture which causes the ISR handler to have such a big prologue and epilogue.

    <snip>

    That's unnecessarily harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtual address space is shared by all processes - there's no reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global pages.

    Well-done ASIDs prevent the need for TLB flushing except when kicking a thread out of the ASID bucket-list.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to BGB on Sun Nov 26 19:08:09 2023
    On 2023-11-26 4:17 p.m., BGB wrote:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things >>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work.   Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired).   And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    Much of the work in the time spent in the latter is saving/restoring the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    At least, excluding something like using B-Tree based page tables...

    It could be made faster, but would likely require doing the TLB miss
    handler in ASM and only saving/restoring the minimum number of registers (well, at least until we detect that there will be a page-fault, which
    would still require falling back to a "more comprehensive" handler).


    Any L1 miss penalties from the page-walk itself would likely also apply
    to a hardware page-walker.

    A hardware table walker strikes me as not being a large component.
    Although untested yet, the Q+ table walker is only about 1,200 LUTs or
    1% of the FPGA. Given the small size I think it is worth it to have the
    table walker in hardware. It is hard to beat hardware timing wise when
    it does not need to save / restore registers.



    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).


    I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
    With a 16K page size, this is basically enough to keep roughly something
    the size of the working set of Doom entirely in the TLB.


    In my past experiments, 16K seemed to be the local optimum for the
    programs tested:
    4K and 8K resulted in higher miss rates;
    32K and 64K resulted in more "internal fragmentation" without much reduction in miss rate.


    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.


    This is why my TLB is 4-way set-associative.

    An 8-way TLB would be a lot more expensive, and a fully-associative TLB
    (of nearly any non-trivial size) would be effectively implausible.


    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise.   Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.


    Yeah.

    I guess a factor here is mostly defining rules to both allow for and
    control the scope of global pages.

    In my case:
      The TTB register defines an ASID in the high order bits;
      The TLBE also has an ASID;
      The ASID is split into two parts (6 and 10 bits).
        In the ASID, 0 designates global pages
        But they are broken into "groups"
        So typically a global page is only shared within a given group.

    I am thinking the 6.10 split may have given too many bits to the group,
    and 4.12 or 2.14 might have been better.

    As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
    would not (but would see global pages in ASID 0400).

    So, say, in the current scheme:
      ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
    global address space.


    Where, say, if during a TLB Miss, if a page is marked global, it can be
    put into one of these ASIDs rather than the main ASID of the current
    process (if not in an ASID range which disallows global pages).

    The size of the group will have an effect on miss rate in cases where
    there are a lot of active PIDs though.



    Big TLB + strategic sharing and ASIDs can help here at least (whereas, a >>> full TLB flush on context-switch would suck pretty bad).

    That's unnecessarily harsh.   Consider that on Intel/AMD/ARM the
    kernel half
    of the virtual address space is shared by all processes - there's no
    reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global pages.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Robert Finch on Sun Nov 26 21:06:05 2023
    On 11/26/2023 6:08 PM, Robert Finch wrote:
    On 2023-11-26 4:17 p.m., BGB wrote:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other
    things
    like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work.   Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired).   And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to
    service a TLB miss...

    Much of the work in the time spent in the latter is saving/restoring
    the relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    At least, excluding something like using B-Tree based page tables...

    It could be made faster, but would likely require doing the TLB miss
    handler in ASM and only saving/restoring the minimum number of
    registers (well, at least until we detect that there will be a
    page-fault, which would still require falling back to a "more
    comprehensive" handler).


    Any L1 miss penalties from the page-walk itself would likely also
    apply to a hardware page-walker.

    A hardware table walker strikes me as not being a large component.
    Although untested yet, the Q+ table walker is only about 1,200 LUTs or
    1% of the FPGA. Given the small size I think it is worth it to have the
    table walker in hardware. It is hard to beat hardware timing wise when
    it does not need to save / restore registers.


    Possible, though, until TLB Miss exceeds ~ 1% or so, it isn't really a
    huge priority either.

    In my current cases, it is generally less than 0.1% of the CPU time, so
    not yet a huge priority.

    Vs, say:
    ~ 1% for the 1kHz timer interrupt
    ~ 0.6% for syscall (down from around 1.2%).

    The optimization I had used for syscalls is mostly N/A for the timer
    interrupt though.


    Had considered inverted page tables as possible as well, but, making
    this faster isn't (yet) a terribly high priority.




    But, what one does need, is a way to perform context switches without
    also triggering a huge wave of TLB misses in the process.

    Why?

    Note that depending on the number of entries in your TLB
    and the scheduler behavior, it's unlikely that any prior
    TLB entries will be useful to a newly scheduled thread
    (in a different address space).


    I am mostly using a 256x 4-way TLB (so, 1024 TLBE's).
    With a 16K page size, this is basically enough to keep roughly
    something the size of the working set of Doom entirely in the TLB.


    In my past experiments, 16K seemed to be the local optimum for the
    programs tested:
    4K and 8K resulted in higher miss rates;
    32K and 64K resulted in more "internal fragmentation" without much
    reduction in miss rate.


    Having multiple banks of TLBs that you can switch between
    might be able to provide you with the capability to
    reduce the TLB miss rate on scheduling a new thread of
    execution - but CAMs aren't cheap.


    This is why my TLB is 4-way set-associative.

    An 8-way TLB would be a lot more expensive, and a fully-associative
    TLB (of nearly any non-trivial size) would be effectively implausible.


    For the most part, industry has settled on a large number
    of tagged TLB entries as a good compromise.   Some architectures have
    a global bit in the entry that can be set via the page
    table that indicates that ASID and/or VMID qualifications
    aren't necessary for a hit.


    Yeah.

    I guess a factor here is mostly defining rules to both allow for and
    control the scope of global pages.

    In my case:
       The TTB register defines an ASID in the high order bits;
       The TLBE also has an ASID;
       The ASID is split into two parts (6 and 10 bits).
         In the ASID, 0 designates global pages
         But they are broken into "groups"
         So typically a global page is only shared within a given group.

    I am thinking the 6.10 split may have given too many bits to the
    group, and 4.12 or 2.14 might have been better.

    As-is, say, ASID 03DE would be able to see global pages in 0000, but 045F
    would not (but would see global pages in ASID 0400).

    So, say, in the current scheme:
       ASID's 0000, 0400, 0800, 0C00, ... would exist as mirrors of the
    global address space.


    Where, say, if during a TLB Miss, if a page is marked global, it can
    be put into one of these ASIDs rather than the main ASID of the
    current process (if not in an ASID range which disallows global pages).

    The size of the group will have an effect on miss rate in cases where
    there are a lot of active PIDs though.



    Big TLB + strategic sharing and ASIDs can help here at least
    (whereas, a
    full TLB flush on context-switch would suck pretty bad).

    That's unnecessaryly harsh.   Consider that on Intel/AMD/ARM the
    kernel half
    of the virtrual address space is shared by all processes - there's no
    reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for
    global pages.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sun Nov 26 20:54:01 2023
    On 11/26/2023 4:46 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz
    CPU), then the cost of the TLB miss handling is on par with other things >>>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service
    a TLB miss...

    You'll need to provide more than an assertion for that.


    If the interrupt's save/restore prolog/epilog by itself burns ~ 500+
    cycles, then the time needed to do a few memory loads, some bit
    twiddling, and an LDTLB, mostly disappears in the noise...


    Granted, it cost more cycles to walk the page-table, compose, and load
    the TLBE, than it does to increment a counter variable, but...

    Nearly all the "expensive parts" will happen similarly in both cases.


    I could get along OK using a B-Tree as a page-table; despite the considerable cost difference between a simple 3-level page table walk
    and a B-Tree walk, this "merely doubled" the average cost of the TLB
    Miss handler...


    Both cases could be faster, but it would likely require writing the ISR handlers in ASM (and not saving/restoring all of the registers).

    And the potential savings are smaller:
    The TLB miss handler may also need to deal with ACL Miss and needs to be
    able to dispatch a Page Fault event;
    The IRQ Miss handler, meanwhile, may need to deal with other types of
    hardware events beyond just timer interrupts (though, at present, the
    timer is the only thing that generates an interrupt, pretty much
    everything else at present is polling IO).




    Much of the work in the time spent in the latter is saving/restoring the
    relevant registers, with the actual page table walk and 'LDTLB'
    instruction typically a fairly minor part in comparison...

    Then you've a poorly written handler. Note that a hardware table
    walker doesn't need to save any registers.


    Most of this logic is auto-generated by my C compiler.

    __interrupt void __isr_interrupt(void)
    {
    }

    By itself, this is going to save/restore all of the registers and burn roughly
    500 cycles in the process...


    Though, I had considered possibly adding a "__interrupt_min" keyword,
    which would try to minimize the number of registers saved/restored, but
    would not allow the ISR to implement a context switch...

    But, the latter restriction would make it "almost useless", as the main
    two interrupts where it might be useful (the IRQ and TLB Miss handlers),
    would also be naturally excluded as both may need to implement context switches.


    Did end up adding:
    __interrupt_tbrsave void __isr_syscall(void)
    {
    }

    Where "__interrupt_tbrsave" does at least optimize things in the case
    where we *know* we are going to do a context switch.

    In this case, it allows eliminating a few calls:
    isrsave = __arch_isrsave;
    memcpy(
        taskern->ctx_regsave,
        isrsave,
        __ARCH_SIZEOF_REGSAVE__);
    memcpy(
        isrsave,
        taskern2->ctx_regsave,
        __ARCH_SIZEOF_REGSAVE__);

    Which generally ended up burning another ~ 500 clock cycles.



    Note that at 50MHz, one would end up needing to invoke an ISR around
    1000 times per second (at roughly 500 cycles each) to hit 1%.

    Though, with syscalls, it was a little worse. But the new interrupt type
    has helped some.

    Now, syscalls are just behind the timer interrupt (which is at around 1%
    of the CPU, getting triggered at around 1000 times per second).


    The TLB Miss ISR is < 0.1% of the time, mostly by averaging under 100
    TLB misses per second.



    Though, some of this does mean that, despite the BJX2 core running at
    around 3x the clock-speed of an MSP430, I can't run the clock with a
    32kHz timer interrupt without effectively eating the CPU.

    So, this is one area where it seems like the MSP430 has an advantage...


    <snip>

    That's unnecessaryly harsh. Consider that on Intel/AMD/ARM the kernel half
    of the virtrual address space is shared by all processes - there's no reason
    that those entries need to be flushed on context-switch.


    AFAIK, on x86, automatic (near) full TLB flush on CR3 modification was
    the defined behavior?... Well, at least ignoring the support for global
    pages.

    Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
    last decade.

    It seems to have added "something" to support global pages, but doesn't
    appear to use an ASID.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Mon Nov 27 15:10:25 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 4:46 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 4:10 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/25/2023 1:28 PM, Scott Lurndal wrote:


    If you're taking an interrupt, to resolve guest TLB misses,
    performance is clearly not high priority.


    If one can stay under, say, 100-500 TLB misses per second (on a 50MHz >>>>> CPU), then the cost of the TLB miss handling is on par with other things >>>>> like handling the timer interrupt, etc...

    Any cycle used by the miss handler is a cycle that could
    have been used for useful work. Timer interrupt handling
    is often very short (increment a memory location, a comparison
    and a return if no timer has expired). And we're long
    past the days of using regular timer interrupts for scheduling
    (see tickless kernels, for example).


    It takes roughly as much time to service a timer interrupt as to service >>> a TLB miss...

    You'll need to provide more than an assertion for that.


    If

    Ah, speculation. Got it.


    the interrupt's save/restore prolog/epilog by itself burns ~ 500+
    cycles, then the time needed to do a few memory loads, some bit
    twiddling, and an LDTLB, mostly disappears in the noise...

    Again, if.


    Does x86 even tag the TLB entries with an ASID? I've been in ARMv8 land for the
    last decade.

    It seems to have added "something" to support global pages, but doesn't appear to use an ASID.

    They've had global pages since they introduced paging on the i386, IIRC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Nov 29 17:15:00 2023
    On Sun, 12 Nov 2023 20:55:27 +0000, Quadibloc wrote:

    I had tried, with all sorts of ingenious compromises of register spaces
    and the like, to fit all the capabilities I wanted into the opcode space
    of a single version of the instruction set, eliminating the need for
    blocks which contained instructions belonging to alternate versions of
    the instruction set.

    But if the 16-bit instructions I'm making room for are useless to
    compilers, that's questionable.

    At first, when I mulled over this, I came up with multiple ideas to
    address it, each one crazier than the last.

    Seeing, therefore, that this was a difficult nut to crack, and not
    wanting to go down in another wrong direction... instead, I found a way
    to go that seemed to me to be reasonably sensible.

    Go back to uncompromised 32-bit instructions, even though that means
    there are no 16-bit instructions.

    Then, bring back short instructions - effectively 17 bits long - so as
    to have room for full register specifications. This means an alternative block format where 16, 32, 48, 64... bit instructions are all possible.

    *But* because of the room 17-bit short instructions take up in the
    header, the 32-bit instructions are the same regular format as in the
    other case. Not some kind of 33-bit or 35-bit instruction with a new set
    of instruction formats.

    I have now modified the 17-bit shift instructions in the diagram, so that
    they can also apply to all 32 integer registers, and I have corrected the opcodes on the page

    http://www.quadibloc.com/arch/cw0101.htm

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Robert Finch on Fri Dec 1 07:48:45 2023
    On Thu, 30 Nov 2023 11:22:55 -0500, Robert Finch wrote:

    Having a look at the Concertina II ISA. I like the idea of
    pseudo-immediates. All the immediates could be moved to one end of the
    block and then skipped over during instruction fetch.

    That is the general idea, with one minor correction.

    The benefit of pseudo-immediates, like that of ordinary immediates,
    are that they're already available, because they were brought into the
    CPU by instruction fetch.

    They get skipped over by the _next_ step, instruction decode.

    Why a block structure? The goal is to have a situation where
    instruction decode is largely done in parallel for the whole
    block.

    The first step is - is there a header? If not, decode all eight
    32-bit instructions in the block in parallel.

    If so, process the header, and that will directly and immediately
    reveal where every instruction in the block begins, so again the
    next step has all the instructions being decoded in parallel.
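
    As a concreteness aid, a minimal sketch in C of that two-step decode.
    The 256-bit block of eight 32-bit slots comes from the description
    above, while the header marker and the start-position mask are
    assumptions invented for the sketch, not the real Concertina II header
    layout:

    #include <stdint.h>
    #include <stdbool.h>

    #define WORDS_PER_BLOCK 8

    struct block { uint32_t w[WORDS_PER_BLOCK]; };

    static bool has_header(const struct block *b)
    {
        return (b->w[0] >> 28) == 0xF;            /* assumed marker       */
    }

    static unsigned start_mask(const struct block *b)
    {
        return b->w[0] & 0xFF;                    /* assumed header field */
    }

    /* Step 1: header or not?  Step 2: with every start position known up
     * front, all the decoders can be kicked off in the same cycle; the
     * loop below just stands in for that parallel step. */
    int decode_block(const struct block *b)
    {
        unsigned starts = has_header(b) ? start_mask(b) : 0xFFu;
        int issued = 0;

        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            if (starts & (1u << i))
                issued++;                         /* decode_insn(b->w[i]) */
        return issued;
    }

    int main(void)
    {
        struct block plain = { .w = { 0x12345678 } };   /* no header marker */
        return decode_block(&plain) == 8 ? 0 : 1;
    }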

    The header allows the length that immediates would add to instructions
    to be in the pseudo-immediates instead, avoiding another potential
    complication to instruction decoding.

    In addition, having headers means that the instruction set can be
    expanded or made flexible without it being possible to change the
    mode of the CPU to cause it to read existing instruction code the
    wrong way. Any modifications to how instructions are to be interpreted
    are right there in the block header, so malware that can't alter
    code can't work around that by changing how it is to be read.

    Among the features the headers allow to be added are VLIW features,
    such as instruction predication and explicitly indicating which
    instructions can execute in parallel. This allows high-performance
    but lightweight (non-OoO) implementations if desired.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Dec 1 18:37:17 2023
    Quadibloc wrote:

    On Thu, 30 Nov 2023 11:22:55 -0500, Robert Finch wrote:

    Having a look at the ConcertiaII ISA. I like the idea of
    pseudo-immediates. All the immediates could be moved to one end of the
    block and then skipped over during instruction fetch.

    That is the general idea, with one minor correction.

    The benefit of pseudo-immediates, like that of ordinary immediates,
    are that they're already available, because they were brought into the
    CPU by instruction fetch.

    They get skipped over by the _next_ step, instruction decode.

    Why a block structure? The goal is to have a situation where
    instruction decode is largely done in parallel for the whole
    block.

    What if you had the advantages of the block header without the
    cost of the block header ??

    The first step is - is there a header? If not, decode all eight
    32-bit instructions in the block in parallel.

    Why not decode assuming there is a block header and also decode as
    if there were not a block header. Then you can multiplex (choose)
    later which one prevails. This puts the choice at at least 4 gates
    of delay into the decode cycle.

    If so, process the header, and that will directly and immediately
    reveal where every instruction in the block begins, so again the
    next step has all the instructions being decoded in parallel.

    You then have to route the instructions to the decoders. Are your
    decoders expensive enough in a wide implementation that this matters?
    The alternative is to have a no-header decoder running in parallel
    with a header decoder and choose which to use.

    The header allows the length that immediates would add to instructions
    to be in the pseudo-immediates instead, avoiding another potential complication to instruction decoding.

    In addition, having headers means that the instruction set can be
    expanded or made flexible without it being possible to change the
    mode of the CPU to cause it to read existing instruction code the
    wrong way. Any modifications to how instructions are to be interpreted
    are right there in the block header, so malware that can't alter
    code can't work around that by changing how it is to be read.

    You MAY be able to alter the headers later in the architecture's life,
    but ultimately you sacrifice forward compatibility.

    Among the features the headers allow to be added are VLIW features,
    Why would you want this ??
    such as instruction predication and explicitly indicating which
    instructions can execute in parallel.
    HW does not seem to have much trouble doing this already.
    This allows high-performance
    but lightweight (non-OoO) implementations if desired.
    Have any GBnOoO machines been successful ?

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Dec 1 21:58:46 2023
    BGB wrote:

    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    I have been thinking about this for a while::

    It seems to me that if one wants a robust system, the HyperVisor must
    support various serviced-HyperVisors. This second (less privileged HV)
    is, in essence, a HV that can crash without allowing the whole system
    to crash {just like virtual machines can crash and take their applications
    with them.}

    Secondly:: Running an ISR at HV level is a privilege inversion issue,
    the HV has to look at data structures maintained by a (not necessarily trustable) Guest OS--possibly corrupting the HV itself.

    So while a 3 level system gives you most of what you want in a modern
    system, it still has its own problems--that can be solved with a 4th.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Dec 1 21:15:59 2023
    On Wed, 29 Nov 2023 17:15:00 +0000, Quadibloc wrote:

    I have now modified the 17-bit shift instructions in the diagram, so
    that they can also apply to all 32 integer registers, and I have
    corrected the opcodes on the page

    http://www.quadibloc.com/arch/cw0101.htm

    And now I have completed the process of getting back to where I was before,
    by adding in the page

    http://www.quadibloc.com/arch/cw0102.htm

    which describes the instructions longer than 32 bits.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Dec 1 22:10:39 2023
    On Fri, 01 Dec 2023 18:37:17 +0000, MitchAlsup wrote:

    Why not decode assuming there is a block header and also decode as if
    there were not a block header. Then you can multiplex (choose) later
    which one prevails. This puts the choice at at least 4 gates of delay
    into the decode cycle.

    You are quite correct that this is a possible technique to speed up an implementation of the architecture, at the cost of using extra electricity
    to do work that will be thrown away later.

    I described things in terms of a naive implementation to make the concepts easier to understand.

    (quoting me)
    If so, process the header, and that will directly and immediately
    reveal where every instruction in the block begins, so again the next
    step has all the instructions being decoded in parallel.

    You then have to route the instructions to the decoders. Are your
    decoders expensive enough in a wide implementation that this matters?
    The alternative is to have a no-header decoder running in parallel with
    a header decoder and choose which to use.

    I wasn't thinking of routing instructions to decoders. Instead, the
    decoders simply sit behind the physical positions in the block where
    an instruction could begin, and the header (or the absence of a header)
    tells them to start decoding. Or, in the type of fast implementation
    you describe, to continue with decoding.

    You MAY be able to alter the headers later in the architecture's life,
    but ultimately you sacrifice forward compatibility.

    As long as I can avoid sacrificing *backwards* compatibility.

    Among the features the headers allow to be added are VLIW features,

    Why would you want this ??

    So the architecture could be used for very cheap embedded systems,
    in addition to heavyweight desktops and servers.


    This allows high-performance but lightweight (non-OoO) implementations
    if desired.

    Have any GBnOoO machines been successful ?

    Ah, you don't mean out-of-order 68000 machines. Of which there was only
    one, the 68050. You mean "great big not out of order" machines. Of which
    there were none, the design being, no doubt, so outrageous as to not
    even deserve the chance to fail, since it would have no chance to succeed.

    That's a very valid point, but any ISA for a "great big" machine can have
    a subset which no longer requires a "great big" machine.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Dec 1 23:12:14 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:

    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    I have been thinking about this for a while::

    It seems to me that if one wants a robust system, the HyperVisor must
    support various serviced-HyperVisors. This second (less privileged HV)
    is, in essence, a HV that can crash without allowing the whole system
    to crash {just like virtual machines can crash and take their applications with them.}

    Generally there must be a privilege level more privileged than
    hypervisor, which controls the hardware - particularly if one
    intends to 'schedule' multiple independent (not nested) hypervisors.

    Then there is a requirement in the cloud for a nested hypervisor; this
    can be done with a paravirtualized hypervisor, at some performance
    cost, or with a true hardware supported nesting capability.


    Secondly:: Running an ISR at HV level is a privilege inversion issue,
    the HV has to look at data structures maintained by a (not necessarily trustable) Guest OS--possibly corrupting the HV itself.

    Modern interrupt virtualization mechanisms (e.g. ARMv8 GICv4.1)
    handle guest interrupts completely in the hardware, with no
    hypervisor intervention involved in the most common cases
    (e.g. software generated interprocessor interrupts, virtual
    timer interrupts, message signaled interrupts, et alia).


    So while a 3 level system gives you most of what you want in a modern
    system, it still has its own problems--that can be solved with a 4th.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Dec 2 02:15:35 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:

    It seems to me, one should be able to get away with 3 modes:
    Machine / ISR;
    Supervisor;
    User.

    I have been thinking about this for a while::

    It seems to me that if one wants a robust system, the HyperVisor must support various serviced-HyperVisors. This second (less privileged HV)
    is, in essence, a HV that can crash without allowing the whole system
    to crash {just like virtual machines can crash and take their applications with them.}

    Generally there must be a privilege level more privileged than
    hypervisor, which controls the hardware - particularly if one
    intends to 'schedule' multiple independent (not nested) hypervisors.

    So, call my HV System Manage Mode and call my Guest HV the HyperVisor.

    Then there is a requirement in the cloud for a nested hypervisor; this
    can be done with a paravirtualized hypervisor, at some performance
    cost, or with a true hardware supported nesting capability.


    Secondly:: Running an ISR at HV level is a privilege inversion issue,
    the HV has to look at data structures maintained by a (not necessarily trustable) Guest OS--possibly corrupting the HV itself.

    Modern interrupt virtualization mechanisms (e.g. ARMv8 GICv4.1)
    handle guest interrupts completely in the hardware, with no
    hypervisor intervention involved in the most common cases
    (e.g. software generated interprocessor interrupts, virtual
    timer interrupts, message signaled interrupts, et alia).

    My 66000 has interrupt tables similar to RISC-V (in that you can have
    as many tables as you want, and any table can interrupt to any priority.)

    Unlike RISC-V, My 66000's LLC has a little machine which operates the
    tables, so devices raise an interrupt by sending a message to the
    little machine, which sets a bit in the table. When enabled,
    a core operating at a lower priority snarfs the table update
    and requests the highest priority pending and enabled interrupt
    (a getInterrupt "bus transaction"). When the response arrives and
    the core is still operating at a lower priority level, the core responds
    with a claimInterrupt (or a putInterrupt) "bus transaction" and
    only at this point stops running the old context code and context
    switches to the ISR dispatcher.

    Cores send IPIs by using the little machine.....
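
    Written out as a self-contained C sketch for readability: the
    getInterrupt and claimInterrupt transaction names are the ones used
    above, but the table layout, the priority model, and every function
    signature here are invented for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define NSRC 64
    typedef struct {
        uint64_t pending;          /* one bit per interrupt source */
        uint64_t enabled;
        int      priority[NSRC];   /* priority of each source      */
    } irq_table_t;

    /* A device (or another core) raises an interrupt by sending a message
     * to the "little machine", which just sets a bit in the table. */
    static void little_machine_raise(irq_table_t *t, int src)
    {
        t->pending |= 1ull << src;
    }

    /* getInterrupt "bus transaction": highest-priority pending+enabled source. */
    static int get_interrupt(const irq_table_t *t)
    {
        int best = -1;
        for (int s = 0; s < NSRC; s++)
            if ((t->pending & t->enabled) >> s & 1)
                if (best < 0 || t->priority[s] > t->priority[best])
                    best = s;
        return best;
    }

    /* Core side: only if it is still running at lower priority than the
     * offered source does it claim it and switch to the ISR dispatcher;
     * otherwise the old context keeps running. */
    static int core_take_interrupt(irq_table_t *t, int core_priority)
    {
        int src = get_interrupt(t);
        if (src >= 0 && t->priority[src] > core_priority) {
            t->pending &= ~(1ull << src);        /* claimInterrupt */
            return src;                          /* enter the ISR dispatcher */
        }
        return -1;
    }

    int main(void)
    {
        irq_table_t t = { .enabled = ~0ull };
        t.priority[5] = 7;
        little_machine_raise(&t, 5);
        printf("claimed source %d\n", core_take_interrupt(&t, 3));
        return 0;
    }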


    So while a 3 level system gives you most of what you want in a modern >>system, it still has its own problems--that can be solved with a 4th.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to MitchAlsup on Sat Dec 2 17:35:36 2023
    MitchAlsup wrote:

    Chris M. Thomasson wrote:

    On 12/1/2023 6:15 PM, MitchAlsup wrote:


    Cores send IPIs by using the little machine.....

    Fwiw, how would your system handle this function from Microsoft:

    https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

    Or, would that be kernel?

    Core could send multiple IPIs in a loop or core could send a single IPI
    to a kernel function that performs the loop.

    Since performing 1 IPI requires 2 STs and does not require waiting on a response, it is probably easier if the core does the loop.
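
    A sketch of the first option (the requesting core does the loop
    itself). The send_ipi() interface and the per-core acknowledgement are
    hypothetical, and here send_ipi() simply runs the handler inline so the
    sketch stays self-contained; on real hardware it would be the two-store
    IPI described above:

    #include <stdatomic.h>
    #include <stdio.h>

    #define NCORES 8

    static _Atomic int acked[NCORES];

    /* Runs on the target core: execute a full barrier, then acknowledge. */
    static void barrier_handler(int target)
    {
        atomic_thread_fence(memory_order_seq_cst);
        atomic_store(&acked[target], 1);
    }

    static void send_ipi(int target)         /* stand-in for the 2-store IPI */
    {
        barrier_handler(target);
    }

    /* Requesting core pokes every other core, then waits for the acks. */
    static void flush_process_write_buffers(int self)
    {
        for (int c = 0; c < NCORES; c++)
            if (c != self) {
                atomic_store(&acked[c], 0);
                send_ipi(c);
            }
        for (int c = 0; c < NCORES; c++)
            while (c != self && !atomic_load(&acked[c]))
                ;                             /* spin until that core has fenced */
    }

    int main(void)
    {
        flush_process_write_buffers(0);
        puts("all other cores have executed a barrier");
        return 0;
    }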

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Sat Dec 2 17:34:03 2023
    Chris M. Thomasson wrote:

    On 12/1/2023 6:15 PM, MitchAlsup wrote:


    Cores send IPIs by using the little machine.....

    Fwiw, how would your system handle this function from Microsoft:

    https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

    Or, would that be kernel?

    Core could send multiple IPIs in a loop or core could send a single IPI
    to a kernel function that performs the loop.

    Since performing 1 IPI requires 2 STs and does not require waiting on a response, it is probably easier if the core does the loop.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sat Dec 2 20:39:08 2023
    Paul A. Clayton wrote:

    On 11/24/23 9:43 AM, Robert Finch wrote:
    [snip]
    There is a lot of value in having a unique architecture.

    A uniquely difficult architecture like x86 increases the barrier
    to competition both from patents and organizational knowledge and
    tools. While MIPS managed to suppress clones with its patent on
    unaligned loads (please correct any historical inaccuracy), Intel
    was better positioned to discourage software-compatible
    competition — and not just financially.

    In Intel's case one must not just execute the x86 ISA but also be
    bug-for-bug compatible. AMD K5 was essentially sacrificed to find
    that bug-for-bug compatibility--that is they found the test vector
    set that defined x86.

    I suspect that the bad reputation of x86 among computer architects
    — especially with the biases from Computer Architecture: A
    Quantitative Approach which substantially informs computer
    architecture education — might also make finding talent more
    difficult. However, the prominence of the x86 vendors (working on
    something that actually gets produced and used by millions of
    people is gratifying) and the challenge of working on a difficult architecture would also attract talent (and perhaps more qualified
    talent).

    The x86
    has had a lot of things bolted on to it. It has adapted over time.
    Being able to see how things have changed is valuable.

    x86 provides more than one lesson on change/project management.
    The binary lock-in advantage of x86 makes architectural changes
    more challenging. While something like the 8080 to 8086 "assembly
    compatible" transition might have been practical and long-term
    beneficial from an engineering perspective, from a business
    perspective such would validate binary translation, reducing the
    competitive barriers.

    (Itanium showed that mediocre hardware translation between x86 and
    a rather incompatible architecture (and microarchitecture) would
    have been problematic even if native Itanium code had competitive

    So did Transmeta.

    performance. This seems reminiscent of the Pentium Pro's "issue"
    with 16-bit code; both seem to have been at least partially
    marketing failures. On the other hand, ARM designed a 64-bit
    architecture that is only moderately compatible with the 32-bit
    architecture — flags being one example of compatibility — and 32-
    bit support is now being mostly left behind for 64-bit
    implementations.)
    ----------------
    MIPS (even with its delayed branches, lack of variable length
    encoding, etc.) would probably be a better architecture in 2023
    than x86 was around 2010. The delayed branches might have been
    deprecated, VLE might have been added in an additional mode, and
    eventually complex-but-useful instructions would probably have
    been added. (MIPS would almost certainly have caught SIMD widening
    disease and had other temporarily useful extension additions, but
    the tradeoffs in 1985 were closer to those of 2023.)

    The early 00's were a good time to avoid being an architect--SIMD
    was very appealing and is now showing its age.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to A. Clayton on Sat Dec 2 20:53:00 2023
    In article <ukfvqu$2flaf$1@dont-email.me>, paaronclayton@gmail.com (Paul
    A. Clayton) wrote:

    This seems reminiscent of the Pentium Pro's "issue" with 16-bit
    code; both seem to have been at least partially marketing failures.

    For the scientific and technical markets, the Pentium Pro was just fine.
    I'm not sure you can call customers' desire to run 16-bit software on
    Pentium Pro a marketing failure. It was always going to happen, and if marketing people thought they could prevent it, they were fooling
    themselves.

    Mind you, these were the same marketing teams who a few years later
    wanted the Pentium 4 "NetBurst" microarchitecture, specifically because
    it would be introduced at high clock speeds. They'd been in a clockspeed
    battle with AMD for about two years, and sticking to any one thing that
    long means marketing people treat it as absolute truth.

    On the other hand, ARM designed a 64-bit architecture that is
    only moderately compatible with the 32-bit architecture - flags
    being one example of compatibility - and 32-bit support is now
    being mostly left behind for 64-bit implementations.

    Aarch64 has essentially no concessions to aarch32 compatibility as far as
    I can see. Emulating 32-bit on 64-bit would be painful because of
    predicated instructions: your dynamic binary translator has a hard time
    being sure that flags won't be used and thus need not be evaluated. The
    easy transition is, I think, due to the later date, and the small amount
    of aarch32 software written in assembler that's still in use.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sat Dec 2 21:20:59 2023
    On Fri, 01 Dec 2023 22:10:39 +0000, Quadibloc wrote:
    On Fri, 01 Dec 2023 18:37:17 +0000, MitchAlsup wrote:

    Have any GBnOoO machines been successful ?

    Ah, you don't mean out-of-order 68000 machines. Of which there was only
    one, the 68050. You mean "great big not out of order" machines. Of which there were none, the design being, no doubt, so outrageous as to not
    even deserve the chance to fail, since it would have no chance to
    succeed.

    That's a very valid point, but any ISA for a "great big" machine can
    have a subset which no longer requires a "great big" machine.

    Also, as you are well aware, Intel has included both "performance" and "efficiency" cores in its latest generations of CPUs, similar to the
    big.LITTLE architecture used for some ARM processors.

    And then AMD came along, with its own twist on this feature: their
    "little" processors aren't so little, having the same circuitry as the
    big ones, but laid out more compactly so they have to have a lower
    clock speed. That way, they're not so slow as to be a total waste in
    normal full-power operation, and thus add to the total core count.

    Well, another way to address the efficiency/little cores being a
    waste of space would be to reduce the waste by making them smaller.
    If their purpose is to save power consumption when nobody's using the
    computer, to just keep the OS alive while it waits for the keyboard
    or the mouse to ask it to do something... then they should be made
    really little.

    Like Intel's _original_ Atom processors, which were in-order. As they
    were standalone processors for light and cheap laptops, Intel made
    the right decision to switch to out-of-order for later versions, so
    they wouldn't be so slow as to be useless.

    But in-order efficiency cores that are there when the demands are very
    low? With features that let one optimize code for them, though?

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sun Dec 3 10:05:27 2023
    On Fri, 01 Dec 2023 21:15:59 +0000, Quadibloc wrote:
    On Wed, 29 Nov 2023 17:15:00 +0000, Quadibloc wrote:

    I have now modified the 17-bit shift instructions in the diagram, so
    that they can also apply to all 32 integer registers, and I have
    corrected the opcodes on the page

    http://www.quadibloc.com/arch/cw0101.htm

    And now I have completed the process of getting back to where I was
    before,
    by adding in the page

    http://www.quadibloc.com/arch/cw0102.htm

    which describes the instructions longer than 32 bits.

    Two further changes have been made.

    On the first page of the description of the ISA, I have noted that
    when VLIW features are used, indicating that instructions may be
    executed in parallel must not change the result of a calculation,
    since some implementations may ignore that directive.

    On the page about 17-bit instructions, I have changed the format
    of 128-bit floating-point numbers; instead of being a 128-bit version
    of temporary real, with more significand bits, I've added one exponent
    bit, subtracting one significand bit.

    The reason for this is to allow, with a 130-bit internal form, the
    *standard* 128-bit form for IEEE 754 floating-point numbers, which
    does have a hidden first bit, to be supported.
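
    For reference, the standard IEEE 754 binary128 interchange format packs
    1 sign bit, a 15-bit exponent (bias 16383) and 112 stored fraction
    bits, with the leading significand bit hidden; making that bit explicit
    gives a 113-bit significand, so an internal form with it explicit needs
    1 + 15 + 113 = 129 bits, and one more exponent bit makes the 130
    mentioned above. A small C sketch of the field extraction:

    #include <stdint.h>
    #include <stdio.h>

    /* Field layout of IEEE 754 binary128: 1 sign bit, 15 exponent bits
     * (bias 16383), 112 stored fraction bits, hidden leading bit. */
    struct bin128_fields {
        unsigned  sign;        /* 1 bit                            */
        unsigned  exponent;    /* 15 bits                          */
        uint64_t  frac_hi;     /* top 48 of the 112 fraction bits  */
        uint64_t  frac_lo;     /* low 64 of the 112 fraction bits  */
    };

    static struct bin128_fields unpack(uint64_t hi, uint64_t lo)
    {
        struct bin128_fields f;
        f.sign     = (unsigned)(hi >> 63);
        f.exponent = (unsigned)((hi >> 48) & 0x7FFF);
        f.frac_hi  = hi & 0xFFFFFFFFFFFFull;      /* 48 bits */
        f.frac_lo  = lo;                          /* 64 bits */
        return f;
    }

    int main(void)
    {
        /* 1.0 in binary128: exponent field = 16383, fraction = 0. */
        struct bin128_fields one = unpack(0x3FFF000000000000ull, 0);
        printf("sign=%u exp=%u (bias 16383)\n", one.sign, one.exponent);
        return 0;
    }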

    In addition to having the sixteen even-numbered registers available for
    such numbers, since 130 bits is so frustratingly shorter than 256 bits,
    I also make the registers with numbers of the form 4n+1 available, using
    the same scheme as I will use for 128-bit Decimal Floating Point in the
    IBM format. Tweaked slightly to allow internal forms of up to 168 bits
    instead of up to 160 bits.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Sun Dec 3 14:36:37 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Have any GBnOoO machines been successful ?

    Great Big in-order machines (why write non-OoO?):

    Multiflow has 7, 14, or 28 instructions per cycle, but of course its
    target market is supercomputing, i.e., throughput computing, and at
    the time the competition was pipelined SIMD (Cray etc.). Was it
    successful? Probably not that much.

    The 21164(a) is 4-wide, and was successful in its prime, but there was
    no OoO competition at the time. When the Pentium Pro appeared at
    200MHz, it took the SPECint95_base crown from the 300MHz 21164 <https://en.wikipedia.org/wiki/Alpha_21164#Performance>. Given that
    it held SPECint and SPECfp performance crowns for some time, one can
    consider it to be successful. Also, I think it was commercially
    somewhat successful. The 21164 including the 21164a also had a much
    longer lifespan than its predecessor, mainly due to the 21264 being
    late.

    The Larrabee (which eventually resulted in Knights Ferry) is a
    two-wide in-order design, but with very wide (512-bit) SIMD units.
    One probably cannot call it a success.

    Since the victory of OoO, people mostly limited themselves to two-wide
    in-order machines, probably because any more width is mostly wasted
    given the limited amount of instruction-level parallelism within a
    basic block. If people wanted more, they usually went to OoO (e.g., Bonnell->Silvermont, Knight's Ferry->Knight's Corner).

    One exception is ARM, which stayed with in-order in A53, A55, A510,
    A520, and switched from 2-wide to 3-wide in the A55->A510 transition,
    but interestingly went from 3 to 2 ALUs in the A510->A520 transition
    (but is still generally 3-wide).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Sun Dec 3 15:55:41 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Fri, 01 Dec 2023 22:10:39 +0000, Quadibloc wrote:
    That's a very valid point, but any ISA for a "great big" machine can
    have a subset which no longer requires a "great big" machine.

    Also, as you are well aware, Intel has included both "performance" and "efficiency" cores in its latest generations of CPUs, similar to the big.LITTLE architecture used for some ARM processors.

    Which interestingly leads to recent Intel desktop and laptop CPUs not supporting AVX-512, even on CPUs that have only the performance cores
    enabled, even though the P-cores have AVX-512 implemented.

    Likewise, big.LITTLE has led to ARM cores all only supporting
    128-bit-wide SVE, because wider SVE would be too costly on the LITTLE
    cores. It will be interesting to see what Apple does.

    Well, another way to address the efficiency/little cores being a
    waste of space would be to reduce the waste by making them smaller.
    If their purpose is to save power consumption when nobody's using the computer, to just keep the OS alive while it waits for the keyboard
    or the mouse to ask it to do something... then they should be made
    really little.

    That's not their primary purpose, or there would only be one such
    core. Intel has put 16 E-cores on Raptor Lake in order to be able to
    boast 24 cores (more than AMD's desktop offering) and 32 threads (same
    as AMD) total. And on tasks that can benefit from that many cores,
    such as some benchmarks, they are actually quite beneficial.

    ARM claims that their LITTLE in-order cores serve that purpose, but
    then, why put 4 or more of them on a smartphone SoC? They are
    certainly not more energy-efficient than the OoO brethren except at
    their lowest performance point (and then not by much).

    Apple uses OoO efficiency cores that are about as fast as a Cortex-A76
    (in case of M1). Apparently they have no problem with using an OoO
    core to "keep the OS alive while it waits"; given that modern cores
    use very little power while they wait, that's not surprising.

    But in-order efficiency cores that are there when the demands are very
    low?

    Intel runs the management engine on such a core, and AMD runs its
    equivalent on several ARM cores, but these work outside of the realm
    covered by the OS.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Sun Dec 3 19:45:39 2023
    On Sun, 03 Dec 2023 14:36:37 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Great Big in-order machines (why write non-OoO?):

    Of course, no doubt he was thinking of the Itanium, which was one
    of the most resounding failures in recent years.

    If one goes far enough back, of course, there's the IBM System/360
    Model 85. Unlike the Model 91, it was in-order, yet it offered more performance! This was because it had one thing the Model 91 didn't,
    a cache.

    The Model 85 was actually a failure for IBM in sales terms, but as
    that was because of an economic slump at the time it came out, IBM
    was not deterred from re-using the design, with a few additions and
    tweaks, in the IBM System/370 Model 165 and 168 a few years later.
    And those systems were quite successful.

    I've already noted that an in-order version of a great big architecture
    might make for nice lightweight efficiency cores in a big.LITTLE type
    design. But making those cores in-order has another nice benefit.

    No Spectre. No Meltdown. So, when the computer is actually active,
    these cores, instead of being a total waste of space, could be put
    to use as a ready-made sandbox for executing code sourced from the
    Internet.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Dec 3 20:18:30 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 14:36:37 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Great Big in-order machines (why write non-OoO?):

    Of course, no doubt he was thinking of the Itanium, which was one
    of the most resounding failures in recent years.

    Itanic, Multiflow, i860, and now probably Mill.

    If one goes far enough back, of course, there's the IBM System/360
    Model 85. Unlike the Model 91, it was in-order, yet it offered more performance! This was because it had one thing the Model 91 didn't,
    a cache.

    The Model 85 was actually a failure for IBM in sales terms, but as
    that was because of an economic slump at the time it came out, IBM
    was not deterred from re-using the design, with a few additions and
    tweaks, in the IBM System/370 Model 165 and 168 a few years later.
    And those systems were quite successful.

    Model 85 and 91 were combined into 195 but this still failed compared
    to CDC 7600.

    I've already noted that an in-order version of a great big architecture
    might make for nice lightweight efficiency cores in a BIG.little type
    design. But making those cores in-order has another nice benefit.

    When you deconstruct a GBOoO machine into a LBIO machine you invariably
    lose issue width, which takes the pressure off {TLBs, Caches, Bus, ...};
    the pipeline shrinks in stages, taking even more pressure off those.

    No Spectre. No Meltdown. So, when the computer is actually active,
    these cores, instead of being a total waste of space, could be put
    to use as a ready-made sandbox for executing code sourced from the
    Internet.

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO
    by following one simple rule:: No microarchitectural changes until
    the causing instruction retires. AND you can do this without losing performance.

    The existing camp of designs chooses not to.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Sun Dec 3 13:02:35 2023
    On 12/3/2023 12:18 PM, MitchAlsup wrote:

    snip

    Is my assessment (interspersed below) of the effects of this correct?


    When you deconstruct a GBOoO machine into a LBIO machine you invariably
    lose issue width,

    Which reduces performance


    which takes the pressure off {TLBs, Caches, Bus, ...}

    Which allows savings in ports, etc., thus further reducing gate count,
    thus chip size, thus cost.

    the pipeline shrinks in stages,

    Which reduces the cost of mis-predicted branches, thus counterbalancing
    "some" of the performance loss from eliminating OoO. Also, further
    reduces gate count.


    taking even more pressure off those.

    Overall, while the direction of the area/cost reduction, and performance
    loss are clear, the magnitude of these is more difficult to predict
    before actually doing it.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Dec 3 22:19:40 2023
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    Model 85 and 91 were combined into 195 but this still failed compared to
    CDC 7600.

    I definitely remembered the Model 195.

    Even if the CDC 7600 outsold it, though, in one way the Model 195 was
    an enormous success. Its microarchitecture ended up being, in general
    terms, copied by the Pentium Pro and the Pentium II.

    So, today, all computers are made this way - OoO pipeline plus cache.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Sun Dec 3 22:34:56 2023
    Stephen Fuld wrote:

    On 12/3/2023 12:18 PM, MitchAlsup wrote:

    snip

    Is my assessment (interspersed below) of the effects of this correct?


    When you deconstruct a GBOoO machine into a LBIO machine you invariably
    lose issue width,

    Which reduces performance


    which takes the pressure off {TLBs, Caches, Bus, ...}

    Which allows savings in ports, etc., thus further reducing gate count,
    thus chip size, thus cost.

    the pipeline shrinks in stages,

    Which reduces the cost of mis-predicted branches, thus counterbalancing "some" of the performance loss from eliminating OoO. Also, further
    reduces gate count.


    taking even more pressure off those.

    Overall, while the direction of the area/cost reduction, and performance
    loss are clear, the magnitude of these is more difficult to predict
    before actually doing it.

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel simulation indicated ½ the performance at 1/12 the area and likely 1/10 the power.
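
    Spelled out, those ratios say roughly 6x the performance per unit area
    and 5x the performance per watt for the 1-wide core; a one-liner to
    check (the 1/2, 1/12 and 1/10 figures are the ones quoted above):

    #include <stdio.h>

    /* Ratios quoted above for the 1-wide x86-64 study vs. the big core. */
    int main(void)
    {
        double perf = 0.5, area = 1.0 / 12.0, power = 1.0 / 10.0;
        printf("perf/area: %.1fx   perf/watt: %.1fx\n", perf / area, perf / power);
        return 0;
    }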

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Dec 3 22:39:26 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    Model 85 and 91 were combined into 195 but this still failed compared to
    CDC 7600.

    I definitely remembered the Model 195.

    Even if the CDC 7600 outsold it, though, in one way the Model 195 was
    an enormous success. Its microarchitecture ended up being, in general
    terms, copied by the Pentium Pro and the Pentium II.

    So, today, all computers are made this way - OoO pipeline plus cache.

    Depends on how accurately you think copying the 91's reservation stations counts.
    Most machines today implement value-free reservation stations because they
    are 1/8 the area and somewhat faster. Tomasulo used value-capturing reservation stations.

    Also note: the computer Luke was working on a few years ago used Scoreboard technology rather than reservation station technology.....

    Several of the very deep window machines use a dispatch-stack-like pre-scheduler before routing instructions to the FV reservation stations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Dec 3 23:25:38 2023
    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel
    simulation indicated ½ the performance at 1/12 the area and likely 1/10
    the power.

    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs, which we use
    because we don't have anything better, I'm not surprised
    you've expressed the wish that more research be done on
    using multiple CPUs in parallel.

    Myself, I don't believe the parallel programming problem
    is solvable; there will always be too many problems that
    have critical serial parts that are too big. But that
    doesn't mean that I think we're doomed to require big
    hot CPUs that hog electricity.

    Because the problem of writing small, bloat-free programs
    _is_ solvable. Back in the days when all we had was Windows
    3.1 running on 386 and 486 processors, that was enough to
    do nearly everything we do with computers today.

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    Now, it isn't in the interest of CPU makers and others in
    the computer industry for users not to be strongly motivated
    to run out and buy newer and faster processors every year
    or two. The death of Dennard Scaling, and the tapering off
    of Moore's Law, however, are taking the wind out of the sails
    of that. Eventually, the improvements will be so minor that
    the CPU makers won't have enough *money* to fund fabs that
    probe the ultimate limits of feature size any longer.

    For some users, CPUs made of some exotic material beyond
    silicon that was 10x as fast... but, because of yield
    issues, could only be used to make small in-order CPUs,
    so the CPUs are only 5x as fast... would be worth almost
    any price. Because the parallel programming problem hasn't
    been solved, whether or not it can be.

    And I don't begrudge them such a development, as it would
    be a step towards making better performance available to
    everyone, as demand drives research into bringing costs
    down.

    What the rest of us really need is lighter-weight software
    that isn't driven by the interests of computer makers instead
    of computer users.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Quadibloc on Sun Dec 3 16:01:43 2023
    On 12/3/2023 3:25 PM, Quadibloc wrote:
    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel
    simulation indicated ½ the performance at 1/12 the area and likely 1/10
    the power.

    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs, which we use
    because we don't have anything better, I'm not surprised
    you've expressed the wish that more research be done on
    using multiple CPUs in parallel.

    Myself, I don't believe the parallel programming problem
    is solvable; there will always be too many problems that
    have critical serial parts that are too big. But that
    doesn't mean that I think we're doomed to require big
    hot CPUs that hog electricity.

    Because the problem of writing small, bloat-free programs
    _is_ solvable. Back in the days when all we had was Windows
    3.1 running on 386 and 486 processors, that was enough to
    do nearly everything we do with computers today.

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    While I absolutely agree that there is too much resources spent on
    "graphical pizzazz", and while you could run many/most of the same
    programs, that doesn't mean there is no user benefit from faster CPUs.
    For example, you probably could run some simulations, fluid dynamics,
    finite element analysis, etc. but you were severely limited in the size
    of the program you could run in an acceptable amount of elapsed time.
    And applications like servers of various flavors certainly benefit from
    faster CPUs, as you need fewer of them. Not to mention driving graphics
    at more realistic resolutions, etc.

    So, no, it wasn't simply the greed of CPU makers that drove us to higher performance systems.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Dec 4 00:08:19 2023
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Dec 4 18:58:51 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Dec 4 19:54:10 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    My last year at AMD ('06); I was working on a 1-wide x86-64, eXcel
    simulation indicated ½ the performance at 1/12 the area and likely 1/10
    the power.

    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs, which we use
    because we don't have anything better, I'm not surprised
    you've expressed the wish that more research be done on
    using multiple CPUs in parallel.

    Myself, I don't believe the parallel programming problem
    is solvable; there will always be too many problems that
    have critical serial parts that are too big. But that
    doesn't mean that I think we're doomed to require big
    hot CPUs that hog electricity.

    Because the problem of writing small, bloat-free programs
    _is_ solvable. Back in the days when all we had was Windows
    3.1 running on 386 and 486 processors, that was enough to
    do nearly everything we do with computers today.

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    If not, why are they still adding unused bloat to them ??

    {{Come to think of it, my 2003 WORD is more useful than my wife's
    2022 WORD because mine wastes less space on the screen with stuff
    I never use.}}

    Now, it isn't in the interest of CPU makers and others in
    the computer industry for users not to be strongly motivated
    to run out and buy newer and faster processors every year
    or two. The death of Dennard Scaling, and the tapering off
    of Moore's Law, however, are taking the wind out of the sails
    of that. Eventually, the improvements will be so minor that
    the CPU makers won't have enough *money* to fund fabs that
    probe the ultimate limits of feature size any longer.

    My desktops tend to last 7-9 years before blowing out a power
    supply transistor. My laptops when the battery dies.

    For some users, CPUs made of some exotic material beyond
    silicon that was 10x as fast... but, because of yield
    issues, could only be used to make small in-order CPUs,
    Gallium Arsenide.
    so the CPUs are only 5x as fast... would be worth almost
    any price. Because the parallel programming problem hasn't
    been solved, whether or not it can be.

    And I don't begrudge them such a development, as it would
    be a step towards making better performance available to
    everyone, as demand drives research into bringing costs
    down.

    What the rest of us really need is lighter-weight software
    that isn't driven by the interests of computer makers instead
    of computer users.

    Bloatware is driven by the software companies needing to sell
    new SW features to stay in business. CPU companies don't care,
    customers with CD or DVD ROM disks don't care either....

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Dec 4 20:03:47 2023
    Quadibloc wrote:

    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    18 years ago, when I quit building CPUs professionally, GBOoO
    performance was 2× what an 1-wide IO could deliver. In those
    18 years the CPU makers have gone from 2× to 3× performance
    while the execution window has grown from 48 to 300 instructions.
    Clearly an unsustainable µArchitectural direction.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Mon Dec 4 19:58:18 2023
    Stephen Fuld wrote:

    On 12/3/2023 3:25 PM, Quadibloc wrote:


    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    While I absolutely agree that there is too much resources spent on
    "graphical pizzazz", and while you could run many/most of the same
    programs, that doesn't mean there is no user benefit from faster CPUs.
    For example, you probably could run some simulations, fluid dynamics,
    finite element analysis, etc. but you were severely limited in the size
    of the program you could run in an acceptable amount of elapsed time.

    I might note that all of those applications have no real limitation
    in parallelism.

    And applications like servers of various flavors certainly benefit from faster CPUs, as you need fewer of them. Not to mention driving graphics
    at more realistic resolutions, etc.

    So, no, it wasn't simply the greed of CPU makers that drove us to higher performance systems.

    I would say that CPU makers were driven to build faster, bigger, ...
    machines because SW makers were continuing to consume all of the
    available cycles (whether the end user cared or not.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Mon Dec 4 20:13:55 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Quadibloc wrote:

    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:

    We could still run word processors, do spreadsheets, even
    run _Mathematica_. All most computer users would miss would
    be a bit of graphical pizazz.

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    Dunno. I still use troff.


    If not, why are they still adding unused bloat to them ??

    You can't sell a new version if there's nothing different.



    My desktops tend to last 7-9 years before blowing out a power
    supply transistor. My laptops when the battery dies.

    My cubicle desktop (a Dell tower) is now 11 years old and
    going strong. It's only been powered down a half dozen times
    during that period. I did replace the boot disk with an
    SSD in 2013.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Mon Dec 4 20:34:16 2023
    BGB wrote:

    On 12/3/2023 5:25 PM, Quadibloc wrote:
    On Sun, 03 Dec 2023 22:34:56 +0000, MitchAlsup wrote:


    In my own efforts, I can note that a 50MHz CPU, with programs having
    memory foot-prints measured in MB (or less) is "not entirely useless".

    I was working on an Automotive Engine simulator in eXcel on a 33 MHz
    486. That CPU died and I got a 200 MHz Pentium Pro. On the 486, I
    could change a variable (rod length for example) and eXcel would be
    done by the time I walked to the fridge, got a beer and walked back.
    On the PP, it was done in less than 1 second.

    But, looking backwards, I am left to realize that, it seems, I am
    nowhere near close to the levels of performance or efficiency of a lot
    of these early systems.

    Like, seemingly, often it is not so much that the CPU is too weak or
    slow, but that my code is still slow. Often, taking for granted
    coding practices that were formed in the "relative abundance" of CPU
    power in the early 2000s.


    In nearly every other area of engineering, the design constraints were relatively constant; but in software, nearly everyone had the mistaken
    belief that the exponential increases in computing speed and power would continue indefinitely.

    Now it has been steadily falling off, but there has been a sort of
    collective denial about it.

    As I mentioned above, this has more to do with SW companies needing to
    stay in business than in satisfying customer requirements.

    Now, it isn't in the interest of CPU makers and others in
    the computer industry for users not to be strongly motivated
    to run out and buy newer and faster processors every year
    or two. The death of Dennard Scaling, and the tapering off
    of Moore's Law, however, are taking the wind out of the sails
    of that. Eventually, the improvements will be so minor that
    the CPU makers won't have enough *money* to fund fabs that
    probe the ultimate limits of feature size any longer.


    As someone who skipped every other generation, I bought my desktops
    more because the last one died than from the need for faster and faster
    CPUs. Also, since I always had a second disk in the box, transferring
    my files was as simple as moving the drive from box to box.

    I kinda suspect that when Moore's Law is good and dead, there may
    actually be a bit of a back-slide in these areas, as the "best" fabs
    will likely be more expensive to run and maintain than the "good but not
    the best" fabs, and this will create a back-pressure towards whatever is
    "the most bang for the buck" in terms of fab technology.

    The only thing chips smaller than 22nm bring is lower power (which we
    can use) and more cores (which apparently we cannot). We have been at 5GHz
    for nearly a decade.

    I also suspect that the transition from the past/current state, to this
    state of things, is a world where x86-64 is unlikely to fare well.

    It is losing %-TAM to cell phones and ARM.

    Say, in this scenario, x86-64 would be left with an ultimately
    unwinnable battle against the likes of ARM and RISC-V.

    ARM: yes; RISC-V: I would bet against, but it is too early to tell.
    With all the Chinese money in RISC-V I don't think USA.gov will
    allow what the pundits are predicting.

    The exact form things will take will likely depend on a tradeoff:
    Whether it is better to have a smaller number of cores getting the best possible single-thread performance;
    Or, a larger number of cores each giving comparably worse single-thread performance, but there can be more of them for cheaper and less power.


    Say, if you could have cores that only got 1/3 as much performance per
    clock, but could afford to have 8x as many cores in total for the same
    amount of silicon.


    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels model rather than traditional multi-threading with shared memory and synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    Namely, with such models, it may be possible to make better use of a
    many core system, with less pain and overhead than that associated with trying to spawn off large numbers of conventional threads and have them
    all sitting around trying to lock mutexes for shared resources.

    If you want multi-threaded parallel programs you need to design the
    host language under the assumption of having infinite cores (not just
    "many"). It is not the job of the programmer to distribute the work,
    and verify the fork/joins or synchronization, but the environment's.

    Though, not necessarily a great way to map this stuff onto "ye olde C",

    None of the von Neumann programming paradigms carry to the parallel realm.
    a) you can't single step the program
    b) you cannot assume that one inst is performed and then the next ...
    c) you cannot assume the "I call you and you return to me" control handoff
    d) you cannot assume there is 1 point of control

    so effectively one may end up, in this case, with the processes communicating in a form resembling COM objects or similar,

    You need a "net list" of how data flows through the application
    that manages itself.

    with the side effect that (given the structure of the internal dispatch loops), these "objects" can be self-synchronizing and thus don't need an explicit mutex (but, may potentially need a way for the task scheduler
    to queue up in-flight requests, which are then handled asynchronously; possibly with a mechanism in place to indicate whether the request will
    block the caller until it will be handled, or whether the caller will
    resume immediately, potentially even though the called object has not
    yet seen the request).

    You are starting to get the gist, but our feet are still stuck in the vN paradigm.

    Things like async/promises could scale a little more easily to "well, do
    this thing, potentially using as many cores as available". Though, asyncs
    don't make as much sense on a primarily or exclusively single-threaded
    system, and have an annoying level of overhead if emulated on top of
    conventional multithreading (it effectively needs a mutex-protected
    work queue, which can itself become a bottleneck).
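
    For reference, the fork/join shape of that model in standard C++ (using
    std::async/std::future) looks roughly like the sketch below; whether the
    runtime actually spreads the chunks across cores, and whether it funnels
    them through a shared internal queue, is up to the implementation, which
    is exactly the overhead concern raised above. The function name and
    chunk count are made up for illustration:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    // Sum a large vector by forking independent async tasks and joining
    // on their futures; no explicit threads, mutexes, or shared state.
    long long parallel_sum(const std::vector<int>& v, std::size_t chunks = 8) {
        std::vector<std::future<long long>> parts;
        const std::size_t n = v.size();
        const std::size_t step = (n + chunks - 1) / chunks;
        for (std::size_t lo = 0; lo < n; lo += step) {
            const std::size_t hi = std::min(n, lo + step);
            parts.push_back(std::async(std::launch::async, [&v, lo, hi] {
                return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
            }));
        }
        long long total = 0;
        for (auto& f : parts) total += f.get();    // join point
        return total;
    }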

    This is simply syntactic sugar--the programmer should not have to
    manage the parallelism !! The environment does this !!

    Ideally, one would need a mechanism to distribute and balance tasks
    across the available cores that does not depend on needing to lock a
    mutex. Say, for example, maybe using an inter-processor interrupt or
    similar to "push" tasks or messages to the other cores, with some shared visible state for "how busy each core is" but not needing to lock
    anything to look at this state.
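
    A rough sketch of the "visible but lock-free" load information part of
    that idea, assuming one atomic counter per core that the scheduler only
    reads when picking a target; the actual hand-off (an IPI, a doorbell
    write, a per-core queue) is not shown, and all names here are invented
    for illustration:

    #include <array>
    #include <atomic>
    #include <cstddef>

    constexpr std::size_t kCores = 8;              // illustrative core count

    // One pending-work counter per core, padded to avoid false sharing.
    struct alignas(64) CoreLoad {
        std::atomic<unsigned> pending{0};
    };
    std::array<CoreLoad, kCores> g_load;

    // Pick the least-busy core without taking any lock; relaxed reads are
    // fine because the counters are only a hint, not a guarantee.
    std::size_t pick_target_core() {
        std::size_t best = 0;
        unsigned best_load = g_load[0].pending.load(std::memory_order_relaxed);
        for (std::size_t i = 1; i < kCores; ++i) {
            unsigned l = g_load[i].pending.load(std::memory_order_relaxed);
            if (l < best_load) { best_load = l; best = i; }
        }
        return best;
    }

    // The sender would then bump the counter and push/IPI the task:
    //   g_load[target].pending.fetch_add(1, std::memory_order_relaxed);
    // and the target core decrements it once the task has been handled.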

    I used to make a joke:: Verilog compiles your parallel application into
    2 miles of straight line code, the first mile is the clock high code,
    the second mile is the clock low code. No loops, no branches, just
    LD/ST and compute.

    This was back in the 1 CPU/system era. But all those loops, linked
    data structures, and procedure call/return tree were completely flattened
    into straight line code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Mon Dec 4 21:33:01 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:


    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels model
    rather than traditional multi-threading with shared memory and
    synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Dec 4 17:21:27 2023
    I kinda suspect that when Moore's Law is good and dead, there may
    actually be a bit of a back-slide in these areas, as the "best" fabs
    will likely be more expensive to run and maintain than the "good but
    not the best" fabs, and this will create a back-pressure towards
    whatever is "the most bang for the buck" in terms of fab technology.

    AFAIK this future arrived a few years ago: the lowest cost
    per-transistor is not on the densest/smallest nodes any more, which is
    why many SoCs don't bother to use those densest/smallest nodes.

    I also suspect that the transition from the past/current state, to
    this state of things, is a world where x86-64 is unlikely to
    fare well.

    I suspect that the ISA makes sufficiently little difference at this
    point that it doesn't matter too much.

    Things like async/promises could scale a little easier to "well, do
    this thing, potentially using as many cores as available".

    Async/promises are handy for concurrency, but they don't bring much
    benefit for parallelism.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Tue Dec 5 00:57:40 2023
    On Mon, 04 Dec 2023 19:54:10 +0000, MitchAlsup wrote:

    Gallium Arsenide.

    I thought that while Gallium Arsenide was _once_ thought
    of as something faster than silicon, Intel had, by using
    it as a template for "stretched silicon", managed to
    improve silicon enough to make it just as good as Gallium
    Arsenide... or, at least, this seemed to be what they
    were claiming.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Tue Dec 5 01:08:01 2023
    Quadibloc wrote:

    On Mon, 04 Dec 2023 19:54:10 +0000, MitchAlsup wrote:

    Gallium Arsenide.

    Gallium Arsenide is what is used in Hubble's 60GHz radio links.

    I thought that while Gallium Arsenide was _once_ thought
    of as something faster than silicon, Intel had, by using
    it as a template for "stretched silicon", managed to
    improve silicon enough to make it just as good as Gallium
    Arsenide... or, at least, this seemed to be what they
    were claiming.

    Stretched, low-K dielectrics, high-K gates are all 10%-15% jumps
    at the chip level; as was FinFET and will be Gate-all-around.

    Gallium Arsenide is 5×; hideously expensive, dangerous to the
    workers in the FAB, and chemical disposal, low yield,.....

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Tue Dec 5 01:11:24 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    BGB wrote:


    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels model
    rather than traditional multi-threading with shared memory and
    synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    Because it treats each bit as if it had (at least) 4 states.

    Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
    but would allow one to use all 1M CPUs in a system; instantly, and with
    out rewriting anything !

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Tue Dec 5 09:17:03 2023
    On Mon, 04 Dec 2023 20:48:55 -0600, BGB wrote:

    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform acceptably on cheaper cores.

    The pressure would be in the direction of best perf/$, which will in
    turn be best perf per die area, which is not really a battle that x86 is
    likely to win in the longer term sense.

    If ARM or RISC-V catch up and end up being able to deliver more cores
    that are faster and cheaper than what is realistically possible for x86
    to offer, then x86's days will be numbered.

    I think this reasoning makes a lot of sense.

    The trouble is that:

    a) x86 has an enormous pool of software, and
    b) it is possible to build x86 processors, with current processes,
    that anyone can afford, and which have adequate performance, and
    c) much of the cost of a computer system is in the box housing
    the CPU, not just the CPU itself.

    However, in my opinion, x86-64 threw away the biggest advantage of x86,
    because it repeated the mistake of the 80286. It wasn't designed to
    make it easy and trivial for 16-bit Windows programs to run on 64-bit
    editions of Windows, without resorting to any fancy techniques like virtualization.

    Instead, they should just run, without Microsoft having to make much
    effort (of course, they would still have to thunk the OS calls).

    Then Windows' huge advantage, which carries over to the x86 architecture
    as well, the huge pool of software written for it, would be there in
    full.

    So Windows today seems to be in the situation where everything that is
    not bloatware is lost. That makes it easier for a competing architecture
    to win, it just has to not make the same mistake. Then lightweight
    programs _plus_ less complicated instruction decoding will compound
    the performance advantage of an alternate ISA.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Tue Dec 5 09:13:00 2023
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Overall, while the direction of the area/cost reduction, and performance
    loss are clear, the magnitude of these is more difficult to predict
    before actually doing it.

    Fortunately, ARM and Intel have implemented such CPUs, so we can
    measure it. For the small Gforth benchmarks, I see (numbers are times
    in seconds):

    sieve bubble matrix fib fft
    0.348 0.384 0.300 0.460 0.356 Bonnell 1.6GHz (in-order 2-wide 2008)
    0.146 0.208 0.090 0.239 0.154 Silvermont 2.4GHz (OoO 2-wide E-core 2013)
    0.112 0.124 0.028 0.116 0.036 Sandy Bridge 3GHz (OoO 4-wide P-Core 2011)
    0.099 0.095 0.035 0.074 0.037 Tremont 2.8GHz (OoO E-Core 2020)
    0.037 0.043 0.014 0.035 0.015 Rocket Lake 5.1GHz (OoO wide P-Core 2021)

    0.250 0.296 0.159 0.256 0.151 Cortex A55 1.8GHz (in-order 2-wide 2017)
    0.180 0.208 0.072 0.232 0.084 Cortex A73 1.8GHz (OoO 2-wide 2016)
    0.116 0.160 0.042 0.087 0.051 Cortex A76 2.2GHz (OoO 4-wide 2018)
    0.111 0.116 0.046 0.098 0.071 IceStorm 2.06GHz (OoO Apple M1 E-core)
    0.088 0.054 0.028 0.047 0.034 Firestorm 3.2GHz (OoO Apple M1 P-core)

    Bonnell is really slow. The A55 managed to be quite a bit faster even
    though it has the same width and not much faster clock rate. In any
    case, the A55 is beaten by Firestorm by a factor of 5 on most
    benchmarks (and these are benchmarks that are not helped much by the
    larger caches of Firestorm); not a factor 2 any more.

    As for area, yes, I guess that the A55 is smaller than Firestorm by
    more than a factor of 5. What does it help? There have been startups
    that tried to put many small cores on a single chip (and Intel, in a
    way, with Knight's Ferry, too). They were not particularly
    successful; even Knights Ferry, where the target market was
    supercomputing (where applications are well parallelizable and
    software pipelining works well) was replaced by OoO Knights Corner and eventually with the mainline wide OoO cores.

    My impression is that the caches and interconnect between cores costs
    so much area that it does not pay off to build lots of small ones.
    Intel announced one with 288 Gracemonts (the successor of Tremont),
    but Gracemont is more advanced (and would probably use more area in
    the technology of the day) than the GBOoO (probably K8) that Mitch
    Alsup compared the little core with in 2006.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 5 11:26:23 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    Given that OoO is a wildly inefficient way to improve
    the single-thread performance of CPUs

    "Wildly inefficient" in what way? As far as energy is concerned,
    comparing the A55 to the A75 <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
    at the highest respective efficiency, you get a factor of about 3.5 in performance for a cost factor of 1.1 in energy efficiency. Wildly
    inefficient?

    Compare that to just raising the voltage and clock of the in-order
    core: There you get the same factor 3.5 at a loss in efficiency by
    more than a factor of 2.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 5 11:07:09 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general
    terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers. They have no reorder buffer and no speculative execution.
    They have imprecise exceptions, whereas modern OoO processors have
    precise exceptions. And do they split instructions into uops that
    find their own way through the OoO execution engine? I don't think
    so, because that needs a reorder buffer.

    I have not read the HPS papers for a long time, but they certainly
    look closer to what is implemented in modern OoO machines. However,
    looking at my comments for [hwu&patt87isca], there is still quite a
    bit of difference between that and modern OoO.

    @InProceedings{patt+85a,
    author = "Yale N. Patt and {Wen-mei} Hwu and Michael Shebanow",
    title = "{HPS}, a New Microarchitecture: Rationale and Introduction",
    crossref = "micro85",
    pages = "103--108",
    annote = "CISC instructions are decoded into RISC instructions,
    which are executed in parallel using dynamic
    scheduling etc. This microengine is presented as a
    restricted data flow machine."
    }

    @InProceedings{patt+85b,
    author = "Yale N. Patt and Stephen W. Melvin and {Wen-mei} Hwu
    and Michael C. Shebanow",
    title = "Critical Issues Regarding {HPS}, a High Performance Microarchitecture",
    crossref = "micro85",
    pages = "109--116",
    annote = "Discusses in depth some of the issues in dynamic
    scheduling hardware."
    }

    @Proceedings{micro85,
    key = "MICRO-18",
    booktitle = "The $18^{th}$ Annual Workshop on Microprogramming
    (MICRO-18)",
    title = "The $18^{th}$ Annual Workshop on Microprogramming
    (MICRO-18)",
    year = "1985",
    }

    @InProceedings{hwu&patt87isca,
    author = "{Wen-mei} Hwu and Yale N. Patt",
    title = "Checkpoint Repair for Out-of-order Execution Machines",
    crossref = "isca87",
    pages = "18--26",
    note = "Newer version: \cite{hwu&patt87ieeetc}",
    annote = "Describes design issues in checkpoint mechanisms for
    precise interrupts and speculative execution. Their
    design uses backup register files and difference
    techniques for main memory. Instructions can be
    retired out-of-order, avoiding full window
    conditions."
    }

    @Article{hwu&patt87ieeetc,
    author = "{Wen-mei} Hwu and Yale N. Patt",
    title = "Checkpoint Repair for High-Performance Out-of-order
    Execution Machines",
    journal = ieeetc,
    year = "1987",
    volume = "36",
    number = "12",
    pages = "1496--1514",
    month = dec
    }

    @Proceedings{isca87,
    key = "ISCA-14",
    booktitle = "The $14^{th}$ Annual International Symposium on
    Computer Architecture (ISCA)",
    title = "The $14^{th}$ Annual International Symposium on
    Computer Architecture (ISCA)",
    year = "1987",
    address = "Pittsburgh, Pennsylvania",
    organization = "IEEE Computer Society TCCA and ACM SIGARCH",
    note = "{\em Computer Architecture News,} 15(2), June 1987",
    month = jun # " 2--5,",
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Tue Dec 5 14:59:04 2023
    BGB <cr88192@gmail.com> writes:
    On 12/4/2023 4:21 PM, Stefan Monnier wrote:
    I kinda suspect that when Moore's Law is good and dead, there may
    actually be a bit of a back-slide in these areas, as the "best" fabs
    will likely be more expensive to run and maintain than the "good but
    not the best" fabs, and this will create a back-pressure towards
    whatever is "the most bang for the buck" in terms of fab technology.

    AFAIK this future arrived a few years ago: the lowest cost
    per-transistor is not on the densest/smallest nodes any more, which is
    why many SoCs don't bother to use those densest/smallest nodes.


    OK.

    I also suspect that the transition from the past/current state, to
    this state of things, is a world where x86-64 is unlikely to
    fare well.

    I suspect that the ISA makes sufficiently little difference at this
    point that it doesn't matter too much.


    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform
    acceptably on cheaper cores.

    Are they cheaper? There are a lot of sunk costs already absorbed
    by the x86-64 family both at Intel and AMD.

    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 5 15:44:24 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:

    You CAN build Spectré-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.

    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    No. What mitigations do we have:

    * Retpolines (against Spectre v2): These ensure that an indirect
    branch mispredicts in a harmless way, so they completely suppress
    speculation. I have seen slowdowns in Gforth by up to a factor of
    9.5 from retpolines. All indirect branches in a process have to be
    converted to retpolines, and even if you do that, there is Inception
    (which works without any indirect branch in the victim).

    * Speculative load hardening (against Spectre v1): This adds the
    control dependencies as data dependencies to loads, essentially
    eliminating speculation of loads (and thus mostly eliminating
    speculation and its speed benefits). The slowdown for Ultimate SLH
    is a factor 2.5 on SPEC, and not much less for weaker SLH versions.
    The hope is that this can be done selectively, only on the loads
    where the attacker can influence the loaded address, reducing the
    slowdown. But if you make one mistake in the wrong direction, your
    program is vulnerable.

    * Ways to clear various microarchitectural state (e.g., branch
    predictors, caches), in the hope that this prevents the primed
    predictor to reach the victim, or the changed microarchitectural
    state to reach the attacker.

    None of these mitigations prevent speculative changes to
    microarchitectural state from continuing to be in microarchitectural
    state after the misprediction is resolved. By contrast, speculative architectural state is thrown away when the misprediction is resolved.
    That's why they are mitigations, not fixes.
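
    To make the Spectre v1 shape concrete, here is a hedged C++ sketch of
    the classic bounds-check gadget and of the index-masking idea that
    speculative load hardening generalizes. This is only the shape of the
    fix, not a guaranteed mitigation on any particular CPU, and the array
    names and sizes are invented for illustration:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kTableSize = 1024;       // power of two, illustrative
    extern uint8_t table[kTableSize];
    extern uint8_t probe[256 * 64];                // observable via the cache

    // Vulnerable shape: the bounds check can be speculated past, so both
    // dependent loads may execute with an out-of-bounds idx and leave a
    // cache footprint that encodes out-of-bounds data.
    uint8_t victim_unsafe(std::size_t idx) {
        if (idx < kTableSize)
            return probe[table[idx] * 64];
        return 0;
    }

    // Masked shape: since kTableSize is a power of two, the AND forces the
    // speculative load address into bounds as a pure data dependency, with
    // no branch for the predictor to get wrong. SLH generalizes this by
    // masking every load address with the controlling branch condition.
    uint8_t victim_masked(std::size_t idx) {
        if (idx < kTableSize) {
            idx &= (kTableSize - 1);               // in bounds even if misspeculated
            return probe[table[idx] * 64];
        }
        return 0;
    }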

    Since out-of-order is so expensive in power and transistors,
    though, if mitigations do exact a performance cost, then
    going to a simple CPU that is not out-of-order might be a
    way to accept a loss of performance, but gain big savings in
    power and die size, whereas mitigations make those worse.

    Buy a Raspi3, or, for more performance, Odroid HC4. No Spectre there,
    low power, maybe few transistors.

    Anyway, what the mainstream players have been doing seems to be:
    Hardware vendors throw the problem over to software people;
    application people do nothing about it, while systems software people
    try to mitigate the problems in various ways, including those outlined
    above. Users are lulled by the claim that they are not affected by
    Spectre, because there are other, easier-to-exploit vulnerabilities on
    their computer, and Spectre is supposedly so much harder to exploit. So
    they buy fast OoO CPUs rather than Odroid HC4s. And consequently
    nobody has capitalized on the Spectre-invulnerability of in-order
    cores.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Wed Dec 6 02:29:31 2023
    Quadibloc wrote:

    On Mon, 04 Dec 2023 20:48:55 -0600, BGB wrote:

    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform
    acceptably on cheaper cores.

    The pressure would be in the direction of best perf/$, which will be
    in turn best perf per die area, which is not really a battle that x86 is
    likely to win in the longer term sense.

    If ARM or RISC-V catch up and end up being able to deliver more cores
    that are faster and cheaper than what is realistically possible for x86
    to offer, then x86's days will be numbered.

    I think this reasoning makes a lot of sense.

    The trouble is that:

    a) x86 has an enormous pool of software, and
    b) it is possible to build x86 processors, with current processes,
    that anyone can afford, and which have adequate performance, and
    c) much of the cost of a computer system is in the box housing
    the CPU, not just the CPU itself.

    However, in my opinion, x86-64 threw away the biggest advantage of x86, because it repeated the mistake of the 80286. It wasn't designed to
    make it easy and trivial for 16-bit Windows programs to run on 64-bit editions of Windows, without resorting to any fancy techniques like virtualization.

    A lot of this had to do with reclaiming prefixes so that we could make
    -64 work in long mode.

    Instead, they should just run, without Microsoft having to make much
    effort (of course, they would still have to thunk the OS calls).

    This would have been a big loss--argument passing in registers, continued
    use of segmentation when everyone was using a flat memory model,.....
    No, the real problem was 286 creating the segmentation model in the first
    place. {I left a company at this transition because I did not want to go segmented style writing asm......they later went OoB.}

    Then Windows' huge advantage, which carries over to the x86 architecture
    as well, the huge pool of software written for it, would be there in
    full.

    So Windows today seems to be in the situation that all that which is
    not bloatware is lost. That makes it easier for a competing architecture
    to win, it just has to not make the same mistake. Then lightweight
    programs _plus_ less complicated instruction decoding will compound
    the performance advantage of an alternate ISA.

    Now imagine an architecture that context switches in 10 cycles instead
    of taking 1,000 cycles to reach somebody capable of slowly walking
    through the context switching process...................

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Wed Dec 6 07:31:54 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:
    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    Because it treats each bit as if it had (at least) 4 states.

    Actually, as I learned in HOPL-IV, Verilog won the speed race that
    counts, the one against VHDL, because it was designed around
    these 4 states and implements them efficiently, whereas VHDL allows
    more states.

    Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
    but would allow one to use all 1M CPUs in a system; instantly, and with
    out rewriting anything !

    Given that simulation efficiency is the reason that Verilog won, your 1-bit-Verilog should be a winner. But what do you do about the
    high-impedance state of MOS?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Wed Dec 6 07:54:07 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    BGB <cr88192@gmail.com> writes:
    The pressure against x86-64 is that one needs comparably expensive CPU
    cores to get decent performance, whereas ARM and RISC-V can perform
    acceptably on cheaper cores.

    Are they cheaper?

    Good question. I can buy a server with 128GB of ECC RAM and two
    enterprise SSDs with a Ryzen for EUR 2000. With that amount of money,
    I get no ARM or RISC-V machine with similar capabilities.

    Looking for places where I can actually buy something with ARM: The
    Rock 5B with 16GB RAM cost EUR 240 plus EUR 25 or so for the PSU (with
    some anxiety on whether it would work), without a case. I can buy a
    barebone with an Intel N100 starting at EUR 192 including case and
    PSU, but I have to add 16GB RAM for about EUR 30; The N100 is faster
    for single-threaded stuff, and probably similarly fast for
    multi-threaded stuff. Here's a speed comparison between the A76 on
    the Rock 5B, and the Tremont (predecessor of the Gracemont in the
    N100); numbers are times in seconds, lower is better:

    sieve bubble matrix fib fft
    0.099 0.095 0.035 0.074 0.037 Tremont 2.8GHz (OoO E-Core 2020)
    0.116 0.160 0.042 0.087 0.051 Cortex A76 2.2GHz (OoO 4-wide 2018)
    0.452 0.526 0.314 0.676 0.603 JH7100 1GHz

    The Raspi5 will be cheaper, but not offered with 16GB.

    Concerning RISC-V, you can buy a Visionfive V2 with a JH7110 (for
    USD100 with 8GB), but even with 1.5GHz, it will be dog slow, as you
    can see above.

    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    We already have data-flow microarchitectures since the mid-1990s, with
    the success of OoO execution. And the "von Neumann" ISAs have proven
    to be a good and long-term stable interface between software and these data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
    and their microarchitectures turned out to be uncompetitive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andreas Eder@21:1/5 to BGB on Wed Dec 6 10:04:04 2023
    On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:

    QEMU does better emulation, but lacks any real way of sharing files with
    the host OS.

    Look at what is described here:
    https://en.wikibooks.org/wiki/QEMU/FreeDOS

    You can simply mount the image (when qemu isn't running) and copy files
    to and fro.

    'Andreas

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Wed Dec 6 14:52:35 2023
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The
    main instructions that seem to have "the most bang for the buck"
    are ones that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.

    AND, no ISA with more than about 200 instructions should be considered RISC,

    The path lengths also varied considerably based on the compiler
    version — a more recent version usually helping RISC-V more as
    would be expected for a more recent ISA — though the results do
    seem to point to general consistency of path length across
    versions (one benchmark had negligible change for both ISAs, one
    improved AArch64 only, two helped RISC-V only, and one helped both
    ISAs but RISC-V more than AArch64).

    I am somewhat surprised that indexed memory accesses did not
    benefit AArch64 more (for such "scientific" benchmarks). AArch64's
    need for a distinct comparison instruction for branches presumably
    hurt, especially since loops were not unrolled. (AArch64 does, I
    think, include a branch on equal/not-equal zero, so reverse
    counted loops would have removed that disadvantage in some cases.)

    My data indicates the indexed advantage is in the 2%-3% range.

    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    so one would expect the
    most common operations to be present as instructions in both. The
    differences would be mainly in special instructions (AArch64 has
    many), memory addressing (AArch64 has more complex addressing
    modes), branches (RISC-V has comparisons on integer values in the
    branch instruction, AArch64 can sometimes set condition codes
    without an additional instruction), and immediate sizes (AArch64
    has larger base immediates — 16-bit vs. 12-bit and ways of
    generating some larger immediates).

    The special instructions seem unlikely to affect path length much
    on such benchmarks and I suspect most of the constants are either
    small integers or floating point values. This leaves branches and
    memory accesses to affect path length.

    A compiler or a web browser would have more interesting
    instruction use, I suspect.

    The benchmarks used were:
    • STREAM [11]
    A benchmark for measuring sustained memory bandwidth widely used
    in industry, this consists of 4 simple kernels applied to elements
    of arrays of size 10,000,000.
    • CloverLeaf Serial [10]
    A high energy physics simulation solving the compressible Euler
    equations on a 2D Cartesian grid. This is broken down into a
    series of kernels each of which loops over the entire grid. This
    is run with default parameter.
    • MiniBUDE [12, 15]
    A mini app approximating the behaviour of a molecular docking
    simulation used for drug discovery. Run with the bm1 input at 64
    poses for one iteration (-n 64 -i 1 --deck /bm1).
    • Lattice Boltzmann (LBM)
    A d2q9-bgk Lattice Boltzmann algorithm, developed within the HPC
    Research Group at the University of Bristol, optimised for serial
    execution. Run with a grid size of 128x128 for 100 iterations.
    • Minisweep [13]
    A radiation transportation mini app reproducing the Denovo Sn
    radiation transport behaviour used for nuclear reactor neutronics
    modeling. Run with options --ncell_x 8 --ncell_y 16 --ncell_z 32
    --ne 1 --na 32

    The paper was not really very good.

    Captain Obvious strikes again.

    While some would argue that
    excluding cache misses and branch mispredictions from
    consideration even for maximum ILP is silly — I do not have a
    problem with such in a limit study — the lack of loop unrolling
    (or value inference/prediction for incremented values) makes the
    results less accurately reflect a true maximum. The benchmarks are
    also such that parallelism is much higher than usual.

    Comparing performance of ISAs in such a limit study (same
    frequency) seems to mostly be comparing the dataflow traits of the
    programs rather than the tradeoffs presented by the ISAs, though
    there were notable differences when instruction latencies were
    allowed to be more realistic.

    Fair ISA comparisons are hard. ISA interacts with multiple aspects
    of microarchitecture. One could present an optimized
    implementation space (with the dimensions of energy, time-to-
    completion, and area/yield/cost — one might have to model an
    optimum financial binning!), but that would seem to involve an
    enormous amount of work even with rough approximations.

    More testing may be needed.

    Choosing benchmarks (and what to measure) tends to be iterative.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Wed Dec 6 15:03:26 2023
    BGB <cr88192@gmail.com> writes:
    On 12/5/2023 8:29 PM, MitchAlsup wrote:
    Quadibloc wrote:

    But, the decoder still worked as-is for 32-bit x86, and the CPU isn't
    going to be running 16-bit and 64-bit code at the same time, ...

    Granted, IIRC an issue was that when Long-Mode-Enable is set, the mode
    bit-patterns for 16-bit mode were reused for 64-bit mode (and VM86 mode
    went poof as well).

    But, otherwise they might have needed to "get creative" and find a way
    to encode more CPU modes.

    Either way, would have been happier if MS had included a built-in
    emulator for 16-bit stuff.

    At that point, nobody was using the 16-bit stuff except for a few
    hobbyists. Good riddance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Dec 6 17:44:50 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:
    That is true, but only really usable when the resulting design
    is realized on silicon. Verilog simulations won't win any
    speed races, even with verilator.

    Because it treats each bit as if it had (at least) 4 states.

    Actually, as I learned in HOPL-IV, Verilog won the speed race that
    counts, the one against VHDL, because it has been designed around
    these 4 states, and implementing them efficiently, whereas VHDL allows
    more states.

    VHDL allows for "current fighting" between 2 driving nodes.

    Verilog, with the model of 1-bit == 1-bit, would only have a 3× penalty;
    but would allow one to use all 1M CPUs in a system; instantly, and with
    out rewriting anything !

    Given that simulation efficiency is the reason that Verilog won, your
    1-bit-Verilog should be a winner. But what do you do about the
    high-impedance state of MOS?

    This requires the 4-state model (to mimic anything similar to real circuitry.) In any event, technologies smaller than 30nm no longer allow this form of logic.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Wed Dec 6 19:07:32 2023
    BGB <cr88192@gmail.com> writes:
    On 12/6/2023 3:04 AM, Andreas Eder wrote:
    On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:

    QEMU does better emulation, but lacks any real way of sharing files with
    the host OS.

    Look at what is described here:
    https://en.wikibooks.org/wiki/QEMU/FreeDOS

    You can simply mount the image (when qemu isn't running) and copy files
    to and fro.


    Windows can't mount filesystem images...

    WSL1 can't do it either, and WSL2 doesn't work on my PC.

    Maybe it's time to switch to linux? Or at least a dual-boot
    setup?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Wed Dec 6 21:55:55 2023
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 12/6/2023 11:07 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/6/2023 3:04 AM, Andreas Eder wrote:
    On Di 05 Dez 2023 at 17:44, BGB <cr88192@gmail.com> wrote:

    QEMU does better emulation, but lacks any real way of sharing files with
    the host OS.

    Look at what is described here:
    https://en.wikibooks.org/wiki/QEMU/FreeDOS

    You can simply mount the image (when qemu isn't running) and copy files
    to and fro.


    Windows can't mount filesystem images...

    WSL1 can't do it either, and WSL2 doesn't work on my PC.

    Maybe it's time to switch to linux? Or at least a dual-boot
    setup?


    :^) Fwiw, I remember using a lot of those handy harddrive caddies way
    back 22'ish years ago. I remember one time I had a lot of them, Solaris,
    Linux, WinNT4, WinME, MSDOS, etc...



    Today you can boot off an SSD connected to USB. With USB-C and an
    NVME ssd, you can get excellent performance to boot.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Thu Dec 7 13:33:53 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    But what do you do about the
    high-impendance state of MOS?

    This requires the 4-state model (to mimic anything similar to real circuitry).
    In any event, technologies smaller than 30nm no longer allow this form of
    logic.

    But is it prevented by static checking? If not, you still need to
    represent it in simulation, and report it as a bug there.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:03:50 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    But what do you do about the
    high-impendance state of MOS?

    This requires the 4-state model (to mimic anything similar to real circuitry.)
    In any event, technologies smaller than 30nm no longer allow this form of
    logic.

    But is it prevented by static checking? If not, you still need to
    represent it in simulation, and report it as a bug there.

    You still need the X state {don't know if the value is 1 or 0}--set all flip-flops to X at power on and watch HW achieve initialized state.
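
    One common way simulators keep 4-state semantics cheap is to carry a
    "value" plane and an "unknown" plane per signal. The C++ sketch below
    shows the idea for a single-bit AND (the encoding and names are
    illustrative, not how any particular simulator stores its nets, and Z
    is folded into X for brevity):

    #include <cstdio>

    // Two-plane encoding of a 4-state scalar:
    //   {unk=0, val=0} -> 0,  {unk=0, val=1} -> 1,  {unk=1, val=*} -> X
    struct Logic4 {
        bool val;
        bool unk;
    };

    // 4-state AND: a definite 0 on either input dominates even an X;
    // otherwise any unknown input makes the result unknown.
    Logic4 and4(Logic4 a, Logic4 b) {
        const bool a0 = !a.unk && !a.val;          // a is definitely 0
        const bool b0 = !b.unk && !b.val;          // b is definitely 0
        if (a0 || b0)         return {false, false};   // result is 0
        if (!a.unk && !b.unk) return {true,  false};   // 1 AND 1
        return {false, true};                           // result is X
    }

    int main() {
        Logic4 x    = {false, true};               // X, e.g. an uninitialized flop
        Logic4 one  = {true,  false};
        Logic4 zero = {false, false};
        std::printf("X & 0 -> unk=%d val=%d\n", and4(x, zero).unk, and4(x, zero).val);
        std::printf("X & 1 -> unk=%d val=%d\n", and4(x, one).unk,  and4(x, one).val);
    }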

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to MitchAlsup on Thu Dec 7 21:14:53 2023
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The main
    instructions that seem to have "the most bang for the buck" are ones
    that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    I still struggle to find a good definition of "1 instruction". For me
    the definition is loosely "one distinct operation", and so there can be
    many variants of one instruction (e.g. variants with register operands
    or register + immediate operands all count as a single instruction),
    that all carry out the same operation, but with different kinds of
    operands or operand sizes.

    In my ISA I refer to these as "major instructions", and each
    instructions typically have several variants (currently up to 18
    variants per major instruction, where different permutations of scalar
    and vector register operands count as different variants of a single instruction, for instance).

    If I count this way, I currently have 106 instructions, which by your definition safely puts MRISC32 in the "RISC" camp. However, if I count
    every variant as a separate instruction, I blow the budget.

    Also worth mentioning is that my current instruction encoding scheme
    allows for 1535 major instructions, so there is still plenty of room for extensions (even though I already have pretty complete integer,
    floating-point and vector support).


    The path lengths also varied considerably based on the compiler
    version — a more recent version usually helping RISC-V more as
    would be expected for a more recent ISA — though the results do
    seem to point to general consistency of path length across
    versions (one benchmark had negligible change for both ISAs, one
    improved AArch64 only, two helped RISC-V only, and one helped both
    ISAs but RISC-V more than AArch64).

    I am somewhat surprised that indexed memory accesses did not
    benefit AArch64 more (for such "scientific" benchmarks). AArch64's
    need for a distinct comparison instruction for branches presumably
    hurt, especially since loops were not unrolled. (AArch64 does, I
    think, include a branch on equal/not-equal zero, so reverse
    counted loops would have removed that disadvantage in some cases.)

    My data indicates the indexed advantage is in the 2%-3% range.

    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Dec 7 22:34:08 2023
    BGB wrote:

    On 12/7/2023 2:14 PM, Marcus wrote:

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.


    Yeah.


    Load/Store, and doesn't use a "variable number of bytes" encoding scheme (like x86/Z80/6502 variants).

    Does a variable number of words fit this criterion?

    Or, the 'R' could refer more to keeping instruction logic simple, rather
    than minimizing the number of instructions that can exist in the
    instruction listing.

    In the end it is how do you fit K instructions through your pipeline in
    fewer cycles than someone can fit 1.4×K instructions through their pipeline.

    Well, and probably that it is viable to implement a CPU core for the
    entire ISA without needing a microcode ROM or similar.

    There is no microcode in My 66000 1-wide or 6-wide implementations.
    But there is no reason one could not build a My 66000 using microcode
    should that be the best choice for some implementation.

    It is probably not viable to build a {bug for bug} compatible x86
    without microcode.

    Admittedly, I feel unease with instructions which violate the Load/Store model, which goes for both my experimental LDOP extension and the RISC-V
    'A' extension (where essentially LDOP and 'A' represent the same basic
    CPU functionality).

    Though, it is "sort of passable" in that it is possible to implement
    these by shoving a minimal ALU into the L1 cache, rather than needing to restructure the whole pipeline (as would be needed for a more general x86-like model).

    It is SO EASY to track this dependency based on register forwarding
    that creating a LdOp was done for some other reason.

    Then again, not many people are going and being like "The A extension
    makes RISC-V no longer RISC".

    BECAUSE RISC-V is already not RISC (less than 200 instructions)...

    But, then again, there are people who go on about how "[Rm+Ro*Sc]"
    addressing is "Not RISC", nevermind that nearly every other RISC
    (besides RISC-V) had included it (whether or not they also included a
    way to explicitly encode the scale, or if the scale was baked into the instruction, *).

    It was accepted as RISC in Mc88100 ISA. {MIPS did not have, ...}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Marcus on Thu Dec 7 22:23:50 2023
    Marcus wrote:

    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The main
    instructions that seem to have "the most bang for the buck" are ones
    that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    I choose 200 as the upper bound since 100 is obviously too small
    {even though I get by with 61} and any vectorized or SIMDed ISA
    is way more than 200.

    I still struggle to find a good definition of "1 instruction".

    1 Instruction is 1 Spelling the assembly language programmer has to
    remember.

    For me
    the definition is loosely "one distinct operation", and so there can be
    many variants of one instruction (e.g. variants with register operands
    or register + immediate operands all count as a single instruction),
    that all carry out the same operation, but with different kinds of
    operands or operand sizes.

    I hold this same view.

    In my ISA I refer to these as "major instructions", and each
    instructions typically have several variants (currently up to 18
    variants per major instruction, where different permutations of scalar
    and vector register operands count as different variants of a single instruction, for instance).

    VVM makes this distinction unnecessary.

    If I count this way, I currently have 106 instructions, which by your definition safely puts MRISC32 in the "RISC" camp. However, if I count
    every variant as a separate instruction, I blow the budget.

    My 66000 has 61 instructions under this framework. This includes {flow
    control, Integer, Logical, Shift, Floating point, Transcendentals,
    conversions, privileged, vectorization, and SIMD}

    Also worth mentioning is that my current instruction encoding scheme
    allows for 1535 major instructions, so there is still plenty of room for extensions (even though I already have pretty complete integer, floating-point and vector support).

    My 66000 encoding scheme supports 2048 1-operand instructions at the consumption of 1 Major OpCode. Only the 3-operand subGroup is stressed
    for Minor OpCodes.


    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    My point was that it should not be redefined into meaninglessness.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Dec 8 02:36:04 2023
    BGB wrote:

    On 12/7/2023 4:34 PM, MitchAlsup wrote:
    BGB wrote:

    On 12/7/2023 2:14 PM, Marcus wrote:

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.


    Yeah.


    Load/Store, and doesn't use a "variable number of bytes" encoding
    scheme (like x86/Z80/6502 variants).

    Does variable number of words fit this criterion.


    Variable number of words is probably OK, otherwise Thumb2 and RVC would
    no longer be RISC...

    My point on VLEness is that all the position and length information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC than x86.

    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains RISCable if
    it does not violate other RISC tenets.

    Or, the 'R' could refer more to keeping instruction logic simple,
    rather than minimizing the number of instructions that can exist in
    the instruction listing.

    In the end it is how do you fit K instructions through your pipeline in
    fewer cycles than some on can fit 1.4×k instructions through their
    pipeline.


    I could probably save a number of instructions if BJX2 was not
    Load/Store, but worth it?...


    Say, without LDOP:
    MOV 16, R6
    MOV.L (R4, 0), R5
    ADD R5, R6, R5
    MOV.L R5, (R4, 0)

    Vs, with LDOP:
    ADDS.L 16, (R4, 0) //*

    This is actually an OP-ST.


    Or, maybe go further, and add, say:
    INC.L (R4)
    DEC.L (R4)
    ...

    This is actually a Ld-Op-ST not a LD-Op.

    -------------------------

    Well, and probably that it is viable to implement a CPU core for the
    entire ISA without needing a microcode ROM or similar.

    There is no microcode in My 66000 1-wide or 6-wide implementations.
    But there is no reason one could not build a My 66000 using microcode
    should that be the best choice for some implementation.

    It is probably not viable to build a {bug for bug} compatible x86
    without microcode.


    OK.


    Admittedly, I feel unease with instructions which violate the
    Load/Store model, which goes for both my experimental LDOP extension
    and the RISC-V 'A' extension (where essentially LDOP and 'A' represent
    the same basic CPU functionality).

    Though, it is "sort of passable" in that it is possible to implement
    these by shoving a minimal ALU into the L1 cache, rather than needing
    to restructure the whole pipeline (as would be needed for a more
    general x86-like model).

    It is SO EASY to track this dependency based on register forwarding
    that creating a LdOp was done for some other reason.


    ?...

    How do you think 1-wide in-order machines determine that stage 3 of the
    pipeline contains the required register value and that reading the
    register file will have been in vain ??

    It is called FORWARDING, no pipeline gets along without it. You can even
    split the LD part from the OP part from the ST part, or like Athlon, you
    can split the Ld-Op-ST into a triple firing reservation station, or like
    modern *-lake convert them into 3 µOps.
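
    A toy C++ model of that forwarding check, assuming a simple in-order
    pipeline where the EX and MEM stages may each hold a result that has
    not reached the register file yet; the structure and function names
    are invented for illustration:

    #include <cstdint>
    #include <optional>

    // One in-flight instruction further down the pipe that writes a register.
    struct InFlight {
        bool     writes_reg;    // produces a register result?
        unsigned dest;          // destination register number
        uint64_t value;         // the result, once computed
        bool     value_ready;   // an ALU result is ready in EX; a load's is not
    };

    // Forwarding mux for one source operand: prefer the youngest producer
    // (EX), then MEM, otherwise read the register file.  A producer whose
    // value is not ready yet (the load-use case) forces a stall, signalled
    // here by returning std::nullopt.
    std::optional<uint64_t> read_operand(unsigned src_reg,
                                         const uint64_t regfile[32],
                                         const InFlight& ex_stage,
                                         const InFlight& mem_stage) {
        if (ex_stage.writes_reg && ex_stage.dest == src_reg) {
            if (!ex_stage.value_ready) return std::nullopt;   // stall one cycle
            return ex_stage.value;                            // forward from EX
        }
        if (mem_stage.writes_reg && mem_stage.dest == src_reg) {
            if (!mem_stage.value_ready) return std::nullopt;
            return mem_stage.value;                           // forward from MEM
        }
        return regfile[src_reg];                              // no hazard
    }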

    Then again, not many people are going and being like "The A extension
    makes RISC-V no longer RISC".

    BECAUSE RISC-V is already not RISC (less than 200 instructions)...


    Fair enough.

    Ironically, if I want to support 'A', this means needing to have the
    'LDOP' extension enabled, even if I am not really a fan of the cost or implications of this mechanism...

    But, 'A' is needed for 'RV64G', which is, annoyingly, what would need to
    be supported to be able to have any hope of compatibility with the Linux
    on RISC-V ecosystem.


    The common superset of BJX2 and RV64G (at least for the userland side of things) is a bit more complexity than I would prefer though.

    Well, along with the annoyance of the CPU core having functionality that
    may exist in one ISA but not the other (and, don't want to port over everything from RISC-V, as this would pollute my own ISA with things
    that don't really fit my own vision).

    Implementation-by-implementation ISA differences in a non-upwards-compatible
    fashion are not good for consumers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Thu Dec 7 21:42:03 2023
    On 12/4/2023 12:34 PM, MitchAlsup wrote:
    BGB wrote:

    snip

    Or, say, people can find ways to make multi-threaded programming not
    suck as much (say, if using an async join/promise and/or channels
    model rather than traditional multi-threading with shared memory and
    synchronization primitives).

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    I am not sure what you are proposing here. While Verilog is fine for
    the domain it was designed for (a domain specific language?), it isn't
    suitable for most other things, e.g. you couldn't easily write say a
    compiler in Verilog. There are other domains where various languages
    have helped ease development of parallel programs, but they are also
    domain specific, e.g. some simulation languages, and heck, even COBOL
    had an (optional) parallel processing capability. There have also been
    various attempts to create general purpose languages that do that, e.g. dataflow languages, OCCAM, I think Ada, but AFAIK, none has been hugely successful.



    BTW, embarrassingly parallel applications usually aren't much
    of a problem, as, pretty much by definition, they have little to no
    interaction between the threads/processes (whatever you call them).

    https://en.wikipedia.org/wiki/Embarrassingly_parallel

    so you can easily fire off as many copies as you need. But change that
    to distributed computing problems, and I agree.

    So, do you have a proposal for a general purpose language that makes development of distributed computing problems easier?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to MitchAlsup on Fri Dec 8 09:44:17 2023
    On 2023-12-07, MitchAlsup wrote:
    Marcus wrote:

    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:
    [snip]
    Though, apparently, someone posted something recently showing RV64
    and ARM64 to be much closer than expected, which is curious. The
    main instructions that seem to have "the most bang for the buck"
    are ones that ARM64 has equivalents of.

    "An Empirical Comparison of the RISC-V and AArch64 Instruction
    Sets" (Daniel Weaver and Simon McIntosh-Smith, 2023) used five
    benchmarks, four scientific and STREAM. Just the fact that STREAM
    did not use FMADD for the TRIAD portion slightly penalized AArch64
    (though RISC-V will presumably add FMADD if it has not already).
    SIMD was excluded based on the reasonable point that RISC-V has
    not yet standardized its SIMD extension and "comparing the
    different vector instruction sets across AArch64 and RISC-V is
    beyond the scope of this initial comparison".

    I rather suspect these benchmarks do not provide a good basis for
    ISA design targeting minimum path length (much less performance).

    Both ARM and RISC-V require close to 40% more instructions than My
    66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be
    considered RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    I choose 200 as the upper bound since 100 is obviously too small
    {even though I get by with 61} and any vectorized or SIMDed ISA
    is way more than 200.

    I still struggle to find a good definition of "1 instruction".

    1 Instruction is 1 Spelling the assembly language programmer has to
    remember.

    That is my view too. Some examples:


    BZ (branch if zero), 1 variant:

    bz r3, #foo@pc


    CLZ (count leading zeros), 6 variants:

    clz r2, r1 // scalar
    clz.b r2, r1 // scalar, packed bytes
    clz.h r2, r1 // scalar, packed half-words
    clz v2, v1 // vector
    clz.b v2, v1 // vector, packed bytes
    clz.h v2, v1 // vector, packed half-words

    AND (bitwise and), 18 variants:

    and r3, r1, r2 // scalar
    and.pn r3, r1, r2 // scalar, r1 & ~r2
    and.np r3, r1, r2 // scalar, ~r1 & r2
    and.nn r3, r1, r2 // scalar, ~r1 & ~r2
    and v3, v1, r2 // vector/scalar
    and.pn v3, v1, r2 // vector/scalar, v1 & ~r2
    and.np v3, v1, r2 // vector/scalar, ~v1 & r2
    and.nn v3, v1, r2 // vector/scalar, ~v1 & ~r2
    and v3, v1, v2 // vector
    and.pn v3, v1, v2 // vector, v1 & ~v2
    and.np v3, v1, v2 // vector, ~v1 & v2
    and.nn v3, v1, v2 // vector, ~v1 & ~v2
    and/f v3, v1, v2 // folding vector
    and.pn/f v3, v1, v2 // folding vector, v1 & ~v2
    and.np/f v3, v1, v2 // folding vector, ~v1 & v2
    and.nn/f v3, v1, v2 // folding vector, ~v1 & ~v2
    and r3, r1, #im // scalar immediate
    and v3, v1, #im // vector/scalar immediate

    Some of the variants above are superfluous (at least three AND variants
    are useless and the value of a couple more can be debated), but I can
    live with that. The symmetry and ease of encoding/decoding makes up for
    the potential loss of encoding space (of which there is plenty left).

    For me
    the definition is loosely "one distinct operation", and so there can be
    many variants of one instruction (e.g. variants with register operands
    or register + immediate operands all count as a single instruction),
    that all carry out the same operation, but with different kinds of
    operands or operand sizes.

    I hold this same view.

    In my ISA I refer to these as "major instructions", and each
    instructions typically have several variants (currently up to 18
    variants per major instruction, where different permutations of scalar
    and vector register operands count as different variants of a single
    instruction, for instance).

    VVM makes this distinction unnecessary.

    If I count this way, I currently have 106 instructions, which by your
    definition safely puts MRISC32 in the "RISC" camp. However, if I count
    every variant as a separate instruction, I blow the budget.

    My 66000 has 61 instructions under this framework. This includes {flow control, Integer, Logical, Shift, Floating point, Transcendentals, conversions, privileged, vectorization, and SIMD}

    Also worth mentioning is that my current instruction encoding scheme
    allows for 1535 major instructions, so there is still plenty of room for
    extensions (even though I already have pretty complete integer,
    floating-point and vector support).

    My 66000 encoding scheme supports 2048 1-operand instructions at the consumption of 1 Major OpCode. Only the 3-operand subGroup is stressed
    for Minor OpCodes.


    I have plenty of space left for 1-register-operand (99%) and 2-register-operands (84%) instructions, however since I encode
    immediates as part of the instruction word (unlike My 66000), the
    immediate versions are crowded. In fact the 21-bit immediate
    instructions are all used up (all seven of them). OTOH I'm pretty
    content with the ones that I have, as they cover quite some ground in
    terms of usefulness (e.g. they provide PC-relative load/store/call/jump
    with a range of +/-4MiB in a single 32-bit instruction).


    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    My point was that it should not be redefined into meaninglessness.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Marcus on Fri Dec 8 14:40:35 2023
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000.
    So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of
    different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    In practice, though I think RISC vs CISC is more often used to
    distinguish between positions in a range of tradeoffs common in ISA
    design, such as :

    * fixed-size, fixed-format instruction codes vs variable encodings
    * many orthogonal registers vs fewer specialised registers
    * load/store vs advanced addressing modes
    * "one thing at a time" vs combing common tasks in one instruction

    But there's no clear boundaries. The original 68k architecture was
    always classified as "CISC". Then the later ColdFire versions were
    called "Variable instruction length RISC", though there was a 90%
    overlap in the ISA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Fri Dec 8 15:19:47 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000. So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered
    RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Fri Dec 8 15:38:52 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)

    It's fundamental nonsense, because:

    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions.

    Yes. John Mashey made this point in his repeated posts on this topic.

    The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode.

    Somewhat: "Single-cycle" is a microarchitectural property, not an ISA
    property, but yes, the idea of the first RISCs was that the ISA should
    be implementable with such a microarchitecture.

    Also, single-cycle means the issue rate on a pipelined processor.
    There were many RISC implementations that needed two cycles of latency
    for loads. And likewise, FP instructions needed multiple cycles of
    latency. And finally, the MIPS R2000 integer multiplier and divider
    was not even pipelined (but could run in parallel with the rest of the
    integer pipeline).

    There have been attempts at splitting, e.g. FP instructions into their
    parts (align, add, normalize or somesuch) as a RISCier way to do
    things, but it never was implemented in a mainstream processor. What
    has been implemented in mainstream processors:

    * no integer multiplier/divider (SPARC, HPPA, no divide on Alpha and
    IA-64), instead go for multiply step, do it in the FPU, or implement
    division through subtraction (Alpha) or fma (IA-64)).

    * no 8-bit or 16-bit memory access: eliminates a part of the aligner
    from the load data path, eliminates ECC problems for write-back
    caches (but no Alpha implementation without BWX extension had such
    problems).

    What is common in RISCs is to split large constants into sequences of instructions (e.g., for loading the constant from the global table).
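
    As an illustration of that last point (my own sketch, not tied to any
    specific ISA), this is the usual way a 32-bit constant gets split into a
    "load upper bits" part plus a signed 12-bit addend, as a lui/addi-style
    pair would materialize it; the 20/12 split is just an example:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t value = 0x12345678u;

        /* Sign-extend the low 12 bits, then let the upper part absorb the rest. */
        int32_t  lo = (int32_t)((value & 0xFFFu) ^ 0x800u) - 0x800;
        uint32_t hi = value - (uint32_t)lo;   /* low 12 bits of hi are zero */

        assert(lo >= -2048 && lo <= 2047);
        assert((hi & 0xFFFu) == 0);
        assert(hi + (uint32_t)lo == value);
        printf("hi=0x%08X lo=%d\n", (unsigned)hi, lo);
        return 0;
    }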

    I guess I forgot a few.

    You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Power(PC) is also an example of how moot it is to count instructions.
    It has, e.g., load instructions with and without update, which
    correspond to one load instruction with different addressing modes in
    ARM A64.

    In practice, though I think RISC vs CISC is more often used to
    distinguish between positions in a range of tradeoffs common in ISA
    design, such as :

    * fixed-size, fixed-format instruction codes vs variable encodings

    Many RISCs use variable-size instruction encodings, e.g., ROMP, ARM
    A32/T32 and RISC-V with the C extension.

    * many orthogonal registers vs fewer specialised registers

    VAX (the exemplary CISC) has 16 registers, like ARM A32 (first
    generation RISC).

    * load/store vs advanced addressing modes

    That's not a dichotomy. Many load/store architectures have more
    addressing modes than, e.g., AMD64 (not a load/store architecture);
    e.g. ARM A64. Power(PC), HPPA, and 88000 also have at least as many
    as AMD64.

    The dichotomy is between load/store and non-load/store architectures.
    And that's how I usually distinguish between RISC and CISC.

    However, it seems that a bigger issue is: one vs. multiple memory
    references per instruction. The VAX has multiple, which complicates
    many things, whereas (for the most part) AMD64 and load/store
    architectures have only one. There is MOVS and REP MOVS for AMD64,
    and there is ARM A32 and Power load/store multiple instructions, which
    require special treatment.

    One interesting aspect here is that modern general-purpose
    architectures all support unaligned accesses, which may require
    accessing two different cache lines and even two different pages.
    Once you support that, the load-pair/store-pair instructions of ARM
    A64 does not make loads and stores more complicated (but it
    complicates register porting).
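
    A small sketch of the check this implies in a load/store unit (my own
    illustration; the line and page sizes are merely typical values): an
    access crosses a boundary exactly when its offset within the aligned
    block plus its size spills past the block:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64u     /* assumed cache-line size */
    #define PAGE_BYTES 4096u   /* assumed page size */

    static bool crosses(uint64_t addr, unsigned size, unsigned boundary)
    {
        return (addr & (boundary - 1)) + size > boundary;
    }

    int main(void)
    {
        printf("%d\n", crosses(0x103E, 8, LINE_BYTES));  /* 1: touches two cache lines */
        printf("%d\n", crosses(0x1FFC, 8, PAGE_BYTES));  /* 1: touches two pages */
        printf("%d\n", crosses(0x1008, 8, LINE_BYTES));  /* 0: stays within one line */
        return 0;
    }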

    * "one thing at a time" vs combing common tasks in one instruction

    RISCs have done so for FP instructions, addressing modes, and by
    putting multiply and divide instructions in.

    But there's no clear boundaries.

    There are clear boundaries between load/store and other (the classical
    RISC boundary), and between lots of other properties of instruction
    sets. Of course marketing people and advocates have tried to claim
    RISCness when it was cool to be a RISC. An example can be found here:

    The original 68k architecture was
    always classified as "CISC". Then the later ColdFire versions were
    called "Variable instruction length RISC", though there was a 90%
    overlap in the ISA.

    Is Coldfire a load/store architecture? If not, it's not a RISC.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Fri Dec 8 17:40:06 2023
    David Brown wrote:

    On 07/12/2023 21:14, Marcus wrote:


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode.

    Why should::
    ADD R7,R8,#0x123456789abcdef
    take any longer to execute than::
    ADD R7,R8,R9
    ???

    You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    In practice, though I think RISC vs CISC is more often used to
    distinguish between positions in a range of tradeoffs common in ISA
    design, such as :

    * fixed-size, fixed-format instruction codes vs variable encodings
    * many orthogonal registers vs fewer specialised registers
    * load/store vs advanced addressing modes

    Like::
    lui a0, %hi(.LCPI10_2)
    ld a0, %lo(.LCPI10_2)(a0)
    instead of::
    LD R7,[IP,,.LCPT10_2]

    * "one thing at a time" vs combing common tasks in one instruction

    Like:
    fmv.x.d a0, ft6
    call log@plt
    fmv.d.x ft0, a0
    instead of::
    LOG R7,R9

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Dec 8 17:42:31 2023
    Scott Lurndal wrote:

    David Brown <david.brown@hesbynett.no> writes:


    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the whole, be single-cycle and implemented directly in the hardware, rather than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    PDP-8 fails to be RISC because it does not have a large number of GPRs.
    Does not really have LDs (has only a LD-Op).

    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Fri Dec 8 19:56:00 2023
    On 08/12/2023 16:19, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000. So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of
    different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    There was a microcontroller that we once considered for a project, which
    had only a single instruction - "move". We ended up with a different
    chip, so I never got to play with it in practice.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Fri Dec 8 20:06:22 2023
    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you wrote.)


    Is Coldfire a load/store architecture? If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store architecture" and a "non-load/store architecture". And I agree that it
    is usually a more important distinction than the number of instructions,
    or the complexity of the instructions, or any other distinctions.

    But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?
    Things have changed a lot since the term "RISC" was first coined, and
    maybe architectural and ISA features are so mixed that the terms "RISC"
    and "CISC" have lost any real meaning. If that's the case, then we
    should simply talk about LSA and NLSA architectures, and stop using
    "RISC" and "CISC". I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Fri Dec 8 19:39:43 2023
    David Brown wrote:

    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you wrote.)


    Is Coldfire a load/store architecture? If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store architecture" and a "non-load/store architecture". And I agree that it
    is usually a more important distinction than the number of instructions,
    or the complexity of the instructions, or any other distinctions.

    Would CDC 6600 be considered to have a LD/ST architecture ??

    But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?
    Things have changed a lot since the term "RISC" was first coined, and

    It HAS been 43 years since being coined.

    maybe architectural and ISA features are so mixed that the terms "RISC"
    and "CISC" have lost any real meaning. If that's the case, then we
    should simply talk about LSA and NLSA architectures, and stop using
    "RISC" and "CISC". I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Fri Dec 8 19:50:34 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 08/12/2023 16:19, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 07/12/2023 21:14, Marcus wrote:
    On 2023-12-06, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/24/23 9:49 PM, BGB wrote:


    Both ARM and RISC-V require close to 40% more instructions than My 66000. So much for minimum path lengths.
    AND, no ISA with more than about 200 instructions should be considered RISC,

    I wonder if 200 is a fundamental constant for RISC vs CISC ;-)



    Both RISC-V and AArch64 are RISC-oriented,

    Under a perverted view of what the R in RISC stands for.

    I think that "RISC" much more commonly refers to a "load/store
    architecture where all instruction fit well in a pipelined
    design". That is, the "R" is very misleading.

    [snip]


    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of
    different instructions. The idea was that instructions should, on the
    whole, be single-cycle and implemented directly in the hardware, rather
    than multi-cycle using sequencers or microcode. You could have as many
    as you want, and they could be as complicated to describe as you want,
    as long as they were simple to implement. (I've worked with a few
    PowerPC microcontrollers - PowerPC is considered "RISC", but it has a
    /lot/ of instructions!)

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple
    (sequential) operations.

    000 - AND - AND the memory operand with AC.
    001 - TAD - Two's complement ADd the memory operand to <L,AC> (a 12 bit signed value (AC) w. carry in L).
    010 - ISZ - Increment the memory operand and Skip next instruction if result is Zero.
    011 - DCA - Deposit AC into the memory operand and Clear AC.
    100 - JMS - JuMp to Subroutine (storing return address in first word of subroutine!).
    101 - JMP - JuMP.
    110 - IOT - Input/Output Transfer (see below).
    111 - OPR - microcoded OPeRations (see below).


    https://en.wikipedia.org/wiki/PDP-8#Instruction_set

    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few >instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    Given the age of the PDP-8, I'd argue that the instructions
    are anything but complex. Leaving aside the optional EAE extension
    which provided multiplication and division.

    A load (hooked to the adder), a store, and a few logic operations.

    The IOT instruction is effectively an MMIO operation, as in
    the instruction was put on the bus and the I/O controller
    responded appropriately as if it were a load or store operation.

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Likewise, the complexity that RISC was attempting to address
    were instructions like the Vax POLY, MOVC3/MOCV5 and the
    queuing instructions (insert & remove).

    The entire RISC vs CISC argument seems somewhat contrived
    in these modern times.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to MitchAlsup on Fri Dec 8 21:17:37 2023
    On 08/12/2023 20:39, MitchAlsup wrote:
    David Brown wrote:

    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you
    wrote.)


    Is Coldfire a load/store architecture?  If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store
    architecture" and a "non-load/store architecture".  And I agree that
    it is usually a more important distinction than the number of
    instructions, or the complexity of the instructions, or any other
    distinctions.

    Would CDC 6600 be considered to have a LD/ST architecture ??

    I don't know - that was /long/ before my time!


    But does that mean LSA vs. NLSA should be used to /define/ RISC vs
    CISC?   Things have changed a lot since the term "RISC" was first
    coined, and

    It HAS been 43 years since being coined.

    maybe architectural and ISA features are so mixed that the terms
    "RISC" and "CISC" have lost any real meaning.  If that's the case,
    then we should simply talk about LSA and NLSA architectures, and stop
    using "RISC" and "CISC".  I don't think trying to redefine "RISC" to
    mean something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Dec 8 23:11:28 2023
    Scott Lurndal wrote:


    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions,
    but it certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Likewise, the complexity that RISC was attempting to address
    were instructions like the Vax POLY, MOVC3/MOCV5 and the
    queuing instructions (insert & remove).

    CALL, RET, and EDIT were nightmares to pipeline, too.
    But above that:: VAX address modes prevented pipelining.

    The entire RISC vs CISC argument seems somewhat contrived
    in these modern times.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Fri Dec 8 23:14:08 2023
    David Brown wrote:

    On 08/12/2023 20:39, MitchAlsup wrote:
    David Brown wrote:

    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you
    wrote.)


    Is Coldfire a load/store architecture?  If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store
    architecture" and a "non-load/store architecture".  And I agree that
    it is usually a more important distinction than the number of
    instructions, or the complexity of the instructions, or any other
    distinctions.

    Would CDC 6600 be considered to have a LD/ST architecture ??

    I don't know - that was /long/ before my time!

    If you wrote into A1..A5 then X1..X5 was loaded from memory
    If you wrote into A6..A7 then X6..X7 was stored into memory

    Peripheral processors (I/O controllers) performed the job of the OS,
    leaving the CPUs strictly for number crunching.


    But does that mean LSA vs. NLSA should be used to /define/ RISC vs
    CISC?   Things have changed a lot since the term "RISC" was first
    coined, and

    It HAS been 43 years since being coined.

    maybe architectural and ISA features are so mixed that the terms
    "RISC" and "CISC" have lost any real meaning.  If that's the case,
    then we should simply talk about LSA and NLSA architectures, and stop
    using "RISC" and "CISC".  I don't think trying to redefine "RISC" to
    mean something different from its original purpose helps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sat Dec 9 10:09:45 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    Surely then, the PDP-8 can be counted as a RISC processor.

    I don't count it as a RISC, because it's too different from the
    architectures that are commonly seen as RISCs:

    1. It is not a load-store architecture
    5. It does not have 16 or more general-purpose registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Sat Dec 9 10:19:12 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Would CDC 6600 be considered to have a LD/ST architecture ??

    If I understand the description right, the load or store happen as
    side effects of an operation that writes to A1..A7. And it only loads
    to X1..X5 and stores from X6..X7. If somebody says that an
    architecture is a load-store architecture, I certainly do not expect
    such restrictions; I actually expect a register machine (i.e., with
    GPRs), but the CDC-6600 has three sets of special-purpose registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to A. Clayton on Sat Dec 9 21:07:00 2023
    In article <ul2bu6$2a7gb$4@dont-email.me>, paaronclayton@gmail.com (Paul
    A. Clayton) wrote:

    For Itanium, binary translation provided better performance on
    the same hardware, so it was more evident that the compatibility
    had a mediocre performance target.

    From memory of conversations with Intel people in 1997-2000, they thought
    the hardware-provided compatibility would be faster than it turned out.
    That suggests they expected higher clockspeeds, and that the translator
    would make effective use of Itanium bundled instructions.

    I accidentally benchmarked the 667MHz Merced running IA-32 code, by
    selecting the wrong build tree, and it was about a third of the
    performance of optimised native code. That suggests little use of bundled instructions.

    The other reason the IA-32 emulation seemed so slow was that IA-32
    performance standards rose considerably while Merced was being designed
    and built. That was due to the clockspeed war between Pentium II/III and
    AMD's Athlon. That disrupted Intel's plans for a slower clockspeed ramp,
    and Intel's response gave the world NetBurst.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sun Dec 10 10:56:36 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    But it is the job of the programmer to keep everything that can be parallel in
    mind... Would you write a compiler, or a word processor, in Verilog?
    How much harder would that be, compared to a serial language?

    My personal favorites for parallel programming are PGAS languages
    like (yes, you guessed it) Fortran, where the central data
    structure is the coarray.

    On image X, you can manipulate data on your own image, and you can
    access data on another image (let's call it Y) in these coarrays via
    special syntax, as a[Y].

    You have to make sure that you synchronize before accessing data
    that has been modified on another image.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sun Dec 10 10:39:31 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).

    Formula _numbering_ - now that, Microsoft managed to make worse
    (which simply comes naturally in LaTeX).

    And, come to think of it, since Office 365 (I think) they now
    allow direct use of svg files as graphics, allowing two
    non-braindead ways of including pdf graphics in Word - either
    via Inkscape (read as pdf, write as svg) or through command-line
    tools (usually via Cygwin).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Sun Dec 10 10:51:03 2023
    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Tim Rentsch on Sun Dec 10 19:16:02 2023
    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.


    So what we can take from this is that RISC as a term has become meaningless.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Dec 10 20:09:44 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.


    So what we can take from this is that RISC as a term has become meaningless.

    Or that it never had meaning, in the sense you're looking for.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Tim Rentsch on Mon Dec 11 18:28:09 2023
    On Sun, 10 Dec 2023 10:51:03 -0800, Tim Rentsch wrote:

    Of course the PDP-8 is a RISC. These properties may have been common
    among some RISC processors, but they don't define what RISC is. RISC is
    a design philosophy, not any particular set of architectural features.

    I can't agree.

    Your final sentence may be true enough, but I think that the architectural feature of being a load-store architecture is very much indicative of
    whether the RISC design philosophy was being followed. Of course, it isn't absolutely _decisive_, as Concertina II demonstrates.

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Mon Dec 11 22:47:56 2023
    On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    If you want multi-threaded programs to succeed you need to start writing
    them in a language that is not inherently serial !! It is brain dead
    easy to write embarrassingly parallel applications in a language like
    Verilog. The programmer does not have to specify when or where a gate
    is evaluated--that is the job of the environment (Verilog).....

    But it is the job of the programmer to keep everything that can be parallel in mind... Would you write a compiler, or a word processor, in Verilog?
    How much harder would that be, compared to a serial language?

    There are (at least) 3 problems:

    1: most programming languages are predominantly serial and their
    support for parallelism (relatively) is poor.

    2: many programmers are much better at figuring out what CAN be
    done in parallel than they are at figuring out what SHOULD be
    done in parallel. The result often is too many threads each
    making very little progress.

    3: the skill level of the average programmer now is only slightly
    above "novice". More software is being written now than ever
    before, but the vast majority of it is poor quality.


    Better languages can help, but "better" in my view does not include C.


    My personal favorites for parallel programming are PGAS languages
    like (yes, you guessed it) Fortran, where the central data
    structure is the coarray.

    Mileage varies considerably and I don't intend to start a language
    war: a lot has to do with the history of parallel applications a
    person has developed. I can respect your point of view even though I
    don't agree with it.

    My favorite model is CSP (ala Hoare) with no shared memory. Which is
    not to say I don't use threads, but I try to design programs such that
    threads (mostly) are not sharing writable data structures.
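
    As a rough sketch of that style (my own example, not George's code), two
    threads in C connected by a pipe rather than by a shared writable
    structure; the pipe plays the role of a CSP channel here:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int chan[2];   /* pipe: chan[0] = read end, chan[1] = write end */

    static void *consumer(void *arg)
    {
        (void)arg;
        int64_t v, sum = 0;
        /* Receive messages until the producer closes its end of the "channel". */
        while (read(chan[0], &v, sizeof v) == (ssize_t)sizeof v)
            sum += v;
        printf("sum = %lld\n", (long long)sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        if (pipe(chan) != 0)
            return 1;
        pthread_create(&t, NULL, consumer, NULL);
        for (int64_t i = 1; i <= 100; i++)    /* send 100 messages */
            write(chan[1], &i, sizeof i);
        close(chan[1]);                       /* closing the write end ends the consumer */
        pthread_join(t, NULL);
        close(chan[0]);
        return 0;
    }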


    You have to make sure that you synchronize before accessing data
    that has been modified on another image.

    And that's where most languages fall down: they provide either
    primitives which are too low level and too hard to use correctly, or
    they provide high level mechanisms that don't scale well and are too
    limiting in actual use.

    Which goes back to #3 above. Repeated studies have shown that most
    programmers can't write correct parallel code to operate on shared
    data structures. The results are congruent with, but even worse than,
    the studies on memory management which showed most programmers had
    trouble with manual (de)allocation of shared structures.

    I've been programming for 40 years now, and I have yet to see a
    language that I would want to hand to a novice intending to write
    parallel code. I've seen what I think have been some good approaches,
    but the languages involved: Lisps/Schemes, functional, and constraint
    logic languages ... are just too different for many people to grasp.

    Again, MMV.
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 12 23:07:04 2023
    Paul A. Clayton wrote:

    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]
    My point on VLEness is that all the position and length
    information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC
    than x86.

    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains
    RISCable if it does not violate other RISC tenets.

    I would have guessed that encoding a length to next length
    indicator would also be somewhat simple for decode when the
    additional chunk contains only opcode extension that does not
    affect routing (which includes some hints) or immediate data. In
    terms of parsing the instruction stream into instructions, such is
    not different than having a nop with the length specified in *its*
    first container. [see ENDNOTE]

    My VLE encoding (4-bits) deals with constants (±5-bits, 32-bits, 64-bits)
    and operand sign control {rs1,rs2..rs1,-rs2..-rs1,rs2..-rs1,-rs2}
    The trick is finding where to place these bits
    such that the same bits are used in {1-operand, 2-operand, 3-operand,
    and memory references.} This means you can decode them prior to
    determining the instruction subGroup. And you cannot move a 5-bit
    register specifier.....

    My 66000's instruction modifiers seem to add some decoding
    complexity in that bits of the container are distributed to the
    following instructions (which may themselves be variable length);
    clearly, this is considered acceptable complexity. I think a
    DOUBLE prefix was also proposed (architected? it was not in the 28
    Jan 2020 version that I have) that encoded additional operands
    into the prefix, forming a kind of explicit instruction fusion.

    Yes, I toyed with a DBLE instruction. Its job was to give 3 more
    operands and 1 more result register in support of 128-bit register calculations and memory references. It can be resurrected if desired.
    But I don't think there is currently enough demand for 128-bit except
    in market niches so small that I am not interested.

    (I have a suspicion that a large-chunk instruction encoding with
    borrow-lend across chunks could facilitate code density while
    providing some of the advantages of fixed-length encoding. I have
    not thought about this deeply, but I sense there may be problems
    with allowing arbitrary bits to be borrowed. Limiting such to
    immediates might reduce the exposure to danger. However, I
    suspect that more emphasis should be on targeting an OoO
    implementation than on code density.)

    === ENDNOTE ===
    An encoding with multiple length specifiers might theoretically
    reduce the overhead of encoding the length for the more common
    short cases — perhaps by an entire bit!!!!☺ — but in addition to increasing the size overhead for longer instructions it would
    split large fields by inserting the length extension specifier.
    The extra size overhead also means that 32-bit and 64-bit
    immediates could further bloat the instruction by requiring an
    additional parcel for only a few bits. One bit of length
    information effectively becomes a marker bit per parcel, which is
    a technique that was used for some x86 pre-decoded caches and for
    one ISA that encoded immediates.

    If there were 16 instruction lengths, perhaps a split length
    specifier *might* make sense, but My 66000's five instruction
    lengths obviously does not take that much space. I believe the
    lengths are not fully orthogonal, so it does not take 2.32 bits.
    (I think only a store can be longer than 3 parcels, though perhaps
    some 3-input compute operations might be theoretically able to use
    two constant inputs.)

    My 66000 uses 4-bits, you found 2.32-bits of utility, another 1.2
    bits of utility are used to allow 5-bit constants replacing the
    5-bit register specifier; and the rest are sign control over the
    operands. The 4 bits are completely used {16 patterns; only 2
    are lightly used}.

    Yet if the extra bits are not on a critical path (such as register specifiers) such a clunkier mechanism might not be so horrible.
    Maybe only 0.2 x86s (the unit of measurement for ISA horror☺).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Dec 13 04:58:13 2023
    On Tue, 05 Dec 2023 01:08:01 +0000, MitchAlsup wrote:

    Gallium Arsenide is 5×; hideously expensive, dangerous to the workers in
    the FAB, and chemical disposal, low yield,.....

    If the yield is _so_ low that they can't make anything bigger than an
    8086 out of it, then of course that means they're losing more than the
    5x gain.

    But if they could make a Pentium Pro or Pentium II out of Gallium Arsenide,
    and get, say, a 10% yield, I'm sure that government (specifically military) users would be happy to pay the price premium for it. Even if it's
    exorbitant, like $20,000 per processor.

    I'm not saying that it would be for *everybody*.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Dec 13 05:09:52 2023
    On Tue, 12 Dec 2023 23:07:04 +0000, MitchAlsup wrote:

    My VLE encoding (4-bits) deals with constants (±5-bits, 32-bits,
    64-bits)
    and operand sign control {rs1,rs2..rs1,-rs2..-rs1,rs2..-rs1,-rs2}
    The trick is finding where to place these bits such
    that the same bits are used in {1-operand, 2-operand, 3-operand,
    and memory references.} This means you can decode them prior to
    determining the instruction subGroup. And you cannot move a 5-bit
    register specifier.....

    Of course, Concertina II solves this problem too, even if it does
    so in a way which you believe to be the wrong way. But of course it
    can also be solved in a relatively simple way with an encoding of
    the first few bits of a variable-length instruction - the trick to
    keeping it simple would be to have _two_ sets of prefix bits, because
    there are only a few lengths for instructions, and a few lengths
    for constants used as immediates - it's only the _combinations_ that
    get out of control.

    And you're already using that trick, IIRC, so you don't need to
    take any lessons from the monstrosity that is Concertina II.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Dec 13 12:31:50 2023
    On Wed, 13 Dec 2023 05:09:52 +0000, Quadibloc wrote:

    the trick to keeping it simple
    would be to have _two_ sets of prefix bits, because there are only a few lengths for instructions, and a few lengths for constants used as
    immediates - it's only the _combinations_ that get out of control.

    Actually, there is one other thing. So that the instructions for which immediates are used can have one consistent format, unlike the prefixes
    for instruction length, which can have different numbers of bits, the
    prefixes for constant length need to all be the same number of bits in
    length.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Dec 13 13:14:40 2023
    On Wed, 13 Dec 2023 12:31:50 +0000, Quadibloc wrote:

    On Wed, 13 Dec 2023 05:09:52 +0000, Quadibloc wrote:

    the trick to keeping it simple would be to have _two_ sets of prefix
    bits, because there are only a few lengths for instructions, and a few
    lengths for constants used as immediates - it's only the _combinations_
    that get out of control.

    Actually, there is one other thing. So that the instructions for which immediates are used can have one consistent format, unlike the prefixes
    for instruction length, which can have different numbers of bits, the prefixes for constant length need to all be the same number of bits in length.

    I went back and checked. When I proposed what a variable-length coding
    for the Concertina instruction set would look like, at first glance it
    seemed as though I didn't follow that rule:

    Something like:

    0 - 16 bits
    1 - 32 bits, except
    111011001 32 bits + 16 bits
    111011010 32 bits + 32 bits
    111011011 32 bits + 64 bits
    1110111000 32 bits + 48 bits
    1110111001 32 bits + 32 bits
    1110111010 32 bits + 64 bits
    1110111011 32 bits + 128 bits
    11110 - 48 bits
    11111 - 64 bits

    But notice that the lengths 32 and 64 bits appear twice. So this
    actually was intended to keep the instruction format constant;
    the prefixes that were one bit longer were for floating-point
    instructions. While both integer and floating-point instructions
    use register banks of 32 registers, the difference is that there
    are two load-store instructions for floats - LOAD and STORE -
    for integers there are also (where the integer type is shorter than
    the register) LOAD UNSIGNED (zero out all unused higher bits) and
    INSERT (leave all higher bits unaffected) in addition to LOAD
    (sign extend into all unused bits of the register more significant
    than the leading bit of the argument type).
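
    A minimal sketch of those three integer-load behaviours (my own
    illustration, using a 16-bit memory operand and a 32-bit register purely
    as an example; the function names are just labels):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t load_signed  (uint16_t mem)               { return (uint32_t)(int32_t)(int16_t)mem; }
    static uint32_t load_unsigned(uint16_t mem)               { return (uint32_t)mem; }
    static uint32_t insert       (uint32_t reg, uint16_t mem) { return (reg & 0xFFFF0000u) | mem; }

    int main(void)
    {
        uint32_t reg = 0xAAAAAAAAu;
        uint16_t mem = 0x8001u;   /* negative if viewed as a signed 16-bit value */

        printf("LOAD          -> %08X\n", (unsigned)load_signed(mem));    /* FFFF8001: sign-extended   */
        printf("LOAD UNSIGNED -> %08X\n", (unsigned)load_unsigned(mem));  /* 00008001: zero-extended   */
        printf("INSERT        -> %08X\n", (unsigned)insert(reg, mem));    /* AAAA8001: upper bits kept */
        return 0;
    }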

    So I did know of that principle, and was following it, with a
    slight customization for the specifics of the Concertina II
    instruction set.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Wed Dec 13 15:42:43 2023
    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formulas, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked to nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX. Then I could make reasonable-looking documents for customers that insisted on having docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added to
    word processors for decades. Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /
    LibreOffice, IIRC). And then when styles, numbering, templates and pdf
    export got good enough that you could make pdfs with real table of
    contents, clickable links, etc., so that word processed documents could
    look almost professional.

    Apart from that, the only benefits I see of newer LibreOffice over older
    ones is better handling of the insane chaos that MS Office uses for its
    file formats. LibreOffice is /much/ better at this than MS Office is, especially if the file has been modified by a number of different MS
    Office versions.

    Formula _numbering_ - now that, Microsoft managed to make worse
    (which simply comes naturally in LaTeX).

    And, come to think of it, since Office 365 (I think) they now
    allow direct use of svg files as graphics, allowing two
    non-braindead ways of including pdf graphics in Word - either
    via Inkscape (read as pdf, write as svg) or through command-line
    tools (usually via Cygwin).

    LibreOffice has had that for ages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Paul A. Clayton on Wed Dec 13 09:47:39 2023
    Paul A. Clayton wrote:
    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]
    My point on VLEness is that all the position and length information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC than x86.

    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains RISCable
    if it does not violate other RISC tenets.

    I would have guessed that encoding a length to next length
    indicator would also be somewhat simple for decode when the
    additional chunk contains only opcode extension that does not
    affect routing (which includes some hints) or immediate data. In
    terms of parsing the instruction stream into instructions, such is
    not different than having a nop with the length specified in *its*
    first container. [see ENDNOTE]

    If one designs the ISA on the assumption that there will be separate
    stages for Fetch and Decode, and I think that's a good idea,
    then there are two parses taking place, the external inter-instruction
    parse performed by Fetch, and internal instruction field parse by Decode.

    The Fetch length parse needs to be simple *except* that Fetch needs to
    be able to pick off all conditional and unconditional branch, call, ret,
    and consult the branch predictors for which-path information.

    Additionally for BRcc/CALL Fetch needs access to the branch offset,
    which is an internal parse, to add to its future RIP
    in case the predictor says to follow the alternate path.
    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    Everything else is an internal parse by Decode where it is mostly a matter
    of chopping things up. For instructions with immediates, RISC-V designers
    seemed to be very concerned about sign/zero extension delay and the
    location of the sign bit but I'm not sure why - to me it looks like a
    single mux delay at the end. And if all immediates are parsed by Fetch,
    because it needs the BR/CALL offset, then these might arrive in the
    Decode input buffer already parsed and sign extended.
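
    A minimal sketch of that "single mux at the end" view (the function and
    type names here are illustrative, not from any of the ISAs discussed):
    every candidate immediate width is sign extended in parallel from the
    raw instruction bits, and the decoded size only has to select one result.

    #include <stdint.h>

    enum ImmSize { IMM16, IMM32, IMM64 };

    /* The three extensions are effectively parallel wires in hardware;
       the decoded size drives one final multiplexer. */
    static int64_t select_immediate(uint64_t raw, enum ImmSize size)
    {
        int64_t ext16 = (int16_t)(uint16_t)raw;
        int64_t ext32 = (int32_t)(uint32_t)raw;
        int64_t ext64 = (int64_t)raw;

        switch (size) {              /* the single mux at the end */
        case IMM16: return ext16;
        case IMM32: return ext32;
        default:    return ext64;
        }
    }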

    === ENDNOTE ===
    An encoding with multiple length specifiers might theoretically
    reduce the overhead of encoding the length for the more common
    short cases — perhaps by an entire bit!!!!☺ — but in addition to increasing the size overhead for longer instructions it would
    split large fields by inserting the length extension specifier.
    The extra size overhead also means that 32-bit and 64-bit
    immediates could further bloat the instruction by requiring an
    additional parcel for only a few bits. One bit of length
    information effectively becomes a marker bit per parcel, which is
    a technique that was used for some x86 pre-decoded caches and for
    one ISA that encoded immediates.

    If there were 16 instruction lengths, perhaps a split length
    specifier *might* make sense, but My 66000's five instruction
    lengths obviously do not take that much space. I believe the
    lengths are not fully orthogonal, so it does not take 2.32 bits.
    (I think only a store can be longer than 3 parcels, though perhaps
    some 3-input compute operations might be theoretically able to use
    two constant inputs.)

    Yet if the extra bits are not on a critical path (such as register specifiers), a clunkier mechanism like this might not be so horrible.
    Maybe only 0.2 x86s (the unit of measurement for ISA horror☺).

    The other considerations are frequency of occurrence of instructions
    and the relative cost of the length bits in the parse tokens.
    A 2-bit length field can be simple but in a 16-bit token it also
    only allows 25% of the opcode space for the shortest instructions,
    which is where opcodes are most precious.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Wed Dec 13 19:06:49 2023
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    When I want to write an unmisunderstandable formula I use CorelDraw
    and then export as *.jpg. {Everything, except NGs like this, can take
    *.jpgs.} And a Draw program can create symbols that are not in
    character Maps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Wed Dec 13 22:40:57 2023
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX. Then I could make reasonable-looking documents for customers that insisted on having docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added to
    word processors for decades. Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    <snip>

    Apart from that, the only benefits I see of newer LibreOffice over older
    ones is better handling of the insane chaos that MS Office uses for its
    file formats. LibreOffice is /much/ better at this than MS Office is, especially if the file has been modified by a number of different MS
    Office versions.

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Wed Dec 13 22:36:02 2023
    Quadibloc wrote:

    On Wed, 13 Dec 2023 12:31:50 +0000, Quadibloc wrote:

    I went back and checked. When I proposed what a variable-length coding
    for the Concertina instruction set would look like, at first glance it
    seemed as though I didn't follow that rule:

    Something like:

    0 - 16 bits
    1 - 32 bits, except
    111011001 32 bits + 16 bits
    111011010 32 bits + 32 bits
    111011011 32 bits + 64 bits
    1110111000 32 bits + 48 bits
    1110111001 32 bits + 32 bits
    1110111010 32 bits + 64 bits
    1110111011 32 bits + 128 bits
    11110 - 48 bits
    11111 - 64 bits
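
    For illustration only, a minimal sketch of decoding the total length from
    the leading bits of the first 16-bit parcel in the table above (my reading
    of the table, assuming the prefix occupies the most significant bits; the
    function name is made up):

    #include <stdint.h>

    static unsigned instr_length_bits(uint16_t parcel)
    {
        unsigned top5  = parcel >> 11;
        unsigned top9  = parcel >> 7;
        unsigned top10 = parcel >> 6;

        if ((parcel >> 15) == 0) return 16;
        if (top9  == 0x1D9)      return 32 + 16;   /* 111011001  */
        if (top9  == 0x1DA)      return 32 + 32;   /* 111011010  */
        if (top9  == 0x1DB)      return 32 + 64;   /* 111011011  */
        if (top10 == 0x3B8)      return 32 + 48;   /* 1110111000 */
        if (top10 == 0x3B9)      return 32 + 32;   /* 1110111001 */
        if (top10 == 0x3BA)      return 32 + 64;   /* 1110111010 */
        if (top10 == 0x3BB)      return 32 + 128;  /* 1110111011 */
        if (top5  == 0x1E)       return 48;        /* 11110      */
        if (top5  == 0x1F)       return 64;        /* 11111      */
        return 32;                                 /* all other 1... patterns */
    }

    The point is that every length is determined by the first parcel alone,
    so the Fetch-stage length parse stays a prefix test rather than a serial walk.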

    My 66000 can do::

    FADD R7,#1,R9 // the #1 is interpreted as +1.0D0
    FADD R7,#-1,R9 // the #-1 is interpreted as -1.0D0
    FMAC R7,R8,R9,#1 // a+b+1
    FDIV R7,#1,R9 // reciprocate
    CVTID R1,#28 // R7 = 28D0

    And all of these are 32-bit instructions. In addition we have::

    CVTFD R1,#377 // R7 = 377D0
    FADD R7,R8,#799 // R7 = R9+799D0
    FMAC R7,R8,#799,R9 // R7 = r8*799+R9

    as 64-bit instructions.

    ~00 Instruction is in the Major OpCode group
    00 Instruction can have long constants
    000 XOM Instruction is in the negative eXtended OpCode group
    001 XOP Instruction is in the positive eXtended OpCode group
    ----------------
    bits<15,11,14,12>
    0000 +Rs1 +Rs2
    0001 +Rs1 -Rs2
    0010 -Rs1 +Rs2
    0011 -Rs1 -Rs2
    0100 +Rs1 +imm5
    0101 +imm5 +Rs2
    0110 +Rs1 -imm5
    0111 -imm5 +Rs2
    1000 +Rs1 #imm32
    1001 #imm32 +Rs2
    1010 -Rs1 #imm32
    1011 #imm32 -Rs2
    1100 +Rs1 #imm64
    1101 #imm64 +Rs2
    1110 -Rs1 #imm64
    1111 #imm64 -Rs2

    The 5-bit immediates, when used in 32-bit FP calculations, are expanded
    into float32.
    The 5-bit immediates, when used in 64-bit FP calculations, are expanded
    into double64.
    The 32-bit immediates, when used in 64-bit FP calculations, are expanded
    into double64--this requires the compiler not put denorms in these
    constants.
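
    As a rough illustration of why the denorm restriction matters, a sketch
    (my own, with made-up names) of the straightforward field expansion of a
    float32 bit pattern into a double64 bit pattern; it is exact for zeros,
    normals, infinities and NaNs, but a denormal float32 input would need an
    extra normalize-and-rebias step that this simple rewiring does not do:

    #include <stdint.h>

    static uint64_t expand_f32_to_f64(uint32_t f)
    {
        uint64_t sign = (uint64_t)(f >> 31) << 63;
        uint32_t exp8 = (f >> 23) & 0xFF;
        uint64_t frac = (uint64_t)(f & 0x7FFFFF) << 29;  /* 23-bit -> 52-bit fraction */
        uint64_t exp11;

        if (exp8 == 0)
            exp11 = 0;                            /* +/-0; denorms not handled here */
        else if (exp8 == 0xFF)
            exp11 = 0x7FF;                        /* Inf / NaN */
        else
            exp11 = (uint64_t)exp8 - 127 + 1023;  /* rebias 8-bit -> 11-bit exponent */

        return sign | (exp11 << 52) | frac;
    }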

    I found it very useful to separate those instructions that can have long constants from those that cannot.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Dec 13 22:58:36 2023
    EricP wrote:

    Paul A. Clayton wrote:
    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]
    My point on VLEness is that all the position and length information is
    found in the first container of the instruction and not determined by
    a serial walk along the containers. IBM 360 is a lot less CISC than x86.
    Serial decode is definitely not RISC.
    Small field determines length, pointers, and sizes; remains RISCable
    if it does not violate other RISC tenets.

    I would have guessed that encoding a length to next length
    indicator would also be somewhat simple for decode when the
    additional chunk contains only opcode extension that does not
    affect routing (which includes some hints) or immediate data. In
    terms of parsing the instruction stream into instructions, such is
    not different than having a nop with the length specified in *its*
    first container. [see ENDNOTE]

    If one designs the ISA on the assumption that there will be separate
    stages for Fetch and Decode, and I think that's a good idea,
    then there are two parses taking place, the external inter-instruction
    parse performed by Fetch, and internal instruction field parse by Decode.

    My 66000 ISA was designed under the notion that there is::

    FETCH -- PARSE -- DECODE --

    The parse stage includes address comparison (hit 5-gates) and Set selection
    (4 gates) along with instruction length decode (4-gates). This leaves the
    SRAMs of Fetch 1-whole clock from flopped-address to flopped-data.

    The Fetch length parse needs to be simple *except* that Fetch needs to
    be able to pick off all conditional and unconditional branch, call, ret,
    and consult the branch predictors for which-path information.

    I am going to disagree, here, in that one can run fetch entirely from predictors without knowing if the previous fetch satisfied this or that.
    There is time to sort this out later as long as the predictor is good.

    Additionally for BRcc/CALL Fetch needs access to the branch offset,

    None of {R2000, SPARC V8, Mc88100, CRIPS} did that in fetch, we
    all did that in decode--hence the delay slot.

    which is an internal parse, to add to its future RIP
    in case the predictor says to follow the alternate path.

    The predictor that says: "follow the alternate path" can supply
    an index (6-8 bits) and access alternate path instructions
    {and sort out the minutia later}.

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    Everything else is an internal parse by Decode where it is mostly a matter
    of chopping things up.

    Once your ISA bites off on VLE, you basically need a PARSE stage
    {or DECODE 1 & 2 stages}. The PARSE stage (as mentioned above)
    can absorb the hit and set select gate delays, taking pressure off
    FETCH and DECODE. What PARSE delivers to DECODE is the instruction-
    specifiers of all instructions to be DECODEd that cycle {and unary
    pointers to constants--which become more inputs to the forwarding
    logic}.

    For instructions with immediates, RISC-V designers seemed to be very concerned about sign/zero extension delay and the
    location of the sign bit but I'm not sure why - to me it looks like a
    single mux delay at the end. And if all immediates are parsed by Fetch, because it needs the BR/CALL offset, then these might arrive in the
    Decode input buffer already parsed and sign extended.

    Based on the frequencies RISC-V implementations have achieved to
    date--this is a poor assumption.

    In My 66000 case, there are at least 10 gates of delay to perform sign
    extension prior to consuming the constant at forwarding.

    === ENDNOTE ===
    An encoding with multiple length specifiers might theoretically
    reduce the overhead of encoding the length for the more common
    short cases — perhaps by an entire bit!!!!☺ — but in addition to
    increasing the size overhead for longer instructions it would
    split large fields by inserting the length extension specifier.
    The extra size overhead also means that 32-bit and 64-bit
    immediates could further bloat the instruction by requiring an
    additional parcel for only a few bits.

    And THAT is why you don't do it that way!!

    One bit of length
    information effectively becomes a marker bit per parcel, which is
    a technique that was used for some x86 pre-decoded caches and for
    one ISA that encoded immediates.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Dec 13 23:10:38 2023
    BGB wrote:

    On 12/13/2023 8:47 AM, EricP wrote:
    Paul A. Clayton wrote:
    On 12/7/23 9:36 PM, MitchAlsup wrote:
    [snip]

    Luckily, if one can classify each instruction word into one of:
    16-bit op;
    32-bit scalar op;
    32-bit bundle;
    32-bit jumbo prefix.

    By eliminating 16-bit Ops, headers and prefixes (except in rare circumstances) one gets rid of the mess.

    Then looking at 1 or 2 instruction words for a 1-3 word instruction
    isn't too much of an ask.


    Additionally for BRcc/CALL Fetch needs access to the branch offset,

    Technically it is a displacement.....

    which is an internal parse, to add to its future RIP
    in case the predictor says to follow the alternate path.

    By the time you are doing 4-wide and wider, you quit thinking
    like this. You predict it, and sort it out later. In Mc88120
    we did not verify correct branch target address until the
    branch instruction executed. This did not show up on the top
    10 things slowing the CPU down.

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    So, have 1 size for conditional and 1 size for unconditional--

    AND THEN, you build a sign extending adder that adds 64+16 and
    produces the correct sign extended 64-bit result. Guys it is not
    1975 anymore !! Why do you think this is a serial process ??

    Also Note:: The address produced by the branch target adder only
    needs to produce enough bits to index the cache {you can sort out
    all the harder stuff later} in the DECODE stage of the pipeline.
    A check for page crossing (because you don't necessarily need to
    access the TLB) finishes the problem.

    More or less how it is done in my case, except it works by computing PC
    + one of several different branch-sizes (8s, 11s, and 20s), and if the

    With the cache sizes you have shown in the past (and word accesses)
    you probably don't need to calculate more than 11-bits in the decode
    cycle. Once you have enough bits to index the SRAM macro which comprises
    the cache, every other bit needed to finish the calculation can be deferred
    to later.

    corresponding branch hits (matches the pattern and is selected as
    "taken") it then uses this output as the destination (via MUX'ing).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to MitchAlsup on Thu Dec 14 09:10:04 2023
    On 13/12/2023 20:06, MitchAlsup wrote:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    When I want to write an unmisunderstandable formula I use CorelDraw
    and then export as *.jpg. {Everything, except NGs like this, can take *.jpgs.} And a Draw program can create symbols that are not in
    character Maps.

    Drawing programs can be useful if you want some unusual symbols (though
    they would have to be /very/ unusual if they are not in some package on
    CTAN). And of course you can be /much/ freer with the layout of the maths.

    If I needed something with such freehand layout, I'd write it on paper
    and scan it in, as it is much faster to do. That's also fine for notes,
    or documentation only read by the development team. But it is not "professional" quality, if that is important for the job in hand.

    If you are making such files, I'd suggest png as a better format than
    jpg - it is far better suited to sharp contrast images. jpg is for
    photographs and similar images, and will blur the lines and figures on
    drawn maths. (Or use a vector image format, like svg.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to MitchAlsup on Thu Dec 14 09:57:55 2023
    On 13/12/2023 23:40, MitchAlsup wrote:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make
    reasonable-looking documents for customers that insisted on having
    docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added
    to word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Word 2007, according to Wikipedia, google, and the never-wrong internet community. Prior to that, people used "pdf printers" which gave basic
    pdf output (image only - no links, contents, cross-references, etc.).
    Or they did a lot of manual work using expensive Adobe Acrobat Writer
    tools so that they could add the "active" bits.

    My experience with MS Office is mostly in helping others - I haven't had
    that overrated, overpriced monstrosity on a computer since Word for
    Windows 2.0 on Windows 3.1. But it seems that these days it does a lot
    better job at exporting pdfs than it used to. I did a quick test with
    online Office 365 with an old LibreOffice document, and Office 365 did
    get the table of contents right when exporting, and the cross-references
    had the right section numbers and were clickable. But it failed to get
    the cross-referenced section names in the pdf, despite showing them fine
    in the docx file it was editing. It was not an extensive test, and the original document was written with LibreOffice, not MS Office. (It was exported from LibreOffice in docx, thus it was in the official ISO
    standard ooxml format, rather than the screwed up version of that which
    MS Office prefers.)


    <snip>

    Apart from that, the only benefits I see of newer LibreOffice over
    older ones is better handling of the insane chaos that MS Office uses
    for its file formats.  LibreOffice is /much/ better at this than MS
    Office is, especially if the file has been modified by a number of
    different MS Office versions.

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how many
    people don't know how to do that.

    It is surprising that anyone would want them to. Why not just install LibreOffice, and have a tool that is better at reading MS Word generated
    files than any version of MS Word ever was?

    Of course, people should not be sending .docx or any other source-format
    file unless they expect you to edit the document - finished documents
    should always be sent in pdf format.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Thu Dec 14 14:41:32 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX. Then I could make reasonable-looking
    documents for customers that insisted on having docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added to
    word processors for decades. Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Are you sure about that? IIRC it was a decade later before
    adobe wasn't required.

    <snip>

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    I ask for PDF's. I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to David Brown on Thu Dec 14 18:05:23 2023
    David Brown wrote:

    On 13/12/2023 23:40, MitchAlsup wrote:


    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make
    reasonable-looking documents for customers that insisted on having
    docx format instead of pdf.

    I don't think there has been much exciting or important (to me) added
    to word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Word 2007, according to Wikipedia, google, and the never-wrong internet community.

    I worked at AMD 1999-2006 and we used save as to *.pdf all the time.
    {This would have been the professional version of WORD/Office.}
    The student version 2003 also has this, I still have the CD-ROM.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Thu Dec 14 14:19:14 2023
    MitchAlsup wrote:
    On 12/13/2023 8:47 AM, EricP wrote:

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    So, have 1 size for conditional and 1 size for unconditional--

    I have 2 branch formats, one for small 16b size offset with 16b opspec,
    and one for medium 32b and large 64b offsets with 32b opspec.

    Later I added a 3rd format for compare and branch when there are
    two variable size immediates, one for offset and one for compare value.
    The offset is the first immediate so it starts in a known buffer location.

    AND THEN, you build a sign extending adder that adds 64+16 and
    produces the correct sign extended 64-bit result. Guys it is not
    1975 anymore !! Why do you think this is a serial process ??

    I didn't say serial.
    I was thinking of starting all 3 offset-size adds (16b, 32b, 64b)
    immediately, before knowing the instruction type or size,
    then using the actual type and size to select the correct result.
    The 64b adds could be further subdivided as 4 * 16b adders then
    combine the size select with 16b carry select to assemble a 64b result.

    Which is why I said I thought this was just a mux delay at the end.

    Also Note:: The address produced by the branch target adder only
    needs to produce enough bits to index the cache {you can sort out
    all the harder stuff later} in the DECODE stage of the pipeline.
    A check for page crossing (because you don't necessarily need to access
    the TLB) finishes the problem.

    In the hypothetical design I have in mind the instruction bytes
    get parsed from fetch buffers, whose job is to hide the pipeline
    latency to I$L1, and also allow prefetch for possible alternate path.
    It also allows local looping and replay out of the fetch buffers.

    In that design the full 64b parse RIP is needed as a tag for
    selecting from the multiple fetch buffers.
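
    A tiny sketch of that arrangement (buffer count, sizes and names are my
    own assumptions, not the actual design): each fetch buffer is tagged with
    the full RIP of its first byte, and the parse stage selects whichever
    buffer - sequential or alternate path - its current parse RIP falls into.

    #include <stdint.h>
    #include <stddef.h>

    #define FETCH_BUFS   4
    #define FETCH_BYTES  32

    typedef struct {
        uint64_t base_rip;             /* tag: RIP of buf[0] */
        uint8_t  buf[FETCH_BYTES];     /* raw instruction bytes from the I$ */
        int      valid;
    } FetchBuffer;

    static const uint8_t *parse_bytes(const FetchBuffer bufs[], uint64_t parse_rip)
    {
        for (int i = 0; i < FETCH_BUFS; i++) {
            uint64_t off = parse_rip - bufs[i].base_rip;   /* wraps if below base */
            if (bufs[i].valid && off < FETCH_BYTES)
                return &bufs[i].buf[off];    /* bytes for the instruction at parse_rip */
        }
        return NULL;                         /* miss: wait for fetch to fill a buffer */
    }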

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Thu Dec 14 21:57:39 2023
    EricP wrote:

    MitchAlsup wrote:
    On 12/13/2023 8:47 AM, EricP wrote:

    And, as in my case, there could be multiple branch offset sizes
    which interacts with the length parse, sign extension delay,
    and the final RIP add result delay.

    So, have 1 size for conditional and 1 size for unconditional--

    I have 2 branch formats, one for small 16b size offset with 16b opspec,
    and one for medium 32b and large 64b offsets with 32b opspec.

    Later I added a 3rd format for compare and branch when there are
    two variable size immediates, one for offset and one for compare value.
    The offset is the first immediate so it starts in a known buffer location.

    AND THEN, you build a sign extending adder that adds 64+16 and
    produces the correct sign extended 64-bit result. Guys it is not
    1975 anymore !! Why do you think this is a serial process ??

    I didn't say serial.
    I was thinking of starting all 3 offset-size adds (16b, 32b, 64b)
    immediately, before knowing the instruction type or size,
    then using the actual type and size to select the correct result.
    The 64b adds could be further subdivided as 4 * 16b adders then
    combine the size select with 16b carry select to assemble a 64b result.

    Which is why I said I thought this was just a mux delay at the end.

    I am trying to tell you to put that mux in the subsequent cycle.

    A 16-bit address can access a 64KB cache. A 64KB cache is bigger than
    we will be willing to build. So, to access the cache all we need is
    the lower order bits, and all 3 formats are the same here, so we
    add 16 bits and start accessing the cache. Then we flop the carry out
    and in the subsequent cycle we add the bits bigger than 16 while the
    cache is being accessed. And now, at the time when the tag is available,
    so are the rest of the address bits.
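
    A sketch of that split add (my own illustration of the scheme just
    described, assuming a 16-bit branch displacement; the names are made up):
    the low 16 bits come out first and are enough to start indexing the
    cache, the carry is flopped, and the upper bits are finished in the next
    cycle while the SRAM is being read.

    #include <stdint.h>

    typedef struct {
        uint16_t index;   /* low 16 bits of the target: enough to index the cache */
        unsigned carry;   /* carry out of bit 15, flopped for the next cycle */
    } LowAdd;

    static LowAdd add_low(uint64_t rip, int16_t disp)
    {
        unsigned sum = (unsigned)(rip & 0xFFFF) + (uint16_t)disp;
        LowAdd r = { (uint16_t)sum, (sum >> 16) & 1u };
        return r;
    }

    /* Next cycle: finish the upper bits and glue the full target together. */
    static uint64_t add_high(uint64_t rip, int16_t disp, LowAdd lo)
    {
        uint64_t sign = (disp < 0) ? ~(uint64_t)0 : 0;   /* sign extension of disp */
        uint64_t hi   = (rip >> 16) + (sign >> 16) + lo.carry;
        return (hi << 16) | lo.index;
    }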

    Also Note:: The address produced by the branch target adder only
    needs to produce enough bits to index the cache {you can sort out
    all the harder stuff later} in the DECODE stage of the pipeline.
    A check for page crossing (because you don't necessarily need to access
    the TLB) finishes the problem.

    In the hypothetical design I have in mind the instruction bytes
    get parsed from fetch buffers, whose job is to hide the pipeline
    latency to I$L1, and also allow prefetch for possible alternate path.
    It also allows local looping and replay out of the fetch buffers.

    I call this the instruction buffer and hold both sequential and
    alternate path instructions for decode. I can access a whole cache
    line per cycle, so I have little problem feeding the sequential
    path--but instead of fetching whole cache lines, I fetch four ¼
    cache lines and a next fetch predictor. Each I$ access uses a
    7-bit index and a 3-bit set {turning the 4-way cache into direct
    mapped cache} and another 7+bit index to the fetch predictor.

    void accessICache(Fetch fetch)
    {
        static Index index;                 // per-column indices from the previous next-fetch prediction

        for( i = 0; i < SETS; i++ )         // read a ¼ cache line from each column's SRAM
            InstBuf[fetch+i] = column[i].SRAM[index[i]];
        index = FetchPredictor[index[5]];   // the next-fetch predictor supplies the next indices
    }

    In that design the full 64b parse RIP is needed as a tag for
    selecting from the multiple fetch buffers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Sat Dec 16 13:49:26 2023
    David Brown <david.brown@hesbynett.no> schrieb:
    On 10/12/2023 11:39, Thomas Koenig wrote:

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    When you work at a company that prescribes (sort of) a certain
    format, that is one possibility. I did the cover sheet in Word,
    though, and pasted it together as PDF.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to George Neuner on Sat Dec 16 14:20:56 2023
    George Neuner <gneuner2@comcast.net> schrieb:
    On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    [...]

    My personal favorites for parallel programming are PGAS languages
    like (yes, you guessed it) Fortran, where the central data
    structure is the coarray.

    Mileage varies considerably and I don't intend to start a language
    war: a lot has to do with the history of parallel applications a
    person has developed. I can respect your point of view even though I
    don't agree with it.

    My favorite model is CSP (ala Hoare) with no shared memory. Which is
    not to say I don't use threads, but I try to design programs such that threads (mostly) are not sharing writable data structures.

    Which is a sound idea. Inadvertently shared variables are a major
    source of errors in OpenMP, for example.

    You have to make sure that you synchronize before accessing data
    that has been modified on another image.

    And that's where most languages fall down: they provide either
    primitives which are too low level and too hard to use correctly, or
    they provide high level mechanisms that don't scale well and are too
    limiting in actual use.

    I think Fortran has gotten many things right here, at least for the
    domain of scientific computing - the complexity is manageable.

    For those who are interested, I've written a short tutorial about
    Fortran coarrays, which can be found at

    https://github.com/tkoenig1/coarray-tutorial/blob/main/tutorial.md

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Sat Dec 16 16:32:49 2023
    On 16/12/2023 14:49, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 10/12/2023 11:39, Thomas Koenig wrote:

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    When you work at a company that prescribes (sort of) a certain
    format, that is one possibility. I did the cover sheet in Word,
    though, and pasted it together as PDF.

    Ah, your aim was to make the LaTeX documents look like the corporate
    standard, which happened to be made in Word. That makes a lot more
    sense. Typical out-of-the-box Word templates (and LibreOffice, and
    every other word processor I have seen) all look amateur in comparison
    to LaTeX layouts. But company standardisation trumps quality
    typesetting. (And maybe you are one of the lucky people whose company
    standard templates are well designed.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Chris M. Thomasson on Sat Dec 16 16:28:08 2023
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some
    trouble to make them appear visually like the Word template du jour
    (but the formulas gave it away, they looked too nice for Word).


    What a strange thing to do - that sounds completely backwards to me!

    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make reasonable-looking documents for customers that insisted on having docx format instead
    of pdf.

    I don't think there has been much exciting or important (to me)
    added to
    word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Are you sure about that?  IIRC it was a decade later before
    adobe wasn't required.

      <snip>

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    I ask for PDF's.   I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    Try to stay far away from windows office docs; they can be filled with interesting macros - well, back in the day! I do remember a lot of print
    to PDF programs. Mock up a printer device, print, produce a file.

    They are only a problem if you use MS Office. LibreOffice, and its predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript. It's not a problem if
    you use a decent pdf viewer, but if you use Adobe Acrobat on Windows,
    you can definitely be at risk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 19 03:14:05 2023
    Thomas Koenig wrote:

    George Neuner <gneuner2@comcast.net> schrieb:
    On Sun, 10 Dec 2023 10:56:36 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:


    And that's where most languages fall down: they provide either
    primitives which are too low level and too hard to use correctly, or
    they provide high level mechanisms that don't scale well and are too
    limiting in actual use.

    I think Fortran has gotten many things right here, at least for the
    domain of scientific computing - the complexity is manageable.

    For those who are interested, I've written a short tutorial about
    Fortran coarrays, which can be found at

    https://github.com/tkoenig1/coarray-tutorial/blob/main/tutorial.md

    Nicely done.

    Notice how they specified "the what" without specifying "the how".
    Notice that C and C++ atomics do the reverse.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 19 03:29:22 2023
    Paul A. Clayton wrote:

    On 12/6/23 2:54 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    [snip]
    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    We already have data-flow microarchitectures since the mid-1990s, with
    the success of OoO execution. And the "von Neumann" ISAs have proven
    to be a good and long-term stable interface between software and these
    data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
    and their microarchitectures turned out to be uncompetitive.

    I suspect that a superior interface could be designed which
    exploits diverse locality (i.e., data might naturally be closer to
    some computational resources than to others) and communication
    (and storage) costs and budgets (budgets being related to urgency
    and importance).

    Consider the inner loop of DGEMM on a properly resourced GBOoO::

    LD |AGEN |cache|align|reslt|
    LD |AGEN |cache|align|reslt|
    FMUL | ex1 | ex2 | ex3 | ex4 |reslt|
    FADD | ex1 | ex2 | ex3 |reslt|
    ST |AGEN |cache|-----------------------------------------------|align|write|
    LOOP | inc |

    LD |AGEN |cache|align|reslt|
    LD |AGEN |cache|align|reslt|
    FMUL | ex1 | ex2 | ex3 | ex4 |reslt|
    FADD | ex1 | ex2 | ex3 |reslt|
    ST |AGEN |cache|-----------------------------------------------|align|write|
    LOOP | inc |

    LD |AGEN |cache|align|reslt|
    LD |AGEN |cache|align|reslt|
    FMUL | ex1 | ex2 | ex3 | ex4 |reslt|
    FADD | ex1 | ex2 | ex3 |reslt|
    ST |AGEN |cache|-----------------------------------------------|align|write|
    LOOP | inc |

    This is about as much Out-of-order as one gets in a GBOoO machine.
    Andy Glew would say that this is "not that much" out of order.
    Notice that every function unit sees every operation in program
    order and that the only OoO-ness is the latency of calculations.


    I think the original dataflow architectures
    attempted to be very general with significant overhead for
    readiness determination and communication. They also (as I
    understand) lacked value prediction whereas OoO effectively uses
    value prediction for branches.

    The original data-flow architectures overdid their ability to
    discover parallelism. And whereas von Neumann ISAs struggled to
    find parallelism, data-flow architectures found "way too much".
    They found so much parallelism that they got diverted down a
    dark alley for a decade trying to "so manage" the parallelism
    so their queuing structures did not overflow and crash the
    machine. They stumbled upon parallelism in the 10s of millions
    of instructions that could be fired each cycle. Primarily, they
    failed in trying to rein in the parallelism down to the point
    where they could build actual machines.

    They died from finding too much not from finding too little.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Tue Dec 19 03:36:06 2023
    I have made some further minor modifications.

    I changed where, in the opcode space, the supplementary
    memory-reference instructions were located. This allowed
    me to have a few more bits available for them.

    Also, I added a mechanism for a set of instructions
    longer than 32 bits that can be used without
    recourse to a block header of any kind, so that they
    can be slipped into code in the format formerly
    consisting purely of 32-bit instructions. This is very
    inefficient, though, and so the previous format for
    long instructions is also kept.

    But this way, the basic instruction set used from
    unblocked code is open-ended, which I think is important.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Dec 19 07:22:10 2023
    On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:

    I changed where, in the opcode space, the supplementary memory-reference instructions were located. This allowed me to have a few more bits
    available for them.

    I've moved them again, making even more space available... because in
    my last change, I made the mistake of using the opcode space that I
    was already using for block headers. I couldn't reduce the amount of information in a block header by two bits, by using a combination of
    ten bits instead of eight to indicate a block header, so I had to do
    my rearranging in this place instead.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Tue Dec 19 14:30:25 2023
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 12/6/23 2:54 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    [snip]
    One of my professors back in the late 70's was researching
    data flow architectures. Perhaps it's time to reconsider the
    unit of compute using single instructions, instead providing a
    set of hardware 'functions' than can be used in a data flow environment.

    We already have data-flow microarchitectures since the mid-1990s, with
    the success of OoO execution. And the "von Neumann" ISAs have proven
    to be a good and long-term stable interface between software and these
    data-flow microarchitectures, whereas the data-flow ISAs of the 1970s
    and their microarchitectures turned out to be uncompetitive.

    I suspect that a superior interface could be designed which
    exploits diverse locality (i.e., data might naturally be closer to
    some computational resources than to others) and communication
    (and storage) costs and budgets (budgets being related to urgency
    and importance). I think the original dataflow architectures
    attempted to be very general with significant overhead for
    readiness determination and communication. They also (as I
    understand) lacked value prediction whereas OoO effectively uses
    value prediction for branches.

    I would argue that the Cavium coprocessors are data flow at the
    level envisioned in the 1970s research. The data (network packets)
    are presented (queued) to the appropriate coprocessor as it
    flows through the configured set of coprocessors for the data
    stream.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Thomas Koenig on Tue Dec 19 17:34:34 2023
    On Fri, 10 Nov 2023 22:03:23 +0000, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are general-purpose registers.

    This would make your ISA very un-S/360-like.

    Well, I felt that _some_ compromises had to be made; otherwise,
    there was no way instructions with base-index addressing _and_
    16-bit displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of
    RISC and CISC in a single ISA.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Dec 19 17:47:25 2023
    On Tue, 19 Dec 2023 07:22:10 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:

    I changed where, in the opcode space, the supplementary
    memory-reference instructions were located. This allowed me to have a
    few more bits available for them.

    I've moved them again, making even more space available... because in my
    last change, I made the mistake of using the opcode space that I was
    already using for block headers. I couldn't reduce the amount of
    information in a block header by two bits, by using a combination of ten
    bits instead of eight to indicate a block header, so I had to do my rearranging in this place instead.

    And now, with what I've learned from this experience, I've made further changes. I've increased the length of the opcode field in the supplementary memory-reference instructions that were moved to be among the other memory-reference instructions, so as to have enough for the different
    sizes of the various types to be supported.

    But in addition, I have now engaged in what some may see as an act of
    pure evil.

    Once again there are supplementary memory-reference instructions among
    the operate instructions as well. *These*, however, provide for the conventional integer and floating-point types, CISC-style memory to
    register operate instructions! So even within the basic 32-bit instruction
    set, although _these_ instructions are highly restricted in register use
    and addressing modes, the pretense of being a load-store architecture
    has been dropped!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Tue Dec 19 17:39:19 2023
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II. (The Motorola 68050 did it
    the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Dec 19 17:56:03 2023
    On Tue, 19 Dec 2023 17:39:19 +0000, Quadibloc wrote:

    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II. (The Motorola 68050 did it the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would create better performance numbers.)

    Oops, the Motorola 68060.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Tue Dec 19 18:29:41 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II.

    This is the first time I have seen that claimed. What makes you think
    so?

    Everything I have read about the Pentium Pro indicates that it has
    complete OoO with speculation and precise exceptions (and neither
    speculation nor precise exceptions would work with FP-only OoO, as
    demonstrated by the Model 91 which has neither and is infamous for its imprecise exceptions).

    (The Motorola 68050 did it
    the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    According to <https://en.wikipedia.org/wiki/Motorola_68000_series#68050_and_68070>,
    there was no 68050. According to <https://en.wikipedia.org/wiki/68060>:

    |The 68060 shares most architectural features with the P5 Pentium. Both
    |have a very similar superscalar in-order dual instruction pipeline
    |configuration

    I.e., no OoO.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Wed Dec 20 01:03:33 2023
    On Tue, 19 Dec 2023 18:29:41 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by
    the Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II.

    This is the first time I have seen that claimed. What makes you think
    so?

    Everything I have read about the Pentium Pro indicates that it has
    complete OoO with speculation and precise exceptions (and neither
    speculation nor precise exceptions would work with FP-only OoO, as demonstrated by the Model 91 which has neither and is infamous for its imprecise exceptions).


    Back in 2019-03 I tried to educate John about OoO in Pentium-Pro and
    Pentium-II but failed miserably. Now I understand that I have only myself
    to blame for the failure - I was not sufficiently polite.
    I wish you better luck.


    (The Motorola 68050 did it
    the other way around, only having OoO for integers, and not for FP, figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    According to <https://en.wikipedia.org/wiki/Motorola_68000_series#68050_and_68070>,
    there was no 68050. According to
    <https://en.wikipedia.org/wiki/68060>.

    |The 68060 shares most architectural features with the P5 Pentium. Both
    |have a very similar superscalar in-order dual instruction pipeline
    |configuration

    I.e., no OoO.

    - anton

    Maybe he was thinking about the MPC740/750? That was one of the more
    successful PowerPC cores. Even after withdrawal from the personal
    computing market it lived for many more years as the Freescale e600 core.
    Its integer side can be described as 'barely OoO', but OoO nevertheless.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Wed Dec 20 01:12:43 2023
    On Tue, 19 Dec 2023 17:19:24 -0600, BGB wrote:
    On 12/19/2023 11:34 AM, Quadibloc wrote:

    Well, I felt that _some_ compromises had to be made; otherwise, there
    was no way instructions with base-index addressing _and_ 16-bit
    displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of RISC
    and CISC in a single ISA.

    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely
    overkill;
    Global variables, where for the general case it is almost entirely insufficient.

    Yes, but does that mean that 16-bit displacements are a bad idea?

    The Motorola 68000, the 8086, the PowerPC, and lots of other architectures
    all had them.

    So: what are 16-bit displacements _for_? *Local* variables, of course. Allocate
    one base register to the start of the data area for a program, and another
    base register to the start of the program area for a program, and you're
    done.

    The architecture also provides _one_ base register that works with 15-bit displacements. This allows instructions to have a smaller format if that
    base register is used.

    And then there's another seven registers allocated as base registers that
    work with 12-bit displacements. If you want to save the base registers with 16-bit displacements, then you can use those.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Dec 20 00:17:42 2023
    BGB wrote:

    On 12/19/2023 11:34 AM, Quadibloc wrote:
    On Fri, 10 Nov 2023 22:03:23 +0000, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures: (Almost) all registers are
    general-purpose registers.

    This would make your ISA very un-S/360-like.

    Well, I felt that _some_ compromises had to be made; otherwise,
    there was no way instructions with base-index addressing _and_
    16-bit displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of
    RISC and CISC in a single ISA.


    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely overkill; Global variables, where for the general case it is almost entirely insufficient.

    EMBench is filled with stack frames illustrating RISC-V's 12-bit
    immediates are not big enough.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Dec 20 01:46:29 2023
    On Wed, 20 Dec 2023 01:12:43 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 17:19:24 -0600, BGB wrote:
    On 12/19/2023 11:34 AM, Quadibloc wrote:

    Well, I felt that _some_ compromises had to be made; otherwise, there
    was no way instructions with base-index addressing _and_ 16-bit
    displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of RISC
    and CISC in a single ISA.

    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely
    overkill;
    Global variables, where for the general case it is almost entirely
    insufficient.

    Yes, but does that mean that 16-bit displacements are a bad idea?

    The Motorola 68000, the 8086, the PowerPC, and lots of other
    architectures all had them.

    So: what are 16-bit displacements _for_? *Local* variables, of course.
    Allocate one base register to the start of the data area for a program,
    and another base register to the start of the program area for a
    program, and you're done.

    The architecture also provides _one_ base register that works with
    15-bit displacements. This allows instructions to have a smaller format
    if that base register is used.

    And then there's another seven registers allocated as base registers
    that work with 12-bit displacements. If you want to save the base
    registers with 16-bit displacements, then you can use those.

    And I forgot to mention: there are _another_ seven registers allocated
    as base registers that work with 20-bit displacements. The instructions
    using them, though, are all longer than 32 bits. I did not include this
    feature because I thought there was a need for it, but because it had
    been added in z/Architecture; so it's there for ease in translating
    programs over.

    As the architecture provides for instructions longer than 32 bits, I
    could indeed add instructions which contained a full 64-bit address,
    or instructions with 32-bit displacements. The first of those two
    possibilities certainly is a simple way to deal with, in a pinch,
    a single external variable without tying up one base register just
    for it.

    I haven't made a place for that feature yet, but one thing I do have
    is Array Mode. So if a program has a lot of large arrays, such that they
    don't all fit into the 64K that one base register can cover, instead
    of using several registers to cover those arrays, one base register
    points to a table of array addresses - and the displacement picks
    out the array address, and the index register contents are added to
    that address to find the pointer to the operand. It's basically a form
    of post-indexed indirect addressing.
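
    A minimal sketch in C of the effective-address calculation Array Mode
    implies; the 8-byte table entries and the function name are assumptions
    for illustration, not part of any published encoding:

        #include <stdint.h>

        /* Array Mode: the base register points at a table of array
           addresses, the displacement selects one table entry, and the
           index register is added to the fetched array address. */
        uint64_t array_mode_ea(const uint64_t *table,  /* base register: table of array addresses */
                               unsigned disp,          /* displacement: selects a table entry */
                               uint64_t index)         /* index register contents */
        {
            uint64_t array_addr = table[disp];   /* one level of indirection */
            return array_addr + index;           /* index applied after the fetch */
        }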

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Dec 20 03:30:34 2023
    BGB wrote:

    On 12/19/2023 6:17 PM, MitchAlsup wrote:
    BGB wrote:

    On 12/19/2023 11:34 AM, Quadibloc wrote:
    On Fri, 10 Nov 2023 22:03:23 +0000, Thomas Koenig wrote:
    Quadibloc <quadibloc@servername.invalid> schrieb:

    For 32-bit instructions, the only implication is that the first few
    integer registers would be used as index registers, and the last few
    would be used as base registers, which is likely to be true in any
    case.

    This breaks with the central tenet of the /360, the PDP-11,
    the VAX, and all RISC architectures:  (Almost) all registers are
    general-purpose registers.

    This would make your ISA very un-S/360-like.

    Well, I felt that _some_ compromises had to be made; otherwise,
    there was no way instructions with base-index addressing _and_
    16-bit displacements would fit into 32 bits.

    So this isn't a decision I can reverse. Yes, it has its problems,
    but it's an unavoidable result of my goal of combining aspects of
    RISC and CISC in a single ISA.


    As I see it, there are two major situations:
    Stack frames and structs, where a 16-bit displacement is likely overkill;
    Global variables, where for the general case it is almost entirely
    insufficient.

    EMBench is filled with stack frames illustrating RISC-V's 12-bit
    immediates are not big enough.

    As mentioned before, if you scale the displacement here, it is like it
    is 3 bits bigger.

    RISC-V is a weak case here because:
    The displacements are unscaled;
    The displacements are signed.

    And yet, it has sucked all the oxygen out of the room..........

    For stack frames, this effectively loses 4 bits, so RISC-V's 12-bit displacement is roughly equivalent to 8 bits in my scheme...

    Well, combined with the issue that exceeding the +/- 2K limit in RISC-V
    sucks (there is no low-cost fallback strategy).

    Universal constants solve that RISC-V's problem.

    But, generally, not many stack frames seem to have much issue with the current 4K limit.

    Granted, I just ran into a watched benchmark that makes RISC-V look
    less optimal than those sucking the oxygen out of the room.

    If it were more of an issue, could potentially add a few ops to extend
    the limit to around 32K in XG2 mode. Say:
    MOV.Q (SP, Disp12u*8), Rn
    MOV.Q Rn, (SP, Disp12u*8)
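
    A minimal sketch of the effective address such an op would imply,
    assuming the 12-bit field is unsigned and scaled by the 8-byte operand
    size (4096 slots of 8 bytes is the ~32K reach mentioned above); the
    function name is illustrative only:

        #include <stdint.h>

        /* Unsigned 12-bit displacement, scaled by 8: reach above SP is
           4096 * 8 = 32768 bytes. */
        uint64_t movq_sp_ea(uint64_t sp, uint32_t disp12u)
        {
            return sp + ((uint64_t)(disp12u & 0xFFFu) << 3);
        }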

    You still need 64-bit displacements for when we have Atta Byte address spaces.......

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Wed Dec 20 04:38:47 2023
    On Tue, 19 Dec 2023 20:07:09 -0600, BGB wrote:
    On 12/19/2023 7:12 PM, Quadibloc wrote:

    Yes, but does that mean that 16-bit displacements are a bad idea?

    It is, if they end up hurting the encoding in some other way, like
    making the register fields smaller or eating too much of the opcode
    space.

    In some previous iterations of the Concertina II architecture, I
    did follow the SEL 32 architecture, in having instructions that
    only accessed aligned locations in memory. Then I used the bits
    at the end of the address to indicate the length of the operand
    in a way similar to what the SEL 32 did. This is sort of like
    shortening the address by scaling it.

    But I ended up not having to do that. I still had enough room
    for load-store memory-reference instructions. I did lose
    opcode space, because now I couldn't have 16-bit instructions.

    But my 16-bit instructions themselves had a restriction on
    register use that was bad; so I replaced them with 17-bit
    instructions (they can be used, but with an overhead of one
    32-bit header word in a block).

    Sometimes, I also considered replacing 16-bit displacements by
    15-bit displacements, but those designs also ended up not going
    anywhere.

    But then, I've been able to consider a wide variety of design
    alternatives in Concertina II precisely because having the block
    format means I potentially have as much opcode space available to
    me as I want. Spend four bits per block, and get 36-bit instructions
    instead of 32-bit instructions, for example.

    One major goal - not one that I've discussed much - is to make
    Concertina II look a lot like a conventional RISC architecture.

    Of course it does plenty of things that no conventional RISC
    architecture does, but except for the fact that I can only use
    seven (instead of 31) of the 32 integer registers as index
    registers (or, rather, as base registers, since when your
    displacement is too short to cover memory, you need a base
    register *first*)... its instruction set is basically a
    _superset_ of a conventional RISC architecture.

    You've got load-store memory-reference instructions, with
    16-bit displacements, like typical microprocessors (and unlike
    the System/360, which got along just fine with just 12 bits).

    You've got three-address register to register operate
    instructions - with a C bit to turn on or off affecting the
    condition codes.

    Just like some typical RISC designs!

    But then the block structure lets you do things never seen
    except in VLIW designs (instruction predication, explicitly
    indicating certain instructions may execute in parallel).

    And the instruction set also goes into CISC territory. The
    block structure makes it *obvious*, even to an idiot, that
    one can process the header, and then locate instructions to
    be decoded without going through the whole block serially.

    Mitch has pointed out that a simple length prefix
    scheme _can_ be decoded really quickly, but I want even
    implementors who aren't as smart as Mitch to be able to
    implement Concertina II properly so it runs fast. Length
    prefixes invite people to decode them serially.

    Concertina II seems almost "architecture-agnostic", as it
    doesn't know if it's RISC, CISC, or VLIW. But in thinking
    about it, my intention is perhaps this: to take a CISC
    instruction set, but use RISC and VLIW packaging on it,
    so that the performance advantages of RISC and VLIW can
    be given to a CISC instruction set.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Dec 20 09:27:09 2023
    On Wed, 20 Dec 2023 03:30:34 +0000, MitchAlsup wrote:
    BGB wrote:

    RISC-V is a weak case here because:
    The displacements are unscaled;
    The displacements are signed.

    And yet, it has sucked all the oxygen out of the room..........

    I know that I felt the 68000 using signed displacements was
    a major weakness of the architecture, and on the Macintosh it did
    lead to segments being half as large as they could have otherwise
    been.

    Granted, I just ran into a watched benchmark that makes RISC-V look
    less optimal than those sucking the oxygen out of the room.

    Ah.

    So x86/x86-64 and ARM are the ones _really_ sucking most of the oxygen
    out of the room... RISC-V is just sucking what little they've left
    behind out of the room.

    As Concertina II is unlikely to please anyone but myself, I wish
    your MY 66000 all the luck in the world in overcoming this problem.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Thu Dec 21 07:12:41 2023
    Now I've simplified the format of the Composed Instructions, which
    allow instructions longer than 32 bits to appear in code without
    block headers.

    This freed up just enough opcode space that I could just barely
    add back into the instruction set a header format for reserving part
    of a block for pseudo-immediates with essentially zero overhead.

    I felt this feature was needed to make immediate values feel like
    a real part of the instruction set; if they always required a full
    32-bit header as overhead, there would be reluctance to use them.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Chris M. Thomasson on Thu Dec 21 08:58:10 2023
    On 21/12/2023 04:00, Chris M. Thomasson wrote:
    On 12/16/2023 7:28 AM, David Brown wrote:
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:

    I ask for PDF's.   I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    Try to stay far away from windows office docs, they can be filled
    with interesting macros, well back in the day! I do remember a lot of
    print to PDF programs. Mock up a printer device, print, produce a file.

    They are only a problem if you use MS Office.  LibreOffice, and its
    predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript.

    Indeed!


    It's not a problem if you use a decent pdf viewer, but if you use
    Adobe Acrobat on Windows, you can definitely be at risk.


    Well, just make sure the PDF reader has javascript turned off all
    around. Trust in it.

    "Trust in it" ?

    Some readers /are/ trustworthy. Adobe's are not - Acrobat reader has
    endless lists of security holes. I haven't had it installed on a PC for
    many years, so things may have changed, but in comparison to any other
    reader it was huge, slow, and required continuous upgrading to deal with vulnerabilities, requiring a reboot of Windows each time. Horrible
    software.

    On Linux, common readers like evince don't support javascript - you can
    trust them!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Quadibloc on Thu Dec 21 13:21:45 2023
    Quadibloc wrote:
    On Tue, 05 Dec 2023 11:07:09 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    [IBM Model 195]
    Its microarchitecture ended up being, in general terms, copied by the
    Pentium Pro and the Pentium II.

    Not really. The Models 91 and 195 only have OoO for FP, not for
    integers.

    As do the Pentium Pro and the Pentium II. (The Motorola 68060 did it

    Huh???

    I'm sure Andy Glew would disagree re the PPro!

    Terje
    the other way around, only having OoO for integers, and not for FP,
    figuring, I guess, that integers are used the most, so this would
    create better performance numbers.)

    John Savard



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Thu Dec 21 14:51:45 2023
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 12/16/2023 7:28 AM, David Brown wrote:
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:
    David Brown wrote:

    On 10/12/2023 11:39, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:

    Question (to everyone):: Has your word processor or spreadsheet
    added anything USEFUL TO YOU since 2000 ??

    In my case: Yes.

    Besides making many things worse, the new formula editor (since
    2010?) in Word is reasonable to work with, especially since it is
    possible to use LaTeX notation now (and thus it is now possible
    to paste from Maple).


    If I want to write something serious with formula, I use LaTeX.

    Previously, I actually wrote some reports in LaTeX, going to some >>>>>>> trouble to make them appear visually like the Word template du jour >>>>>>> (but the formulas gave it away, they looked to nice for Word).


    What a strange thing to do - that sounds completely backwards to me!
    I was happy when I had made a template for LibreOffice (it might have
    been one of the forks of OpenOffice, pre-LibreOffice) that looked
    similar to what I have for LaTeX.  Then I could make
    reasonable-looking
    documents for customers that insisted on having docx format instead
    of pdf.

    I don't think there has been much exciting or important (to me)
    added to
    word processors for decades.  Direct pdf generation was one, which
    probably existed in Star Office (the ancestor of OpenOffice /

    *.pdf arrives in Word ~2000 (maybe before).

    Are you sure about that?  IIRC it was a decade later before
    adobe wasn't required.

      <snip>

    I still require people sending me *.docx to convert it back to
    WORD2003 format *.doc and retransmitting it. It is surprising how
    many people don't know how to do that.

    I ask for PDF's.   I have no ability to read windows office formats
    of any type without using star/open/libre office, and I detest WYSIWYG
    word processors of all stripes.

    Try to stay far away from windows office docs, they can be filled with
    interesting macros, well back in the day! I do remember a lot of print
    to PDF programs. Mock up a printer device, print, produce a file.

    They are only a problem if you use MS Office.  LibreOffice, and its
    predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript.

    Indeed!

    Although my PDF reader ignores links and Javascript (xpdf),
    and I've yet to encounter a PDF that xpdf cannot read.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Thu Dec 21 14:52:32 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 21/12/2023 04:00, Chris M. Thomasson wrote:
    On 12/16/2023 7:28 AM, David Brown wrote:
    On 15/12/2023 03:59, Chris M. Thomasson wrote:
    On 12/14/2023 6:41 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup) writes:

    I ask for PDF's.   I have no ability to read windows office formats >>>>> of any type without using star/open/libre office, and I detest WYSIWYG >>>>> word processors of all stripes.

    Try to stay far away from windows office docs, they can be filled
    with interesting macros, well back in the day! I do remember a lot of
    print to PDF programs. Mock up a printer device, print, produce a file.
    They are only a problem if you use MS Office.  LibreOffice, and its
    predecessors, disable the macros by default.

    PDF also supports dangerous links and Javascript.

    Indeed!


    It's not a problem if you use a decent pdf viewer, but if you use
    Adobe Acrobat on Windows, you can definitely be at risk.


    Well, just make sure the PDF reader has javascript turned off all
    around. Trust in it.

    "Trust in it" ?

    Some readers /are/ trustworthy. Adobe's are not - Acrobat reader has
    endless lists of security holes. I haven't had it installed on a PC for
    many years, so things may have changed, but in comparison to any other
    reader it was huge, slow, and required continuous upgrading to deal with
    vulnerabilities, requiring a reboot of Windows each time. Horrible
    software.

    On Linux, common readers like evince don't support javascript - you can
    trust them!

    Although the evince UI is crap. I prefer xpdf.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Terje Mathisen on Thu Dec 21 16:25:59 2023
    On Thu, 21 Dec 2023 13:21:45 +0100, Terje Mathisen wrote:

    Huh???

    I'm sure Andy Glew would disagree re the PPro!

    I distinctly remember reading somewhere about the Pentium Pro, II, and
    the 68060, but Wikipedia doesn't back me up, so it's entirely possible
    that the one place where I read this - which I can't identify, not
    remembering what it was - was in error. Since this was the same as the
    360/91, naturally it was memorable to me, so I remembered that, and forgot anything contradicting it I might have read elsewhere.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Quadibloc on Mon Jan 1 12:07:00 2024
    Quadibloc <quadibloc@servername.invalid> writes:

    On Sun, 10 Dec 2023 10:51:03 -0800, Tim Rentsch wrote:

    Of course the PDP-8 is a RISC. These properties may have been common
    among some RISC processors, but they don't define what RISC is. RISC is
    a design philosophy, not any particular set of architectural features.

    I can't agree.

    Your final sentence may be true enough, but I think that the architectural feature of being a load-store architecture is very much indicative of
    whether the RISC design philosophy was being followed. Of course, it isn't absolutely _decisive_, as Concertina II demonstrates.

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is
    that the end result is consistent with that philosophy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Mon Jan 1 12:11:45 2024
    mitchalsup@aol.com (MitchAlsup) writes:

    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup) writes:

    Scott Lurndal wrote:

    The lack of general purpose registers doesn't disqualify it
    from the RISC label in my opinion.

    Then RISC is a meaningless term.

    PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    So what we can take from this is that RISC as a term has become meaningless.

    The term isn't meaningless. You yourself in another posting
    quoted the definitional property, and all I'm saying is that
    the PDP-8 is consistent with that original description.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Tim Rentsch on Tue Jan 2 04:03:30 2024
    On Mon, 01 Jan 2024 12:07:00 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is that the
    end result is consistent with that philosophy.

    It is true that the PDP-8 had a small and simple instruction set.

    Is it a load-store machine? Does it attempt to minimize
    communications with memory by having a large register file?

    Unfortunately, the designers of PDP-8 were working too soon
    to know that these things, and not just a small and simple
    instruction set, would be defining characteristics of RISC.

    But, hey, all the PDP-8's instructions were one 12-bit word
    long, so they got one thing right!

    Not only isn't the PDP-8 RISC, neither is the IBM 704 nor
    the SDS/Xerox Sigma series of computers (or the SDS 930,
    for that matter).

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood
    to be.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Jan 2 20:41:07 2024
    Paul A. Clayton wrote:

    [This is long and much less organized than I wished, but I feel
    rushed to get this written while I have a day off.]

    On 11/24/23 9:49 PM, BGB wrote:
    On 11/24/2023 12:24 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:
    [snip]
    I suspect you could write a multi-volume treatise on x86 about
    hardware-software interface design and management (including the
    social and economic considerations of project/product
    management).
    Ignoring human factors, including those outside the organization
    owning the interface, seems attractive to a certain engineering
    mindset but human factors are significant design considerations.

    It would be more beneficial to the world just to build an
    architecture
    without any of those flaws--just to show them how it's done.

    I thought My 66000 was very close to being completed (with
    refinements coming slowly and generally being compatible at the
    software level). Yes, there is lots of work getting the proof-of-
    concept more publicly recognized and lots of work exploring
    details of various implementations.

    That effort fell apart in August. I am waiting for market conditions,
    and a potential customer before restarting that effort.

    (Sadly, even if open source high-quality HDL implementations for a
    variety of interesting design points were published, My 66000
    seems unlikely to get much more adoption than Open RISC. Prophets
    have to speak, but people seem at least as likely to kill a
    prophet as to accept the prophet's message.)

    I wish there were world enough and time for everyone (especially
    experts) to publish their experience and wisdom and everyone to
    interact with that wisdom, but I can intellectually (if not
    emotionally) recognize that recording history is often not as
    critical as making history.

    That is the way of the world.

    People can probably debate what is ideal.

    Certainly. Yet there are different degrees of expertise. I believe
    I am more qualified to critique an ISA than even most computer
    programmers. Mitch Alsup (who has designed hardware for at least
    four ISAs — SPARC, x86, Motorola 88k, and an unspecified GPU
    architecture as well as done compiler and other related work) is
    more qualified to critique an ISA than most professional computer
    architects.

    There can also be different goals. Critiquing an ISA independent
    of its goals is unjust (except as a warning about goal
    constraints), but changing the goals to blunt criticism is no better.

    An ISA designed for teaching and research (the initial purpose of
    RISC-V) is unlikely to be excellent for general-purpose designs.
    Features which are elegant tend to be difficult to appreciate for
    new students; elegance involves complexity and synergy, which is
    more information even though it compresses nicely when the entire
    context is known.

    The Mona Lisa was not Leonardo's first painting.
    Nor did he finish it in a semester.....

    There seem to be people around who see RISC-V as the model of
    perfection.

    I would _like_ to think that all such people are noobs (or people
    who use "perfection" rather loosely). I doubt even Mitch Alsup
    considers My 66000 the model of perfection in ISA design, "merely"
    a model of unusual excellence superior to all other published ISAs
    for general purpose computing.

    The problem is that they do not see themselves as noobs.
    {I know, I did not feel like a noob when designing 88K}

    I tend to agree with Mitch though I am still skeptical about VVM
    and slightly skeptical about ESM. I trust a hardware designer to
    know that VVM is implementable with equivalent Power-Performance-
    Area even when I cannot see how, but I am not certain it addresses
    all the use cases of SIMD, specifically in-register blocking and isolated/limited SIMD use.

    There are certain things SIMD can do that VVM cannot, there are
    things VVM can do that SIMD has lots of trouble with. The biggest
    difference is SIMD can perform super linearly without being in a
    loop; on the other hand, VVM can change the width in calculations
    {byte operands, halfword results, or word operands, bytes out}--
    SIMD has problems here.
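
    A C loop of the width-changing kind described above (byte operands in,
    wider results out), which maps awkwardly onto fixed-width SIMD lanes
    but is just an ordinary loop to a loop-based scheme; the function is
    only an illustration:

        #include <stddef.h>
        #include <stdint.h>

        /* Byte operands in, halfword results out: each element widens, so
           a fixed-width SIMD register yields half as many results per
           operation as it has input lanes, while a loop-based scheme can
           simply run the scalar loop at whatever width the hardware picks. */
        void widen_add(const uint8_t *a, const uint8_t *b,
                       uint16_t *out, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                out[i] = (uint16_t)a[i] + (uint16_t)b[i];
        }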

    For ESM, I am not confident that idiom
    recognition will be cheap enough to avoid the need for special
    atomics (again I do basically trust a hardware designer's
    expertise) and I disagree mildly about the capacity guarantees. (I
    also disagree about the importance of reserving opcodes common for
    data as perpetually undefined to add a barrier to executing data.
    Since those opcodes could be reclaimed later if somehow opcode
    space became scarce, this is a rather trivial objection.)

    I don't see Idiom recognition in ESM. What I see is the C/C++ atomics
    have direct compiler sequences expressing ESM semantics. Programmer
    uses the language "intrinsics", compiler spits out ESM code, hardware
    sequences ESM code within the definition of ATOMICity.

    There might be a few areas where I think AArch64 may benefit from
    being less abstracted from the implementation. Load register pair
    seems a nice feature; My 66000 could provide such (and more) with
    idiom recognition (two or more loads or stores using the same base
    address register and slightly different offsets could avoid
    multiple accesses in many cases).
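
    A small C illustration of the pattern being described: two loads off
    the same base register at adjacent offsets, which idiom recognition
    could satisfy with a single wider access (the struct is just an
    example):

        #include <stdint.h>

        struct pair { int64_t lo, hi; };   /* two adjacent 8-byte fields */

        /* Compiles to two loads from the same base register at offsets 0
           and 8; a core recognizing the idiom could merge them into one
           16-byte access. */
        int64_t sum_pair(const struct pair *p)
        {
            return p->lo + p->hi;
        }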

    My 66000 has LDM (load multiple) and STM, but these are seldom used.
    What is used often is ENTER and EXIT as these provide prologue and
    epilogue sequences for non-leaf subroutines. Not just storing/loading
    registers to/from the stack, but dealing with SP and <optional> FP manipulations.

    I do not have a good sense of when idiom recognition should be
    preferred over "explicit" encoding. Both introduce complexity in
    hardware and compilers. For idiom recognition, an optimizing
    compiler adds another consideration for scheduling code and
    sometimes choosing whether to do more conceptual work that is
    faster (and sometimes less actual work by the hardware) with
    uncertainty about the performance impact for different
    implementations; high-performance hardware becomes more complex in
    having to recognize the idiom and convert it to the
    microarchitecture's functional support. Idiom recognition also
    has a code density cost. However, simple but complete
    implementations are simpler (and not subsetted) than for explicit instructions, some idioms appear without explicit compiler
    intention often enough to justify special handling, and handling
    such in microarchitecture reduces the interface complexity.

    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instructions pass through the pipeline
    as if they were a single instruction, from a single DECODE cycle
    through a single WRITE cycle).

    And one case of register write elision:
    Calk Rd,--,-- ; Calk Rd,--,--
    Which frees up the write port of the register file for latent STs.
    Write elision is determined in the WAIT stage of the pipeline just
    before WRITE stage.

    For explicit instructions, a compiler need not use them (in which
    case they are useless frills wasting opcode, decoder, and backend
    resources) or even know they exist,

    so why have them ??

    but an optimizing compiler
    would have to try to recognize idioms to convert to special
    instructions. Complete hardware (compatibility issues) must pay
    the costs to implement the special instructions. Minor variations
    of special instructions that are discovered to be common or useful
    require hardware idiom-recognition to convert _and_ compilers are
    unlikely to have made any effort to facilitate idiom recognition
    and it is more difficult to justify the hardware effort for less
    common cases.

    As with constants, it is easier for the compiler just to spit out
    reasonable code and have HW treat 2-instructions as 1. Since the
    architecture is already multiple words per cycle even in a 1-wide
    machine, idiom recognition is just a few more patterns.

    Some of the AArch64 conditional instructions seem clever in
    exploiting the number of variable operands. My 66000's PRED
    provides **MUCH** more flexibility, though at the cost of hardware
    complexity and code density.

    The thing about PRED is that it transfers control without disrupting
    FETCH !! It is worth putting in for this reason alone, even if only
    20% of conditional branches use it.
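
    For context, the kind of short conditional that predication can cover
    straight-line, sketched in C; how any particular compiler maps it onto
    PRED is not claimed here:

        #include <stdint.h>

        /* A short if/else like this can be fetched straight through: the
           predicate selects which arm's instructions take effect, so FETCH
           never has to redirect for the condition. */
        int64_t select_scaled(int64_t x)
        {
            int64_t r;
            if (x < 0)
                r = -x;        /* then-arm */
            else
                r = 2 * x;     /* else-arm */
            return r;
        }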

    I agree with Mitch Alsup that having to paste constants together
    in software (or load them as if variable data) is suboptimal
    generally. (There may be some cases where the importance of static
    text size [or working set] justifies the extra effort of a level
    of indirection, but such would generally seem to be a performance
    loser.)

    Suboptimal is a vast understatement !!

    I disagree, where some things seem to be corner cutting in areas
    where doing so is a foot gun, and other areas being needlessly
    expensive (and some things in the reaches of "extensions land"
    being just kinda absurd).
    In some ways, it is (as I see it) better to define some things and
    leave them as optional, rather than define little, and leave
    everyone else to make an incoherent mess of things.

    One of the benefits of such is being able to approach elegance;
    nonce extensions have difficulty appropriating synergy.

    I do not really understand the hostility to subsetting.

    I do not mind subsetting at the implementation level.
    I do mind subsetting at the architectural level.

    RISC-V chose to do the contrapositive--define as little as one can
    get away with and let everyone invent their own additions--you end
    up with a mishmash of additions that fit no overall pattern and have
    no <well> elegance.

    Then again, likely there is disagreements as to what sorts of
    features seem meaningful, wasteful, or needless extravagance.

    This is as it should be. Special purpose or experimental features
    should be viewed as "wasteful" when the target of those features
    is not shared. The contention also concerns the limited space for standardized extensions within a single encoding space.

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If
    you dropped the compressed instructions, I can fit the entire
    My 66000 ISA into the vacated space.....

    Standardized extensions can avoid redundant effort and some
    incompatibility, but without modes to break-up the encoding space
    the more extensions means less free encoding space.

    Extensions need to be added in such a way that the extension remains
    compatible with many unstated things WRT ISA encoding. My 66000 ISA
    encoding has a property that a 40-gate logic block (4-gates of delay)
    can parse instruction boundaries and create unary pointers into IB
    for the next instruction and the constants. You can't just willy
    nilly add instructions without knowing about this decoding logic
    block.

    This also introduces the argument about extensions, coprocessors,
    and accelerators. Accelerators are obviously least tied to the ISA
    interface, but changing an accelerator can be effectively as
    incompatible as an ISA change. (Of course, microarchitecture
    changes can break software performance.)

    RISC-V's early encoding choices were probably quite suitable for a
    teaching and research RISC ISA. Research would emphasize easy
    extensibility for isolated efforts (VLE and lots of unassigned
    opcode space facilitates such). Compatibility is something of an anti-consideration; researchers should be free to add any
    functionality they wish without consideration to an "ecosystem".

    Agreed

    The commercial interest in open source implementations and even
    just license-free ISA use changed the goals. This interest
    expanded such that people were considering the possibility of
    competing with ARM not just in the microcontroller area but more
    generally.

    We shall see.

    Expanded interest also exposed weakness in organization.
    Commercial interests wanted closed-door meetings, open systems
    people wanted public information. The "prestige" of a _standard_
    extension motivates standardizing more localized extensions, the
    limited extension space motivates rushing to stake claims, the
    increased value of the opcode space encourages conflict.

    And people wonder why I am doing all My 66000 architecture and
    µArchitecture by myself. {{Like Leonardo asking a local painter
    to finish the Mona Lisa.}}

    Some
    think idiom recognition is so cheap that the bar for new
    instructions should be high, some think the flexibility of RISC-V
    encoding should make the bar low. Some think only "simple"
    instructions should be provided, some think complex instructions
    can easily be justified. The founders seem to have been,
    understandably, unprepared to handle the volume of conflict
    resolution involved.

    In my work, code path length (roughly number of instructions) and
    the frequency of operation govern performance. My 66000 has a path
    length similar to VAX and a pipelineability similar to MIPS. RISC-V
    has a path length == MIPS and pipelineability == MIPS. Thus, My 66000
    needs only 70% the number of instructions RISC-V needs.

    Granted, it does seem like x86 probably needs to be retired at
    some point...

    Nah. Intel has already proposed expanding the register count to 32
    and possibly simplifying some of the architecture (mostly system-
    level aspects, I think).

    Toss the various descriptor tables.

    Adding yet another encoding that retained the architectural
    features is another possibility, but I doubt Intel/AMD would move
    to such an encoding. The value of x86 is primarily legacy
    software. Providing a cleaner encoding hints that legacy software
    support might be dropped ("why add a radically different encoding
    if there is not the intent to drop support for the legacy
    encoding?"). That fear would reduce the value of legacy binary
    support, increasing the relative attractiveness of ARM or other
    alternatives.

    I do not see any hope for ISA excellence.

    Somedays I agree with you.

    But, realistically, what is a retired computer architect to do ??
    Take up gardening ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Wed Jan 3 02:10:27 2024
    On Fri, 24 Nov 2023 20:49:57 -0600, BGB wrote:

    Granted, it does seem like x86 probably needs to be retired at some
    point...

    While in a certain sense, this is an undoubtedly true statement,
    my initial reaction to it was of the ROTFL nature.

    There's so much software out there that is distributed only in
    binary form that runs only on x86 that retiring the x86 by fiat
    while it's still so actively in use just won't happen, no matter
    how bad it may be.

    This is why I miss the 680x0 architecture so much. If that were
    still out there as an active competitor to x86, then because this
    alternative _is_ something better... even if not _radically_
    better, since it's still CISC, not something really different like
    RISC... then I could envisage, over time, the market for it gradually
    growing while that for the x86 gradually shrinks.

    At the present time, RISC-V and ARM are the contenders. Microsoft has
    a version of Windows that runs on ARM. Apple now uses ARM processors
    in its current Macintosh computers, and is claiming that their
    performance is superior to x86 processors.

    Right now, though, there's no real motive for people to go from x86
    to ARM.

    In time, something surely will happen to change matters, and new
    computer architectures will rise up to prominence. Right now, though,
    signs of movement away from x86 to something else are few.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Wed Jan 3 02:47:11 2024
    Quadibloc wrote:

    On Fri, 24 Nov 2023 20:49:57 -0600, BGB wrote:

    Granted, it does seem like x86 probably needs to be retired at some
    point...

    While in a certain sense, this is an undoubtedly true statement,
    my initial reaction to it was of the ROTFL nature.

    There's so much software out there that is distributed only in
    binary form that runs only on x86 that retiring the x86 by fiat
    while it's still so actively in use just won't happen, no matter
    how bad it may be.

    This is why I miss the 680x0 architecture so much. If that were
    still out there as an active competitor to x86, then because this
    alternative _is_ something better... even if not _radically_
    better, since it's still CISC, not something really different like
    RISC... then I could envisage, over time, the market for it gradually
    growing while that for the x86 gradually shrinks.

    At the present time, RISC-V and ARM are the contenders. Microsoft has
    a version of Windows that runs on ARM. Apple now uses ARM processors
    in its current Macintosh computers, and is claiming that their
    performance is superior to x86 processors.

    Technically, Apple uses its own processors under ISA license from ARM.

    Right now, though, there's no real motive for people to go from x86
    to ARM.

    You know, as much as I hate Intel and x86, I hate Apple even more.

    In time, something surely will happen to change matters, and new
    computer architectures will rise up to prominence. Right now, though,
    signs of movement away from x86 to something else are few.

    The movement is towards mobile {cell phones and tablets} and away
    from desktops. Thus, there are more ARM cores sold per year than x86s.
    But (crap) you cannot do large engineering on tablets or cells.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Wed Jan 3 07:29:52 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    There's so much software out there that is distributed only in
    binary form that runs only on x86 that retiring the x86 by fiat
    while it's still so actively in use just won't happen, no matter
    how bad it may be.

    Well, Apple succeeded, for the computers they produce.

    Still, keep that thought in mind.

    This is why I miss the 680x0 architecture so much. If that were
    still out there as an active competitor to x86, then because this
    alternative _is_ something better...

    The 68000 is worse than IA-32, because it does not have
    general-purpose registers, while IA-32 does. And the 68000 then grew
    baroque extensions in the 68020, at a time when the rest of the world
    already knew that such things are more hindrance than help. And the
    hindrance showed, when the 68040 and 68060 took longer than Intel's counterparts, and much longer than the competing RISCs: The two-wide
    50MHz 68060 appeared in the same year as the 4-wide 266MHz 21164.

    even if not _radically_
    better, since it's still CISC, not something really different like
    RISC... then I could envisage, over time, the market for it gradually
    growing while that for the x86 gradually shrinks.

    It did not happen in the 1980s when 68000 was strong, why should it
    happen later? People even did not switch away from 8086/IA-32 when
    RISCs outperformed them by a lot, because, as you write above, it's
    about the software distributed in binary form. The users don't care
    whether the binary contains IA-32/AMD64 instructions, 68000, PowerPC,
    ARM A64, RV64GC, IA-64 or whatever else.

    A software ecosystem with a single controlling instance like the MacOS ecosystem can switch architectures, one without won't. The only
    opportunities for retiring AMD64 are if PCs are replaced by something
    else (mobile phones and tablets settled on ARM A32/T32 and A64), or
    when the address width becomes insufficient; Intel tried to use the
    latter event to replace IA-32 with IA-64, but failed (probably mostly
    because the necessary IA-32 emulation was not fast enough). We will
    see whether the address space of AMD64 ever becomes too small for the
    mass market.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Paul A. Clayton on Wed Jan 3 14:38:45 2024
    On Mon, 01 Jan 2024 15:12:40 -0500, Paul A. Clayton wrote:
    On 11/24/23 9:49 PM, BGB wrote:
    On 11/24/2023 12:24 PM, MitchAlsup wrote:

    There seem to be people around who see RISC-V as the model of
    perfection.

    I would _like_ to think that all such people are noobs (or people who
    use "perfection" rather loosely).

    Given that David Patterson, one of the designers of MIPS, was on the
    RISC-V design team, though, I can quite understand if many people
    expect the RISC-V design to be a paragon of excellence - even before
    they had looked at it.

    I doubt even Mitch Alsup considers My
    66000 the model of perfection in ISA design, "merely"
    a model of unusual excellence superior to all other published ISAs for general purpose computing.

    This makes it sound as though he lacks modesty, but actually, that no
    doubt _is_ a factual categorization of what MY 66000 is.

    I do not see any hope for ISA excellence.

    Why? MY 66000 exists, and it is excellent.

    If you mean no hope in it taking over the market... yes, I think that
    x86 and ARM will dominate for a considerable time to come. Unlike x86,
    though, I would assume that ARM is at least passable; as a commercial
    RISC, it isn't as lacking in code density as RISC-V.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Wed Jan 3 14:26:52 2024
    On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have general-purpose registers, while IA-32 does. And the 68000 then grew baroque extensions
    in the 68020, at a time when the rest of the world already knew that
    such things are more hindrance than help.

    It is true that in addition to eight general registers, the 68000 also
    had eight _address_ registers. But in the addressing modes that used
    address registers, there was a bit to use a general register instead,
    so I don't think one can say that the 68000 didn't have general registers.

    Since base registers don't have their values changed frequently, having
    an additional register bank for base registers increases the supply of registers for all other purposes, so I don't think that's such a bad idea.
    I did the same thing in my original Concertina design.

    As for the 68020: with the 68000, the only address mode that let you form addresses by adding the contents of two registers to a displacement had a displacement of eight bits. The 68020 let you use a 16-bit displacement
    in that mode. Since base-index addressing is so fundamental to accessing arrays, I think that the 68020 added at least _one_ thing that was
    essential rather than superfluous.

    However, instructions in that mode took up three 16-bit words, so I won't
    argue against the claim that the 68000 and 68020 also had a lot of
    addressing modes that _weren't_ needed. In order to have 16-bit
    displacements instead of 12-bit ones, with 3-bit register fields instead
    of 4-bit ones, so following the 68000 instead of System/360, I made the
    format of memory reference instructions in the original Concertina this:

    opcode (7 bits)
    destination register (3 bits)
    index register (3 bits)
    base register (3 bits)

    The destination register could be any of the eight general registers.

    The index register could be general register 1 to 7; 0 in the field means
    no indexing.

    The base register could be base register 1 to 7; if 0 is in that field,
    then the "index register" field becomes the "source register" field,
    and the instruction is a 16-bit long register-to-register instruction.
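
    A minimal sketch in C of pulling those fields out of the 16-bit
    halfword; the ordering of the fields from the most significant bit
    down, and the assumption that a 16-bit displacement halfword follows
    for memory references, are for illustration only:

        #include <stdint.h>

        /* opcode(7) | dest(3) | index(3) | base(3), assumed packed from
           the most significant bit downward.  base == 0 means the
           instruction is the 16-bit register-to-register form; otherwise
           a 16-bit displacement would follow. */
        struct ct_mem {
            unsigned opcode, dest, index, base;
        };

        struct ct_mem decode_halfword(uint16_t insn)
        {
            struct ct_mem d;
            d.opcode = (insn >> 9) & 0x7F;
            d.dest   = (insn >> 6) & 0x07;
            d.index  = (insn >> 3) & 0x07;
            d.base   =  insn       & 0x07;
            return d;
        }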

    My goal was to combine the best of the System/360 and the 68000 in a
    single architecture - but then I switched to including every feature
    but the kitchen sink, so as to give me an opportunity to explain how
    they all worked.

    Since my base registers could not be used as index registers, they
    weren't the same as the address registers of the 68000.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Wed Jan 3 14:41:51 2024
    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit then entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of Concertina
    II, and that turned out to be too much, and I had to drop them.

    But then, RISC-V was designed with little or no regard for code density,
    while code density has been one of my foremost considerations in the design
    of Concertina II, so this is hardly a fair comparison.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Wed Jan 3 15:17:16 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have general-purpose
    registers, while IA-32 does. And the 68000 then grew baroque extensions
    in the 68020, at a time when the rest of the world already knew that
    such things are more hindrance than help.

    It is true that in addition to eight general registers, the 68000 also
    had eight _address_ registers. But in the addressing modes that used
    address registers, there was a bit to use a general register instead,
    so I don't think one can say that the 68000 didn't have general registers.

    The 68000 has 8 address registers and 8 data registers. Motorola say
    so themselves. It has no general-purpose registers. You may wish
    that the data registers could be used as GPRs by there being an
    addressing mode "(Dn)", but neither the 68000 nor the 68020 have such
    an addressing mode. I know, because I tried to code things in 68000
    assembly where I first used some instruction that produces the result
    in a data register, and wanted to use the result as address; this is
    only possible by first moving the result to an address register.

    You may not find my memory trustworthy, so look yourself at <https://en.wikibooks.org/wiki/68000_Assembly/Addressing_Modes> and
    search for the non-existent (Dn) addressing mode. This page includes
    the 68020 addressing modes; they added all kinds of baroque stuff, but
    not (Dn).

    As for the 68020: with the 68000, the only address mode that let you form
    addresses by adding the contents of two registers to a displacement had a
    displacement of eight bits. The 68020 let you use a 16-bit displacement
    in that mode. Since base-index addressing is so fundamental to accessing
    arrays, I think that the 68020 added at least _one_ thing that was
    essential rather than superfluous.

    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119

    So 124 occurrences of displacements that don't fit into unsigned 8
    bits, and 119 that fit into unsigned 8 bits, but not into signed 8
    bits, a total of less than 0.1% of the static instructions. And yes,
    counting with

    [~:145991] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[-]0x[0-9a-f]*[(]%[^)]*,'|sed 's/.*-0x/-0x/'|sed 's/(.*$//'|wc -l
    1202

    there are 1202 occurrences of negative displacements, so making
    displacement a signed number is more valuable than fitting the values
    in the range 128..255 into the displacement.

    But sure, making the displacement longer is not a major problem of the
    68020; the question is still whether they added more complication
    than benefit. And given that the benefit is tiny, the answer is
    probably yes.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Quadibloc on Wed Jan 3 11:18:10 2024
    Quadibloc wrote:

    Given that David Patterson, one of the designers of MIPS, was on the
    RISC-V design team, though, I can quite understand if many people
    expect the RISC-V design to be a paragon of excellence - even before
    they had looked at it.

    Hennessy was Stanford MIPS in 1981, Patterson was RISC-1 at Berkeley in 1981.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Wed Jan 3 16:27:04 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Wed Jan 3 16:51:23 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    Ok, so I measured the main firefox binary (firefox puts a lot of stuff
    in shared libraries, so the main binary contains only a part of the
    code):

    [~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
    129114
    [~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    134

    The next part was a copy-paste error. Here's the correct number:

    [~:146002] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    12

    At least for Firefox your explanation with the larger structures does
    not seem to hold. Looking at the larger displacements, many don't
    seem to be due to field offsets:

    [~:146004] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |uniq -c
    1 0x1000
    7 0x10000
    2 0x1010
    8 0x180
    2 0x1c0
    2 0x2000
    6 0x20000
    3 0x280
    2 0x2b0
    1 0x30b
    4 0x320
    20 0x359d3e2a
    8 0x380
    1 0x4d0
    20 0x5a827999
    6 0x600
    20 0x6ed9eba1
    20 0x70e44324
    1 0x8000000

    Anyway, in the Firefox binary slightly more than 0.1% of the
    instructions have offsets outside the signed 8-bit range. Still does
    not seem essential to me.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Wed Jan 3 16:42:02 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    Ok, so I measured the main firefox binary (firefox puts a lot of stuff
    in shared libraries, so the main binary contains only a part of the
    code):

    [~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
    129114
    [~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    134
    [~:146001] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119

    So in this binary 0.2% of the instructions have displacements that do
    not fit into a signed 8 bits. Essential?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Jan 3 17:07:43 2024
    Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have
    general-purpose registers, while IA-32 does. And the 68000 then grew
    baroque extensions in the 68020, at a time when the rest of the world
    already knew that such things are more hindrance than help. And the hindrance showed, when the 68040 and 68060 took longer than Intel's counterparts, and much longer than the competing RISCs: The two-wide
    50MHz 68060 appeared in the same year as the 4-wide 266MHz 21164.

    Architecture is as much about what to leave out as what to put in.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Wed Jan 3 17:06:25 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    Ok, so I measured the main firefox binary (firefox puts a lot of stuff
    in shared libraries, so the main binary contains only a part of the
    code):

    [~:145999] objdump -d /usr/lib/firefox-esr/firefox-esr|wc -l
    129114
    [~:146000] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    134

    The next part was a copy-paste error. Here's the correct number:

    [~:146002] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    12

    At least for Firefox your explanation with the larger structures does
    not seem to hold. Looking at the larger displacements, many don't
    seem to be due to field offsets:

    [~:146004] objdump -d /usr/lib/firefox-esr/firefox-esr|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |uniq -c
    1 0x1000
    7 0x10000
    2 0x1010
    8 0x180
    2 0x1c0
    2 0x2000
    6 0x20000
    3 0x280
    2 0x2b0
    1 0x30b
    4 0x320
    20 0x359d3e2a
    8 0x380
    1 0x4d0
    20 0x5a827999
    6 0x600
    20 0x6ed9eba1
    20 0x70e44324
    1 0x8000000

    Anyway, in the Firefox binary slightly more than 0.1% of the
    instructions have offsets outside the signed 8-bit range. Still does
    not seem essential to me.


    I'm not sure you are picking up all the offsets with your grep.

    For one of my applications:

    5 0x100
    3 0x110
    5 0x1170
    4 0x160
    3 0x16b0
    14 0x170
    5 0x1720
    1 0x1723
    5 0x1724
    5 0x1728
    6 0x18a0
    1 0x198
    13 0x1a0
    3 0x1b0
    3 0x1d0
    5 0x200
    2 0x230
    20 0x28f8
    20 0x2900
    3 0x2f0
    8 0x3308
    4 0x350
    2 0x3528
    5 0x3748
    2 0x40a0
    5 0x54ed6
    2 0x54ef48
    9 0x5559f0
    5 0x555a41
    5 0x55f50
    2 0x800
    1 0x8b0
    1 0x9a0
    1 0xe6438
    1 0xe6590
    1 0xe728c
    1 0xe7668
    6 0xe7990
    3 0xe7a60

    232294: 48 89 85 08 d7 ff ff mov %rax,-0x28f8(%rbp)
    23229b: 48 8d 95 08 d7 ff ff lea -0x28f8(%rbp),%rdx
    2322a2: 48 8b 85 70 cd ff ff mov -0x3290(%rbp),%rax

    Why isn't 0x3290 in the output of the grep?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Jan 3 17:05:59 2024
    BGB wrote:

    On 1/1/2024 2:12 PM, Paul A. Clayton wrote:


    I wish there were world enough and time for everyone (especially
    experts) to publish their experience and wisdom and everyone to
    interact with that wisdom, but I can intellectually (if not
    emotionally) recognize that recording history is often not as
    critical as making history.


    Might make sense if Mitch put his specifications and other stuff up on
    GitHub or something?...

    If someone could explain how I could do this, I would.

    At least, assuming it is meant to be open.

    Then, it becomes something one can look at, at their leisure.




    My main concern with PRED is that it seems like it will involve some
    amount of implicit architectural state which would need to be dealt with somehow in interrupt handlers (and "pipeline state" is extra hairy).

    PRED state is 8-bits in thread-header.

    Well, and also "make hardware do all of this stuff" isn't really part of
    my philosophy. Or, effectively, any state that may exist, the interrupt handler needs to make sure it can save/restore it correctly.

    Note: HW is responsible for saving and restoring state in My 66000, not SW.

    I agree with Mitch Alsup that having to paste constants together
    in software (or load them as if variable data) is suboptimal
    generally. (There may be some cases where the importance of static
    text size [or working set] justifies the extra effort of a level
    of indirection, but such would generally seem to be a performance
    loser.)


    Yeah, I can also agree with this.

    Though it seems a point of disagreement that I consider jumbo-prefixes
    and (occasionally) dropping constants into temporary registers, to be acceptable.

    The jumbo-prefix scheme does effectively still break the constant into pieces, but, at least all the pieces get reassembled within a single clock-cycle (unlike the multi-instruction case).

    Does still have the annoyance of needing to have relocs for these cases
    (and it is also desirable to try to limit the number of reloc types).


    I disagree, where some things seem to be corner cutting in areas
    where doing so is a foot gun, and other areas being needlessly
    expensive (and some things in the reaches of "extensions land"
    being just kinda absurd).
    In some ways, it is (as I see it) better to define some things and
    leave them as optional, rather than define little, and leave
    everyone else to make an incoherent mess of things.

    One of the benefits of such is being able to approach elegance;
    nonce extensions have difficulty appropriating synergy.

    I do not really understand the hostility to subsetting.


    Yeah.

    Though, I sometimes wonder if defining everything up-front, and then
    allowing for implementations to use subsets, may make the ISA spec seem
    more threatening.

    This is my plan ! And it makes the ISA way cleaner than "anyone can add an extension" RISC-V model.

    Say, "Look at all this stuff, all this complexity", when someone doing a minimal implementation can safely ignore "most of it".

    As long as you don't violate the ISA specs of the things you implement
    you are OK.

    Then again, likely there is disagreements as to what sorts of
    features seem meaningful, wasteful, or needless extravagance.

    This is as it should be. Special purpose or experimental features
    should be viewed as "wasteful" when the target of those features
    is not shared. The contention also concerns the limited space for
    standardized extensions within a single encoding space.
    Standardized extensions can avoid redundant effort and some
    incompatibility, but without modes to break-up the encoding space
    the more extensions there are, the less free encoding space remains.

    This also introduces the argument about extensions, coprocessors,
    and accelerators. Accelerators are obviously least tied to the ISA
    interface, but changing an accelerator can be effectively as
    incompatible as an ISA change. (Of course, microarchitecture
    changes can break software performance.)


    Yeah.

    Then there may also be things like putting devices in MMIO, but then
    needing some way to detect if the device, or certain functionality is present.

    Options like, "well, write this magic bit pattern to this MMIO register,
    read it back, and see how the bits are set" is a little tacky.

    Cores are devices and have a configuration page in configuration space
    you can directly read core capabilities from here. L2s are similar.
    So, CPUID is merely a LD to config space.
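    A minimal C sketch of that idea follows; the base address, register
    offsets, and field names are invented for illustration and are not the
    actual My 66000 configuration-page layout.

        #include <stdint.h>

        /* Hypothetical configuration page for one core; the address and
           offsets below are made up for the example. */
        #define CORE_CFG_BASE  0xFFFF0000UL   /* assumed config-space address */
        #define CFG_CORE_ID    0x00           /* vendor/model identification  */
        #define CFG_CORE_CAPS  0x08           /* capability bits              */

        static inline uint64_t cfg_read(uintptr_t base, uintptr_t off)
        {
            /* "CPUID" is just an ordinary load from the configuration page. */
            return *(volatile uint64_t *)(base + off);
        }

        int core_has_feature(uint64_t feature_bit)
        {
            return (cfg_read(CORE_CFG_BASE, CFG_CORE_CAPS) & feature_bit) != 0;
        }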

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Wed Jan 3 17:15:46 2024
    EricP wrote:

    Quadibloc wrote:

    Given that David Patterson, one of the designers of MIPS, was on the
    RISC-V design team, though, I can quite understand if many people
    expect the RISC-V design to be a paragon of excellence - even before
    they had looked at it.

    Hennessy was Stanford MIPS in 1981, Patterson was RISC-1 at Berkeley in 1981.

    Stanford MIPS became MIPS the company
    Berkeley RISC-1 became Sun Microsystems and named SPARC

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Wed Jan 3 17:16:29 2024
    Scott Lurndal wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Looking at a glibc-2.31 AMD64 binary:

    [~:145982] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|wc -l
    341019
    [~:145984] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x...'|sort |wc -l
    124
    [~:145986] objdump -d /lib/x86_64-linux-gnu/libc-2.31.so|grep '[0-9a-f][(]%[^)]*,'|sed 's/.*0x/0x/'|sed 's/(.*$//'|grep '0x[89a-f].$'|sort|wc -l
    119


    I'm not sure glibc is a representative sample. It's far more likely
    for application code to have structures larger than 128 bytes.

    You might try EMBench.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Wed Jan 3 17:42:22 2024
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Wed, 03 Jan 2024 07:29:52 +0000, Anton Ertl wrote:

    The 68000 is worse than IA-32, because it does not have general-purpose
    registers, while IA-32 does. And the 68000 then grew baroque extensions
    in the 68020, at a time when the rest of the world already knew that
    such things are more hindrance than help.

    It is true that in addition to eight general registers, the 68000 also
    had eight _address_ registers. But in the addressing modes that used
    address registers, there was a bit to use a general register instead,
    so I don't think one can say that the 68000 didn't have general registers.

    That used the _contents_ of the register, not where it was pointing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Wed Jan 3 17:44:57 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Quadibloc <quadibloc@servername.invalid> writes:


    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?
    ...
    232294: 48 89 85 08 d7 ff ff mov %rax,-0x28f8(%rbp)
    23229b: 48 8d 95 08 d7 ff ff lea -0x28f8(%rbp),%rdx
    2322a2: 48 8b 85 70 cd ff ff mov -0x3290(%rbp),%rax

    Why isn't 0x3290 in the output of the grep?

    Because the grep is intended to pick up only reg+reg+disp addressing
    (with optional scaling), not reg+disp addressing. So it works as
    intended.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Wed Jan 3 18:58:20 2024
    On Wed, 03 Jan 2024 15:17:16 +0000, Anton Ertl wrote:

    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Every time I access an array element!

    Because presumably the array will be somewhere in a 64K byte chunk of
    memory with an associated USING statement, so I need base register + 16
    bit displacement to specify the start of the array, and an index register
    to point to the element within the array.

    Otherwise, I would need to use an extra instruction prior to the array
    access to add two things together, and put the result in an index
    register.

    As for your memory: another post here explained what I missed. The bit
    which I thought indicated using a data register used its contents.

    However, that doesn't make sense to me for an instruction which also has
    a *displacement*, since then the displacement must be ignored. Unless
    it's an immediate add to the value...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Wed Jan 3 20:55:23 2024
    BGB wrote:

    On 1/3/2024 11:05 AM, MitchAlsup wrote:


    My main concern with PRED is that it seems like it will involve some
    amount of implicit architectural state which would need to be dealt
    with somehow in interrupt handlers (and "pipeline state" is extra hairy).
    PRED state is 8-bits in thread-header.


    Yeah, but presumably it is a mask that shifts 1 bit per every
    instruction in the pipeline. If an interrupt occurs, then whatever state
    gets captured needs to be correct WRT the pipeline stage that the
    interrupt is captured off of.
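
    As a rough model of the presumed behaviour, in C: the low bit of an
    8-bit mask gates each following instruction, and the mask itself is
    saved and restored alongside the PC. The encoding and field names here
    are assumptions, not the actual definition.

        #include <stdbool.h>
        #include <stdint.h>

        /* Assumed saved state: the 8-bit predicate mask travels with the PC,
           so an interrupt between predicated instructions loses nothing. */
        struct saved_state {
            uint64_t pc;
            uint8_t  pred;   /* one bit per following instruction: 1 = execute */
        };

        /* Called once per instruction in the predicated region. */
        static bool pred_take(struct saved_state *ss)
        {
            bool execute = ss->pred & 1;   /* does this instruction run?   */
            ss->pred >>= 1;                /* consume one bit of the mask  */
            return execute;
        }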

    Granted, I guess this isn't really too much different (in premise) than needing to get PC / SR / registers into a coherent state.

    It is no harder than getting IP correct at an interrupt.

    Well, and also "make hardware do all of this stuff" isn't really part
    of my philosophy. Or, effectively, any state that may exist, the
    interrupt handler needs to make sure it can save/restore it correctly.

    Note: HW is responsible for saving and restoring state in My 66000, not SW.

    I did it full software in my case, but mostly to try to save cost on a mechanism that is used comparably infrequently.

    I used the same mechanism for prologue and epilogue sequences, so it gets
    used often.

    Like, need to try to find the cheapest possible mechanism that still
    allows state to be saved/restored well enough that the program doesn't
    just explode whenever an interrupt occurs.

    Doing it in HW eliminates the need for "a couple of" control registers
    to access the "stack" when control arrives at exception or interrupt dispatcher. But realistically, thread-state is 5 cache lines of thread
    specific data with a known thread-specific virtual address--so this all
    looks like a cache with 5-contiguous lines of state which one can
    "remember" with a single physical address.......


    Though, I sometimes wonder if defining everything up-front, and then
    allowing for implementations to use subsets, may make the ISA spec
    seem more threatening.

    This is my plan ! And it makes the ISA way cleaner than "anyone can add
    an extension" RISC-V model.


    Yeah.

    Consistency, with the tradeoff that people now have to see a full ISA spec,
    rather than say:
    Integer ISA spec;
    FPU ISA spec;
    Privileged Mode spec;
    ...

    All as separate specification documents.

    I have an ISA specification document, how unprivileged SW uses ISA as a document, and how privileged SW uses ISA as a document; all with cross
    document pointers. Having separate documents allows the non-proprietary
    ISA to be distributed, allowing full access to the ISA but no knowledge of
    privileged state. {{There are no privileged instructions, but there
    is privileged state.}} I still have the privileged document under NDA.


    Options like, "well, write this magic bit pattern to this MMIO
    register, read it back, and see how the bits are set" is a little tacky.

    Cores are devices and have a configuration page in configuration space
    you can directly read core capabilities from here. L2s are similar.
    So, CPUID is merely a LD to config space.

    Each "block" around the chip contains 8 performance counters, and other
    control registers. The counters can be sampled en masse using LDM and
    reset en masse using STM. So, one has 8 performance counters in each of
    {CPU, L2, interconnect, L3, DRAM, Hostbridge, IOMMU}.
    The high resolution counter/timer is one of these counters.



    Traditional way configuration worked as I understood it on older systems
    was say:
    Attempt a read access to an I/O page, if read returns device is present
    if read times out no device
    Read kind, vendor, and device from IO page.
    Use these to access driver from table.
    then::
    Write values to IO ports;
    Read values back;
    See if response is what is expected (say, if you only get 00 or FF,
    assume hardware is absent or doesn't work);
    Hope that some other hardware isn't at that address which totally owns
    the PC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Wed Jan 3 22:42:17 2024
    Chris M. Thomasson wrote:

    On 1/3/2024 10:58 AM, Quadibloc wrote:
    On Wed, 03 Jan 2024 15:17:16 +0000, Anton Ertl wrote:

    Essential? How often do you use a reg+reg+disp addressing mode where
    the displacement does not fit in 8 bits?

    Compilers can go through multiple clever steps to hoist indexing out of
    loops {consuming registers} and get the need down under about 5%. However,
    if you have [base+index<<scale+displacement] it ends up getting used around
    8% of the time.


    Every time I access an array element!

    Because presumably the array will be somewhere in a 64K byte chunk of
    memory with an associated USING statement, so I need base register + 16
    bit displacement to specify the start of the array, and an index register
    to point to the element within the array.

    In My 66000 case, when you use scaled indexing, you have access to 32-bit
    and 64-bit displacements. So
    LDD R7,[R19,R5<<3]
    is 1 word, but:
    LDD R7,[R19,R5<<3,DISP32]
    is 2 words and 1 instruction, and:
    LDD R7,[R19,R5<<3,DISP64]
    is 3 words and 1 instruction.

    You also have the ability to do::
    STD #3.1415926535892145,[SP,16]
    as a 3 word instruction that stores 2 words on the stack as a single instruction. This form is used a lot, so while it is not "indexing"
    it is highly useful.

    Not sure if this is relevant. If the 64K byte chunk was aligned on a 64K
    byte boundary, then we can round a pointer to somewhere in the chunk
    down to the nearest 64K byte boundary. This gives us a pointer to the beginning of the chunk. I used this trick in some of my per-thread
    memory allocators. To free memory a thread would round the address down
    to the nearest chunk size and push the memory into a list. Memory
    allocations had to be at least the size of a word, or they would get
    rounded up to word size.
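
    For illustration, the rounding trick in C, assuming a 64 KB chunk that
    is 64 KB aligned:

        #include <stdint.h>

        #define CHUNK_SIZE 0x10000u    /* 64 KB, a power of two */

        /* Given any pointer into an aligned chunk, recover the chunk base. */
        static void *chunk_base(void *p)
        {
            return (void *)((uintptr_t)p & ~(uintptr_t)(CHUNK_SIZE - 1));
        }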

    It is best to avoid the 64KB limitations altogether; allowing .data
    to be "significantly" far away while still allowing single instruction
    access. {This is what universal constants brings to the party}
    In numerics code one sees:
    LDD R7,[IP,R3<<3,.LBB_002345_foo-.]
    where foo[] can reside within ±2GB of the LDD instruction, as a 2 word instruction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 01:15:40 2024
    On Wed, 03 Jan 2024 22:42:17 +0000, MitchAlsup wrote:

    It is best to avoid the 64KB limitations altogether; allowing .data to
    be "significantly" far away while still allowing single instruction
    access.

    I agree with that. However, my solution to that is a different
    one, which indeed is not so efficient.

    Immediates in my design are strictly for immediate mode operations,
    and can't also be used as absolute addresses, as you are doing.

    Instead, what I have is "array mode", which is a kind of post-indexed
    indirect addressing (array addresses are put in a short segment that
    a special base register points to). So the array address is referenced
    by a short displacement, but that means an extra memory access is
    needed, instead of the address being in the instruction stream.

    Can I modify my instruction format to allow for instead using your
    more efficient solution to this problem? There probably is room;
    change a 12-bit displacement to an 11-bit displacement, and 11
    bits is plenty when I only need six bits...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 01:33:07 2024
    Quadibloc wrote:

    On Wed, 03 Jan 2024 22:42:17 +0000, MitchAlsup wrote:

    It is best to avoid the 64KB limitations altogether; allowing .data to
    be "significantly" far away while still allowing single instruction
    access.

    I agree with that. However, my solution to that is a different
    one, which indeed is not so efficient.

    Immediates in my design are strictly for immediate mode operations,
    and can't also be used as absolute addresses, as you are doing.

    LDD R7,[IP,R3<<3,.L00BK123.foo - .]

    Is not an absolute address! IP is added as the base register and "-." ,
    as part of the displacement, subtracts that very same IP value. So,
    the displacement is not absolute, but a trick is used to make it smell
    as if it were.

    Instead, what I have is "array mode", which is a kind of post-indexed indirect addressing (array addresses are put in a short segment that
    a special base register points to). So the array address is referenced
    by a short displacement, but that means an extra memory access is
    needed, instead of the address being in the instruction stream.

    Can I modify my instruction format to allow for instead using your
    more efficient solution to this problem? There probably is room;
    change a 12-bit displacement to an 11-bit displacement, and 11
    bits is plenty when I only need six bits...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Jan 4 03:41:25 2024
    BGB wrote:

    On 1/3/2024 2:55 PM, MitchAlsup wrote:

    I did it full software in my case, but mostly to try to save cost on a
    mechanism that is used comparably infrequently.

    I used the same mechanism for prologue and epilogue sequences, so it gets
    used often.


    OK.

    Though, not having either is technically the cheapest option.

    You can buy chips with 64 GBOoO 4-to-6-wide cores on them, and you are
    worrying about a sequencer made of 100-odd gates !?


    All as separate specification documents.

    I have an ISA specification document, how unprivileged SW uses ISA as a
    document, and how privileged SW uses ISA as a document; all with cross
    document pointers. Having separate documents allows the non-proprietary
    ISA to be distributed, allowing full access to the ISA but no knowledge of
    privileged state. {{There are no privileged instructions, but there is
    privileged state.}} I still have the privileged document under NDA.


    Hmm...

    My stuff is all public (in my GitHub repository), had assumed that
    anyone that might want to do their own implementation would be free to
    do so.

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean record
    of my <potential> inventions identifying when they were first conceived.

    <snip>

    OK, I don't have any real performance counters at the ISA level.

    This is the advantage of define everything and subset certain things back
    out.

    The microsecond counter was mostly so that programs using functions like "clock()" wouldn't burn too much CPU time with system calls (for some
    types of programs, it is not uncommon to make rapid-fire calls trying to
    get the current time in milliseconds or microseconds).

    I have been in discussions as to whether a RNG is used to add white noise
    to the high precision timer to make side-channels harder to utilize......
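
    A hedged sketch of the idea in C, with rand() standing in for a
    hardware RNG and the jitter width chosen arbitrarily:

        #include <stdint.h>
        #include <stdlib.h>

        extern uint64_t read_hires_counter(void);   /* assumed raw HW counter */

        /* Return the counter with a few ticks of noise added, so another
           thread cannot time cache behaviour to the exact cycle.  The +/-8
           tick range and the use of rand() are arbitrary placeholders. */
        uint64_t read_fuzzed_counter(void)
        {
            int jitter = (rand() & 15) - 8;
            return read_hires_counter() + (uint64_t)(int64_t)jitter;
        }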

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 06:14:09 2024
    On Thu, 04 Jan 2024 01:33:07 +0000, MitchAlsup wrote:

    LDD R7,[IP,R3<<3,.L00BK123.foo - .]

    Is not an absolute address! IP is added as the base register and "-." ,
    as part of the displacement, subtracts that very same IP value. So, the displacement is not absolute, but a trick is used to make it smell as if
    it were.

    That would never have occurred to me.

    I do use program counter relative addressing in my instruction set -
    in the 16 bit instructions (which are now removed from the main
    instruction set, but still exist as 17-bit instructions in blocks
    of variable-length instructions) there are conditional branch
    instructions (inspired by the PDP-11 and TI 9900) with 8-bit
    signed program counter relative displacements.

    But that's it.

    The reason it would never have occured to me to make full-size
    addresses program counter relative instead of absolute is because
    now the linking loader would have to handle them differently. It
    couldn't just _ignore_ them because the relative positions of the
    code segment and data segment of a program aren't determined at
    compile time; the operating system needs to be free to allocate
    them separately.

    The loader can relocate programs by adding the value of the
    appropriate segment start location to a full-size address within
    the code. That might be an address constant in the data segment,
    or it could be something else. But I don't want to ask the loader
    to do *anything else* for purposes of relocation.
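
    A minimal sketch of that loader-side fixup in C, assuming a simple
    (hypothetical) relocation table of offsets where full-size absolute
    addresses live:

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Hypothetical relocation record: where in the image a full-size
           address sits, and which segment base gets added at load time. */
        struct reloc {
            size_t offset;     /* byte offset of the 64-bit address word   */
            int    is_data;    /* 1: add data-segment base, 0: code base   */
        };

        void apply_relocs(uint8_t *image, const struct reloc *r, size_t n,
                          uint64_t code_base, uint64_t data_base)
        {
            for (size_t i = 0; i < n; i++) {
                uint64_t v;
                memcpy(&v, image + r[i].offset, sizeof v);
                v += r[i].is_data ? data_base : code_base;  /* the only fixup */
                memcpy(image + r[i].offset, &v, sizeof v);
            }
        }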

    I have found opcode space - not the space I originally speculated
    about using - for this addition, so

    http://www.quadibloc.com/arch/cw01.htm

    has been revised.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 08:31:55 2024
    On Wed, 03 Jan 2024 17:07:43 +0000, MitchAlsup wrote:

    Architecture is as much about what to leave out as what to put in.

    This is very true, and of course the major flaw in Concertina II is that
    my choice is, in so far as it is at all possible, to leave nothing
    out - to do, in a single instruction, anything that almost any other
    computer ever was able to do in a single instruction.

    With a few exceptions in order to pretend to remain within reason.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Thu Jan 4 08:36:43 2024
    On Wed, 03 Jan 2024 02:47:11 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Right now, though, there's no real motive for people to go from x86 to
    ARM.

    You know, as much as I hate Intel and x86, I hate Apple even more.

    I can appreciate the sentiment, as the restrictions on Apple's App Store
    mean that iOS devices are simply not an option I can consider. And, of
    course, Macs tend not to be upgradeable, and this seems to be so that
    Apple can charge higher prices.

    But there's also Windows on ARM. And there's the whole smartphone
    ecosystem of Android. But all these things together don't provide an
    incentive to leave x86.

    PowerPC and SPARC also exist as RISC alternatives, besides ARM and
    RISC-V, but they've been forgotten, bypassed, sidelined, or whatever.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Thu Jan 4 09:19:41 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit the entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of Concertina
    II, and that turned out to be too much, and I had to drop them.

    But then, RISC-V was designed with little or no regard for code density,

    I think that the fact that they left 3/4ths of the encoding space to
    16-bit instructions shows that they care quite a bit for encoding
    size. If they did not, they would not have the C extension (the
    16-bit instructions) at all.

    How successful are they? Let's update my code-size measurements with
    current data.

    ARCHS="amd64 arm64 armel armhf i386 mips64el ppc64el riscv64 s390x"
    for i in $ARCHS; do
      wget http://ftp.at.debian.org/debian/pool/main/b/bash/bash_5.2.21-2_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/b/bash/bash_5.2.21-2+b1_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/g/grep/grep_3.11-4~exp1_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/g/gzip/gzip_1.12-1_$i.deb
      wget http://ftp.at.debian.org/debian/pool/main/g/gzip/gzip_1.12-1+b2_$i.deb
    done
    for i in $ARCHS; do
      for j in bash_5.2.21-2_$i.deb bash_5.2.21-2+b1_$i.deb grep_3.11-4~exp1_$i.deb gzip_1.12-1_$i.deb gzip_1.12-1+b2_$i.deb; do
        if test -f $j; then
          binary=bin/${j%%_*}
          if test "$binary" = "bin/grep"; then
            binary=usr/bin/grep
          fi
          ar x $j; tar xfJ data.tar.xz ./$binary; objdump -h $binary|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
        fi
      done
      echo $i
    done|sort -nk1

    This produces:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    So RV64GC beats every other 64-bit instruction set in code density by
    a wide margin and the code density is similar to the 32-bit ARM
    A32/T32 instruction set. Given this evidence, it seems to me that
    RV64GC (and its basis RISC-V) was designed with a lot of
    consideration for code density.

    One difference between armhf and armel is that armhf uses T32/A32
    (Thumb2 instructions) while armel uses only A32 (fixed-width 32-bit instructions). This probably accounts for most of the size
    difference between armhf and armel.

    It's interesting that A32/T32 and RV64GC with their fixed-width base
    and compressed extension beat the variable-width AMD64, i386, and
    S390x by such a wide margin.

    In case of armhf vs i386, you cannot even make the legacy argument,
    because ARM A32 was designed at the same time as i386, and T32 only
    tacked on later; ok, you may consider i386 to be tacked on to the
    slightly older 8086 instruction set, but given that 8086 code does not
    work in an i386 binary unless you set some mode flags first, while A32
    code runs without setting a mode bit on an A32/T32-capable CPU, the
    situation is not quite the same.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Quadibloc on Thu Jan 4 04:32:42 2024
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood
    to be.

    My comments about the PDP-8 and RISC were not about what the
    meaning of RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described
    by the people who originally defined the term. Please see my
    longer response to John Levine.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Tim Rentsch on Thu Jan 4 14:26:40 2024
    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Thu Jan 4 10:09:06 2024
    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean record
    of my <potential> inventions identifying when they were first conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Quadibloc on Thu Jan 4 18:18:26 2024
    On Thu, 4 Jan 2024 14:26:40 -0000 (UTC)
    Quadibloc <quadibloc@servername.invalid> wrote:

    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood
    to be.

    My comments about the PDP-8 and RISC were not about what the
    meaning of RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by
    the people who originally defined the term. Please see my longer
    response to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    John Savard


    Current common understanding by whom?
    If you'd ask an average embedded programmer or engineer whether cores
    of his dear Cortex-M microcontrollers are RISC or not, then an absolute
    majority among those who would be able to understand your question
    (which by themselves will likely be in minority) will say "Yes, they
    are".
    As you know, the only instruction set supported by Cortex-M cores
    (except M0) has instructions of two lengths and 16 general-purpose
    registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Thu Jan 4 16:31:59 2024
    On Thu, 04 Jan 2024 09:19:41 +0000, Anton Ertl wrote:

    I think that the fact that they left 3/4ths of the encoding space to
    16-bit instructions shows that they care quite a bit for encoding size.
    If they did not, they would not have the C extension (the 16-bit instructions) at all.

    That is a good point. I think I confused efficient use of RAM with
    efficient use of opcode space.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Thu Jan 4 16:38:23 2024
    On Wed, 03 Jan 2024 14:41:51 +0000, Quadibloc wrote:

    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit the entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of
    Concertina II, and that turned out to be too much, and I had to drop
    them.

    And this, of course, highlights another flaw of Concertina II, especially
    when contrasted with MY 66000.

    Concertina II uses virtually every scrap of available opcode space within
    the 32-bit instruction word. Just recently, I came up with an ingenious
    way to add one bit to the available (non-prefix) portion of the
    zero-overhead instruction/header (which lets me sneak in an operate
    instruction using a pseudo-immediate without using a whole 32-bit
    instruction slot to provide the three bits needed to reserve space for
    the pseudo-immediate value)... which allowed the set of opcodes I
    wanted to provide, _and_ allowed me to do a zero-overhead version of
    the new extra-long absolute address instruction (only for loads and
    stores) as well.

    Another recent change to the architecture was including instructions
    longer than 32 bits as part of the basic 32-bit instruction set without
    headers (through "composed instructions")... because I knew I needed
    a larger opcode space desperately and couldn't just restrict its
    availability to where it could be implemented efficiently.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 18:47:06 2024
    Quadibloc wrote:

    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    Which precludes FP calculations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 18:56:11 2024
    Quadibloc wrote:

    On Wed, 03 Jan 2024 14:41:51 +0000, Quadibloc wrote:

    On Tue, 02 Jan 2024 20:41:07 +0000, MitchAlsup wrote:

    16-bit instructions take 3/4ths of the OpCode Map of RISC-V. If you
    dropped the compressed instructions, I can fit the entire My 66000 ISA
    into the vacated space.....

    Ouch! 16-bit instructions took up 1/4th of the opcode space of
    Concertina II, and that turned out to be too much, and I had to drop
    them.

    And this, of course, highlights another flaw of Concertina II, especially when contrasted with MY 66000.

    Concertina II uses virtually every scrap of available opcode space within
    the 32-bit instruction word.

    Whereas My 66000 has 21 slots freely available at the Major OpCode level.
    and 6 permanently reserved to prevent jumping into data and executing,
    out of the allocated 64 slots. {1,2,3}-Operand calculation instructions
    use 1 slot each. In essence I reserve 1/3rd of the OpCode space for the
    future and pre-reserved 1/10 of the OpCode Space for security.

    Just recently, I came up with an ingenious
    way to add one bit to the available (non-prefix) portion of the
    zero-overhead instruction/header (which lets me sneak in an operate instruction using a pseudo-immediate without using a whole 32-bit
    instruction slot to provide the three bits needed to reserve space for
    the pseudo-immediate value)... which allowed the set of opcodes I
    wanted to provide, _and_ allowed me to do a zero-overhead version of
    the new extra-long absolute address instruction (only for loads and
    stores) as well.

    Another recent change to the architecture was including instructions
    longer than 32 bits as part of the basic 32-bit instruction set without headers (through "composed instructions")... because I knew I needed
    a larger opcode space desperately and couldn't just restrict its
    availability to where it could be implemented efficiently.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Thu Jan 4 19:00:46 2024
    Quadibloc wrote:

    On Wed, 03 Jan 2024 02:47:11 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Right now, though, there's no real motive for people to go from x86 to
    ARM.

    You know, as much as I hate Intel and x86, I hate Apple even more.

    I can appreciate the sentiment, as the restrictions on Apple's App Store
    mean that iOS devices are simply not an option I can consider. And, of course, Macs tend not to be upgradeable, and this seems to be so that
    Apple can charge higher prices.

    But there's also Windows on ARM. And there's the whole smartphone
    ecosystem of Android. But all these things together don't provide an incentive to leave x86.

    I would really like MS to go back to windows 7 {last one I liked}.....

    PowerPC and SPARC also exist as RISC alternatives, besides ARM and
    RISC-V, but they've been forgotten, bypassed, sidelined, or whatever.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Thu Jan 4 19:16:32 2024
    EricP wrote:

    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean record
    of my <potential> inventions identifying when they were first conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"

    If you describe how something works it loses its patentability.
    If you describe what something does abstractly it does not.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Thu Jan 4 19:20:02 2024
    BGB wrote:

    On 1/4/2024 9:09 AM, EricP wrote:
    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of
    patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few
    of my patents knowing their expiration dates. I also have a clean
    record of my <potential> inventions identifying when they were first
    conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"


    My concern was more with the possibility of lawyers being jerks...

    I can alleviate your concerns--they are.

    But, if one mostly sticks to design features that were already in use
    20-30 years ago; there isn't much the lawyers can do...

    And written in books or published in papers.

    Granted, one could argue that this does not cover every possible way in
    which these features could be combined, which is a possible area for
    concern.

    Though, for the most part, it seems that the "enforcement" is mostly
    used against either direct re-implementations of a patented technology,
    or against popular common-use technologies that can be "interpreted" to somehow infringe on a patent (even if the artifact described is often
    almost entirely different), rather than going after ex-nihilo hobby
    projects or similar.

    Also note: if you are not making money by using something claimed in their patent, they can sue but they cannot get any money. So, it is not worth
    their time.....

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Jan 5 02:43:43 2024
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Quadibloc on Fri Jan 5 14:25:27 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been a usable Windows release.....

    Unix forever! :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Scott Lurndal on Fri Jan 5 15:01:05 2024
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been a usable Windows release.....

    Unix forever! :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Jan 5 15:37:50 2024
    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    However, that can be cured. Since the feature is specifically
    *for* stuff like data acquisition programs that run straight on
    the hardware, treat it as an optional feature... which is *not
    included* on any virtual machine.

    Which is great, of course, unless you would like to virtualize
    some data acquisition software for purposes of debugging. So
    instead a more appropriate response is perhaps to _allow_
    including fast context switching through slow mode in virtual
    machines... with a warning in the manual that this is only
    to be done when necessary, as it comes with a huge performance
    hit.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Jan 5 15:18:46 2024
    On Fri, 24 Nov 2023 03:11:17 +0000, MitchAlsup wrote:

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Oh, dear. You've just given me an evil idea.

    On a System/360, context switching wasn't too bad. You just save
    and restore the 16 general registers and the floating-point
    registers.

    On a more recent CPU, you might have to save and restore the
    general registers, the floating-point registers, and the SIMD
    registers.

    On Concertina II, in addition to 32 integer registers, 32
    floating-point registers, 16 SIMD registers, there are also
    eight 64-element vector registers!

    On the Texas Instruments TI 9900, there were 16 general registers
    which were 16 bits long - but they were in memory, so context
    switching was _really_ fast, you just saved and restored the
    workspace pointer!

    So the evil idea is...

    while the CPU does have real registers in order to run at an
    acceptable speed, allow it to also run in "slow mode" with
    a workspace pointer and all the registers in RAM!
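
    A toy C model of the scheme: in slow mode every register access is a
    memory access through the workspace pointer, so a context switch is
    just a pointer change. The types and names here are illustrative only.

        #include <stdint.h>

        /* Slow-mode thread: all "registers" live in RAM behind the
           workspace pointer. */
        struct slow_thread {
            uint64_t *wp;    /* workspace pointer into (on-chip) RAM */
        };

        static uint64_t reg_read(struct slow_thread *t, unsigned r)
        {
            return t->wp[r];          /* a register read is a load   */
        }

        static void reg_write(struct slow_thread *t, unsigned r, uint64_t v)
        {
            t->wp[r] = v;             /* a register write is a store */
        }

        /* "Context switch": nothing to spill, just change the pointer. */
        static void switch_to(struct slow_thread **currentp, struct slow_thread *next)
        {
            *currentp = next;
        }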

    To make it slightly less evil, have the address in the
    workspace pointer point into an on-chip static RAM instead
    of external DRAM.

    And have a second bank of real registers, into which the
    register contents are gradually migrated as the program
    is running - I think the 990/10 or at least the 990/12
    actually used the technique of gradually migrating registers
    in RAM into real registers in the CPU for better performance,
    so that's not new.

    Of course, code that doesn't know it's running in slow mode
    will wastefully save and restore those in-memory registers,
    so the feature would be primarily recommended for use with
    special programs specifically designed for coping with things
    like a high frequency of interrupts.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 5 19:49:18 2024
    Quadibloc wrote:

    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer
    point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register
    can switch between user threads, and writing 2 control registers
    switches between GuestOSs,.....

    However, that can be cured.

    Yes, by placing the data in the right place at the beginning.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 5 19:46:51 2024
    Quadibloc wrote:

    On Fri, 24 Nov 2023 03:11:17 +0000, MitchAlsup wrote:

    This is headed in the right direction. Make context switching something
    easy to pull off.

    Oh, dear. You've just given me an evil idea.

    On a System/360, context switching wasn't too bad. You just save
    and restore the 16 general registers and the floating-point
    registers.

    On a more recent CPU, you might have to save and restore the
    general registers, the floating-point registers, and the SIMD
    registers.

    On Concertina II, in addition to 32 integer registers, 32
    floating-point registers, 16 SIMD registers, there are also
    eight 64-element vector registers!

    One of the reasons My 66000 only has 32 GPRs is context switch time.
    5 cache lines go out, 5 cache lines come in, presto you are in an
    entirely different context--with no more smarts added than a cache.
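
    As a rough sketch in C, assuming 64-byte cache lines: 32 GPRs fill four
    lines and one more line holds control state. The field names beyond the
    GPRs are illustrative, not the actual thread-state format.

        #include <stdint.h>

        #define CACHE_LINE 64

        /* 5 x 64 bytes = 320 bytes of architected thread state. */
        struct thread_state {
            uint64_t gpr[32];      /* 4 cache lines of general registers */
            uint64_t ip;           /* 1 cache line of control state...   */
            uint64_t psw;
            uint64_t root_pointer;
            uint64_t reserved[5];
        };

        _Static_assert(sizeof(struct thread_state) == 5 * CACHE_LINE,
                       "thread state should span exactly five cache lines");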

    On the Texas Instruments TI 9900, there were 16 general registers
    which were 16 bits long - but they were in memory, so context
    switching was _really_ fast, you just saved and restored the
    workspace pointer!

    Remembering where those 5 cache lines came from means you can
    deposit the data where it belongs long term rather than on
    the system/kernel stack.

    So the evil idea is...

    while the CPU does have real registers in order to run at an
    acceptable speed, allow it to also run in "slow mode" with
    a workspace pointer and all the registers in RAM!

    Just treat the registers as if they were a cache from an area
    in memory no other thread will be accessing.

    To make it slightly less evil, have the address in the
    workspace pointer point into an on-chip static RAM instead
    of external DRAM.

    Unless you can get all levels of privilege in that RAM you
    just added complexity and complexity management to context
    switch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Jan 5 21:43:05 2024
    mitchalsup@aol.com (MitchAlsup) writes:
    Quadibloc wrote:

    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer
    point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register
    can switch between user threads, and writing 2 control registers
    switches between GuestOSs,.....

    Doesn't this cost at least one cache line in L1?

    Intel and AMD do this for the virtual machine state, but there's
    an access cost to read from dram. ARM64 keeps all the VM
    state in a small number of system registers that the HV can
    swap as necessary.


    However, that can be cured.

    Yes, by placing the data in the right place at the beginning.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Jan 5 23:44:35 2024
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:


    ARM64 keeps all the VM
    state in a small number of system registers that the HV can
    swap as necessary.

    My 66000 memory maps all control registers so even a remote CPU
    can diddle with stuff a local CPU will see instantaneously
    {mainly for debug of dead core}.

    ARM64 cores have a similar feature.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Jan 5 23:15:21 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Quadibloc wrote:

    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating
    system to another, you have to do a "big context switch" where
    you save and restore all the registers _and_ that on-chip
    static RAM!

    I submit the proper place for memory resident register files and thread-state is in DRAM. Then, writing a single control register
    can switch between user threads, and writing 2 control registers
    switches between GuestOSs,.....

    Doesn't this cost at least one cache line in L1?

    No, because HW is doing the reads and writes, the data streams around
    the L1D. That is, it may have to pass by L1 on the way in, and it can
    pass by L1 on the way out, but it does not interact with the footprint of
    data or inst already in L1. {I am leaning on not storing in L2 on the
    way out but in L3}. Inbound accesses probe the caches, so if data is present
    it gets used. Outbound accesses probe the caches and are written on hits.

    One can in principle bypass the caches on the way in and on the way out.
    DRAM <-> core registers
    or even
    DRAM -> core registers -> DRAM
    where newly arriving registers push out the existing registers.

    You are not expecting the 5 to be needed any time soon.

    Intel and AMD do this for the virtual machine state, but there's
    an access cost to read from dram.

    The important point about using the word DRAM is that this 5-cache
    line structure has a fixed PA. It can be cached anywhere and that
    when that thread in not in control all its thread-state appears to
    be is in that PA.

    ARM64 keeps all the VM
    state in a small number of system registers that the HV can
    swap as necessary.

    My 66000 memory maps all control registers so even a remote CPU
    can diddle with stuff a local CPU will see instantaneously
    {mainly for debug of dead core}.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Jan 5 23:57:14 2024
    On Fri, 05 Jan 2024 19:49:18 +0000, MitchAlsup wrote:
    Quadibloc wrote:
    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace
    pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating system to
    another, you have to do a "big context switch" where you save and
    restore all the registers _and_ that on-chip static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register can
    switch between user threads, and writing 2 control registers switches
    between GuestOSs,.....

    That would certainly make my "evil idea" less evil.

    But, at first glance, that seems like something that
    couldn't possibly be true. Registers are in constant
    use by the processor, so accessing them should be very
    fast. DRAM is slow!

    Of course, though, a little bit of context shows that
    you're not as badly wrong as you might seem at first
    glance. Any computer these days with any pretensions
    to efficiency has cache.

    Oops: I missed reading "memory-resident" above; you did
    not claim that _all_ register files belong in RAM, just
    that my idea of having a special internal memory to allow
    putting registers in memory was a bad one (which I won't
    try to deny).

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Jan 6 00:18:23 2024
    On Fri, 05 Jan 2024 23:15:21 +0000, MitchAlsup wrote:

    My 66000 memory maps all control registers so even a remote CPU can
    diddle with stuff a local CPU will see instantaneously {mainly for debug
    of dead core}.

    Oh, darn. I was going to save money by not providing proper cache
    coherency hardware in implementations of Concertina II, but that
    means I couldn't provide this useful feature!

    Just kidding... sort of.

    Mapping control registers to RAM is something I would never have
    thought of, but I would indeed put pins on the package, the function
    of which would be openly documented, to allow accessing chip internals.

    My perverted purpose in doing so, though, was not so much for legitimate debugging as to permit my chips to be used in retrocomputing toys...

    A computer with a *real front panel* just like in the old days, not
    just one like on the Altair that only handles the external memory bus!

    As for cache coherency... well, of course that has to be supported
    for a computer to actually work the way it's supposed to without
    error. However, the way I would handle it is like this:

    The CPU only bothers about cache coherency for cached data from
    memory that has been _explicitly marked as shared_. So the
    CPUs connected to the same memory have a message bus between them;
    when one requests some memory to be shared, it sends a message
    out about that, and doesn't use that memory until it gets acknowledged;
    _then_ the CPUs that are sharing a certain area of memory notify
    each other when they write to that area of memory.

    The CPUs have to be told - they don't try to keep track of everything
    anyone else might be doing on the bus.
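    To make that handshake concrete, here is a rough sketch with invented
    message names (illustrative only, not a worked-out protocol):

    #include <stdint.h>

    enum coh_msg_type {
        COH_SHARE_REQUEST,  /* "please treat [base, base+len) as shared" */
        COH_SHARE_ACK,      /* "acknowledged; I will report my writes"   */
        COH_WRITE_NOTIFY    /* "I wrote to this address in a shared area" */
    };

    struct coh_msg {
        enum coh_msg_type type;
        uint64_t base;      /* start of the region, or the address written  */
        uint64_t len;       /* size of the region, or size of the write     */
        uint32_t sender;    /* CPU id of the originator                     */
    };

    /* The requesting CPU broadcasts COH_SHARE_REQUEST and waits for acks
       from its peers before touching the region; thereafter every CPU that
       writes the region broadcasts COH_WRITE_NOTIFY so the other sharers
       can invalidate or update their cached copies. */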

    However, I haven't really thought through this aspect of CPU chip
    design. Since a microprocessor needs to handle the full speed of the
    memory bus in order to talk to memory, possibly bus monitoring is
    simpler than a conversational protocol handling only the memory that
    "needs" to be monitored.

    Come to think of it, though, perhaps a CPU needs to be able to do this
    both ways - bus monitoring for normal multi-CPU motherboards, and a conversational protocol so the chips can also be used in NUMA systems.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sat Jan 6 01:35:02 2024
    Quadibloc wrote:

    On Fri, 05 Jan 2024 19:49:18 +0000, MitchAlsup wrote:
    Quadibloc wrote:
    On Fri, 05 Jan 2024 15:18:46 +0000, Quadibloc wrote:

    To make it slightly less evil, have the address in the workspace
    pointer point into an on-chip static RAM instead of external DRAM.

    And then when you're switching from one virtualized operating system to
    another, you have to do a "big context switch" where you save and
    restore all the registers _and_ that on-chip static RAM!

    I submit the proper place for memory resident register files and
    thread-state is in DRAM. Then, writing a single control register can
    switch between user threads, and writing 2 control registers switches
    between GuestOSs,.....

    That would certainly make my "evil idea" less evil.

    But, at first glance, that seems like something that
    couldn't possibly be true. Registers are in constant
    use by the processor, so accessing them should be very
    fast. DRAM is slow!

    Normally you are not as dense as you display tonight.

    Registers have a PA but can be in a core or somewhere
    in the memory hierarchy {not config, not MMI/O}
    and normal caching rules COULD apply.

    Of course, though, a little bit of context shows that
    you're not as badly wrong as you might seem at first
    glance. Any computer these days with any pretensions
    to efficiency has cache.

    Oops: I missed reading "memory-resident" above; you did
    not claim that _all_ register files belong in RAM, just
    that my idea of having a special internal memory to allow
    putting registers in memory was a bad one (which I won't
    try to deny).

    All registers have a landing zone where they can be put back
    or brought forth defined by a PA. HW is responsible for
    obtaining new thread-state and of storing old thread-state.

    BUT BECAUSE thread-state is completely defined by a single
    PA, HW can change from one thread to another by writing
    the control register holding that context PA.
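    As a minimal illustrative sketch only (the register name and address
    below are invented for illustration, not taken from any My 66000
    documentation), a thread switch in such a scheme reduces to one store:

    #include <stdint.h>

    /* hypothetical memory-mapped control register holding the context PA */
    #define CONTEXT_PA_REG ((volatile uint64_t *)0xFFFF0000ULL)

    /* Point the hardware at the thread-state saved at new_ctx_pa; HW is
       assumed to spill the old thread-state back to its own PA and refill
       the register file from the new one. */
    static inline void switch_thread(uint64_t new_ctx_pa)
    {
        *CONTEXT_PA_REG = new_ctx_pa;
    }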

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sat Jan 6 01:41:42 2024
    Quadibloc wrote:

    On Fri, 05 Jan 2024 23:15:21 +0000, MitchAlsup wrote:

    My 66000 memory maps all control registers so even a remote CPU can
    diddle with stuff a local CPU will see instantaneously {mainly for debug
    of dead core}.

    Oh, darn. I was going to save money by not providing proper cache
    coherency hardware in implementations of Concertina II, but that
    means I couldn't provide this useful feature!

    Just kidding... sort of.

    Mapping control registers to RAM is something I would never have
    thought of, but I would indeed put pins on the package, the function
    of which would be openly documented, to allow accessing chip internals.

    My perverted purpose in doing so, though, was not so much for legitimate debugging as to permit my chips to be used in retrocomputing toys...

    A computer with a *real front panel* just like in the old days, not
    just one like on the Altair that only handles the external memory bus!

    As for cache coherency... well, of course that has to be supported
    for a computer to actually work the way it's supposed to without
    error. However, the way I would handle it is like this:

    The CPU only bothers about cache coherency for cached data from
    memory that has been _explicitly marked as shared_.

    So, shared instruction sections are marked exclusive ?!?
    So, thread-local-storage is marked shared if a pointer to its cats
    is constructed !?!
    Can a Hypervisor share code sections with Guest OS ??
    ,...

    Conversely, My 66000 allows one to map ROM (coherence and order free)
    onto DRAM, to provide relief from coherence traffic.

    So the
    CPUs connected to the same memory have a message bus between them;
    when one requests some memory to be shared, it sends a message
    out about that, and doesn't use that memory until it gets acknowledged; _then_ the CPUs that are sharing a certain area of memory notify
    each other when they write to that area of memory.

    The CPUs have to be told - they don't try to keep track of everything
    anyone else might be doing on the bus.

    But certainly, when writing a buffer in VA[k] to disk, the core caches
    have to be snooped so the disk gets the correct data.

    However, I haven't really thought through this aspect of CPU chip
    design. Since a microprocessor needs to handle the full speed of the
    memory bus in order to talk to memory, possibly bus monitoring is
    simpler than a conversational protocol handling only the memory that
    "needs" to be monitored.

    Come to think of it, though, perhaps a CPU needs to be able to do this
    both ways - bus monitoring for normal multi-CPU motherboards, and a conversational protocol so the chips can also be used in NUMA systems.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Jan 6 09:01:02 2024
    On Sat, 06 Jan 2024 01:41:42 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    The CPU only bothers about cache coherency for cached data from memory
    that has been _explicitly marked as shared_.

    So, shared instruction sections are marked exclusive ?!?
    So, thread-local-storage is marked shared if a pointer to its cats is constructed !?!
    Can a Hypervisor share code sections with Guest OS ??

    Presumably, when I get around to designing this part of the hardware,
    I would check on what industry-standard practice is. I may indeed
    have failed to properly think some things through.

    But I can still try to answer your questions, I think.

    The first question:

    Instruction sections aren't normally writeable. So cache coherency
    becomes a given; it's only lost when you write. Presumably, then,
    a shared instruction section would be...

    part of the OS,

    a shared library,

    a permanently resident popular program (i.e. a FORTRAN compiler on
    an ancient mainframe)

    and these areas would have only been written to by the operating system.

    So an OS thread would be its "owner", but other threads could read it.

    Your second question:

    The pointer can exist; the memory has to be readable only if the pointer
    is actually used. And it's marked shared if it's used for writing as well
    as reading.

    Your third question:

    At first, I thought that this was something you would never want to do.

    But actually, it's quite common: there might be multiple instances of
    one particular guest OS running, and so one might as well start them off
    with all permanently resident parts of the OS loaded - and that memory
    might as well be shared by all the instances (and, initially, at least,
    by the parent hypervisor as well) to avoid duplication.

    Stuff that is only shared for reading isn't a coherency issue.

    The CPUs have to be told - they don't try to keep track of everything
    anyone else might be doing on the bus.

    But certainly, when writing a buffer in VA[k] to disk, the core caches
    have to be snooped so the disk gets the correct data.

    If you've designated an area in memory to be a buffer for DMA...

    then you need to treat it like video memory inside a video card.
    Mark it non-cacheable. So I do _not_ expect DMA controllers to
    have a cache snoop capability; as for the CPUs, I was thinking
    in terms of them always broadcasting any changes to shared memory,
    so it's always "push" and never a "pull" so that snoop would be
    needed. But cache snoop is common, so I guess it reduces message
    traffic for cache coherency, which means I'll need to study how
    this is done some more.

    But you have reminded me of something I had forgotten. I thought
    that the CPUs, because they have to work with the memory bus
    at its full speed, could monitor every write to memory, and so
    maintain cache coherency that way, as an option, even if that
    wasn't my preferred option.

    But unless you always and only have write-through caches, the
    actual value of a location in memory can change before a hint
    of that gets out to the bus. In the case where the CPUs talk
    directly to each other about everything that happens in shared
    memory, that isn't a problem - but if they were to just monitor
    the bus without direct communication, they would miss recent
    updates to shared memory.

    Actually, even _with_ a write-through cache, there would still
    be a certain slight delay of a few cycles in a write, which
    would be entirely sufficient to cause occasional problems.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Sat Jan 6 10:16:14 2024
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    Load-store architecture.
    Relatively large register file (32 or more registers)

    ... and the 801, the original ARM v2 (without Thumb) weren't,
    either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sat Jan 6 11:12:37 2024
    On Sat, 06 Jan 2024 09:01:02 +0000, Quadibloc wrote:

    Stuff that is only shared for reading isn't a coherency issue.

    Ouch. Stuff that is only being read isn't a coherency issue.

    But if even one CPU writes to an area of memory, with all the
    other CPUs to which it is shared only reading, clearly when
    those CPUs read, they may need to be sure of reading up-to-date
    information when they read it.

    Of course, though, if the read appears to have taken place
    earlier than it actually did, something else would have to
    have happened that contradicts that for there to be a real
    inconsistency, but the additional interaction that could
    lead to that could also be in the form of a read in the same
    direction rather than a write in the other direction.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sat Jan 6 12:42:30 2024
    MitchAlsup <mitchalsup@aol.com> schrieb:
    BGB wrote:

    On 1/4/2024 9:09 AM, EricP wrote:
    MitchAlsup wrote:
    BGB wrote:

    Also made an effort to avoid anything which lacks prior art from at
    least 20 years ago.

    Yes, over my 35+ year career I was exposed to 10s of thousands of
    patents.
    I tried rigorously to avoid the ones still in effect. I did borrow a few of my patents knowing their expiration dates. I also have a clean
    record of my <potential> inventions identifying when they were first
    conceived.

    IANAL

    With the rule change from "first to invent" to "first to file"
    is having a date record of inventions any use?

    There is also the question of whether writing about something
    on the internet counts as "publication" and might block patenting.

    A quicky search finds this:

    How Publications Affect Patentability
    https://www.utoledo.edu/research/TechTransfer/Publish_and_Perish.html

    "The Internet: A message describing an invention on a web site or to a
    public newsgroup will be considered as published on the day prior to
    the posting"


    My concern was more with the possibility of lawyers being jerks...

    I can alleviate your concerns--they are.

    But, if one mostly sticks to design features that were already in use
    20-30 years ago; there isn't much the lawyers can do...

    And written in books or published in papers.

    Granted, one could argue that this does not cover every possible way in
    which these features could be combined, which is a possible area for
    concern.

    Though, for the most part, it seems that the "enforcement" is mostly
    used against either direct re-implementations of a patented technology,
    or against popular common-use technologies that can be "interpreted" to
    somehow infringe on a patent (even if the artifact described is often
    almost entirely different), rather than going after ex-nihilo hobby
    projects or similar.

    Also note: if you are not making money by using something claimed in their patent, they can sue but they cannot get any money. So, it is not worth
    their time.....

    At least in Germany, there are exceptions to patent protection,
    among them using a patent privately for non-commercial purposes
    and doing research (commercial or otherwise) on the subject of
    the patent (§ 11 Patentgesetz). The latter is very important if,
    for example, people want to try out if what is claimed in the
    patent actually works.

    Not sure what the situation in the US is.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sat Jan 6 16:21:35 2024
    On Mon, 04 Dec 2023 20:03:47 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Since out-of-order is so expensive in power and transistors, though, if
    mitigations do exact a performance cost, then going to a simple CPU
    that is not out-of-order might be a way to accept a loss of
    performance, but gain big savings in power and die size, whereas
    mitigations make those worse.

    18 years ago, when I quit building CPUs professionally, GBOoO
    performance was 2× what a 1-wide IO could deliver. In those 18 years
    the CPU makers have gone from 2× to 3× performance while the execution window has grown from 48 to 300 instructions.
    Clearly an unsustainable µArchitectural direction.

    Yes, the law of diminishing returns means that even if Moore's Law
    still lives on, they can't go _much_ further in that direction.

    But do they have any other directions they can go in to get more
    performance?

    We have heard of a few:

    1) Switch to a new, faster, semiconductor material if it becomes
    possible.

    2) Add new instructions, so as to make some additional operations
    faster. Better yet, put something like an FPGA in the CPU, so
    the chip can do anything quickly!

    3) If we can't make the processors faster, provide more of them.
    This is being done - first they put two CPUs on a chip, then four,
    and now we're seeing quite a few.

    Since we don't yet _have_ a new, faster semiconductor material
    we can use, and since single-thread performance is what is most
    ardently desired because software tends to be largely serial...
    taking out-of-order to extreme lengths, despite diminishing returns,
    has continued to be the most attractive option. Yes, that will have
    to come to an end, but before it does, it may go at least a little
    further, to a point which will seem even more like wretched excess
    to you and many others.

    And this brings me to

    4) Adopt a new ISA, based on a design that does much of what OoO
    does without OoO, based on DSP designs using VLIW and so on. Then,
    with that as a base, also apply OoO, and one should need _less
    extreme_ OoO for the same performance. And get more performance
    when reaching the same level of wretched excess as was tolerated
    before.

    My Concertina II, with its VLIW features, and even (optional)
    instructions to use banks of 128 registers is an attempt to
    show what such an ISA might look like. Or how about an OoO
    implementation of the Itanium? Or even, after the Mill becomes
    popular, a way might be figured out to apply OoO techniques
    to implementing that design, however revolting the thought may
    be to Ivan Godard and its other designers!

    Thanks to the end of Dennard scaling, until a new semiconductor
    material comes along, the pressure to find some way to increase
    performance still more is likely to lead to many novel designs,
    at least some of which will be weird and grotesque.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Jan 6 17:15:01 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    RISC-V without the C extension would be, but the C extension would
    make it non-RISC. Likewise ARM A32 would be, but A32/T32 would not.

    Power has instructions that are not 32 bits in size? Since when?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Quadibloc on Sat Jan 6 09:35:58 2024
    Quadibloc <quadibloc@servername.invalid> writes:

    On Thu, 04 Jan 2024 04:32:42 -0800, Tim Rentsch wrote:

    Quadibloc <quadibloc@servername.invalid> writes:

    Yes, the PDP-8 did have a small and simple instruction set.

    But that is _not_ what the meaning of RISC is commonly understood to
    be.

    My comments about the PDP-8 and RISC were not about what the meaning of
    RISC is commonly understood (or commonly misunderstood)
    to be. Rather they are about the meaning of RISC as described by the
    people who originally defined the term. Please see my longer response
    to John Levine.

    I'm not sure how this helps you, because the original definition
    includes the current common understanding, being a superset of it.

    Current common understanding:

    All instructions the same length.
    Load-store architecture.
    Relatively large register file (32 or more registers)

    Original definition:

    All the above, plus:
    All instructions execute in one cycle.

    It seems you are talking about the definition of an early
    RISC processor.

    What I'm talking about is the orginal description of the
    RISC concept.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jan 6 17:49:21 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    RISC-V without the C extension would be, but the C extension would
    make it non-RISC. Likewise ARM A32 would be, but A32/T32 would not.

    Power has instructions that are not 32 bits in size? Since when?

    Since version 3.1 of the ISA (vulgo Power10), they have the prefixed instructions, which take up two 32-bit words. An example:

    [tkoenig@cfarm120 ~]$ cat add.c
    unsigned long int foo(unsigned long x)
    {
    return x + 0xdeadbeef;
    }
    [tkoenig@cfarm120 ~]$ gcc -c -O3 -mcpu=power10 add.c
    [tkoenig@cfarm120 ~]$ objdump -d add.o

    add.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: ad de 00 06 paddi r3,r3,3735928559
    4: ef be 63 38
    8: 20 00 80 4e blr

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to BGB on Sat Jan 6 17:50:18 2024
    BGB <cr88192@gmail.com> schrieb:
    On 1/5/2024 9:01 AM, Quadibloc wrote:
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever! :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.


    Windows merits:
    More software support;
    Has nearly all of the games;

    Steam doesn't do too badly with Linux.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sat Jan 6 19:52:13 2024
    On Sat, 06 Jan 2024 17:15:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    All instructions the same length.

    So, Power10, RISC-V and 32-bit ARM (which has Thumb) are not RISC.
    Good to know.

    RISC-V without the C extension would be, but the C extension would
    make it non-RISC. Likewise ARM A32 would be, but A32/T32 would not.

    Power has instructions that are not 32 bits in size? Since when?

    - anton

    A32 wouldn't be, even without T2. Too few registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 18:50:46 2024
    BGB wrote:

    On 1/5/2024 9:01 AM, Quadibloc wrote:
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever! :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.


    Windows merits:
    More software support;
    Has nearly all of the games;
    No endless fights with trying to get the GPU and sound hardware working;
    Much less needing to fight with hardware driver issues in general;
    ....

    Linux merits:
    You can mount nearly anything anywhere;
    Can do low-level HDD copies, have more freedom for how to partition and format drives, more available filesystems, ...

    You can back the whole thing up such that recovery is but a DD away.

    Accessing files on Linux is generally significantly faster (though, allegedly, this isn't so much because of the filesystem itself, but
    rather because antivirus software and Windows Defender tend to hook the filesystem access and scan every file that is being read/written, ...).

    Though, in a Windows style environment, it is generally preferable to
    have a small number of comparably large files, than a large number of
    small files.


    General coding experience is not that much different either way.
    If one sticks to mainstream languages and writes code in a portable way,
    they can use mostly similar code on either (apart from code dealing with
    the parts that differ).

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 21:00:38 2024
    BGB wrote:

    On 1/6/2024 12:50 PM, MitchAlsup wrote:
    BGB wrote:

    On 1/5/2024 9:01 AM, Quadibloc wrote:
    On Fri, 05 Jan 2024 14:25:27 +0000, Scott Lurndal wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I
    liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever!  :-)

    It certainly is true that Linux has some major advantages. People
    have had to put up with Windows, though, because some software is
    only available for it.


    Windows merits:
    More software support;
    Has nearly all of the games;
    No endless fights with trying to get the GPU and sound hardware working;
    Much less needing to fight with hardware driver issues in general;
    ....

    Linux merits:
    You can mount nearly anything anywhere;
    Can do low-level HDD copies, have more freedom for how to partition
    and format drives, more available filesystems, ...

    You can back the whole thing up such that recovery is but a DD away.


    I often use Linux + DD to do low level copies of HDDs, which mostly
    works (and can often get an OK drive copy), except in cases where people ignored the drive failing for long enough that it is basically entirely failed, and then this is turned into a massive pain (modern Linux seems
    to drop drives about as soon as it encounters an irrecoverable IO
    error, which is super annoying for data recovery tasks).

    Consider that the alternative is a 4+ hour process (reloading and configuring W11); then reloading all your applications, passwords--and it never ends up "like it was".

    For my main PC, mostly still running Windows.
    For the most part, "everything just works", except when MS is doing
    something annoying.

    May or may not "jump ship" at some point though unless MS backs off on
    some of the stuff they pulled with Win11 (if/when Win10 starts to get unusable).

    Jumping ship, to me, is a dual system {1 Linux, 1 W <as low a number as possible>}
    connected by ethernet chassis to chassis.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 21:12:05 2024
    BGB wrote:

    On 1/6/2024 10:21 AM, Quadibloc wrote:
    On Mon, 04 Dec 2023 20:03:47 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    Since out-of-order is so expensive in power and transistors, though, if mitigations do exact a performance cost, then going to a simple CPU
    that is not out-of-order might be a way to accept a loss of
    performance, but gain big savings in power and die size, whereas
    mitigations make those worse.

    18 years ago, when I quit building CPUs professionally, GBOoO
    performance was 2× what an 1-wide IO could deliver. In those 18 years
    the CPU makers have gone from 2× to 3× performance while the execution >>> window has grown from 48 to 300 instructions.
    Clearly an unsustainable µArchitectural direction.

    Yes, the law of diminishing returns means that even if Moore's Law
    still lives on, they can't go _much_ further in that direction.


    Yes.

    And, even then, 2x .. 3x vs a 1-wide isn't THAT big of an advantage,
    given the GBOoO is going to use a lot more die area and power.


    But do they have any other directions they can go in to get more
    performance?

    We have heard of a few:

    1) Switch to a new, faster, semiconductor material if it becomes
    possible.

    <snip>
    3) If we can't make the processors faster, provide more of them.
    This is being done - first they put two CPUs on a chip, then four,
    and now we're seeing quite a few.

    Software continues to tell us that they cannot use 100+ cores, and
    the 3,4,5,6 they can use need to be as fast as one can figure out
    how to do. It is easily possible to put 256+ R3000 cores (plus FP)
    on a single die all of them running 3GHz+.

    This is where I had assumed small static scheduled CPUs could have merit.

    OoO costs roughly 3× In Order power and provides 1.4× performance (hand waving accuracy). GBOoO, on the other hand, costs roughly 4× and provides
    1.4× performance. So, overall, the last factor of 2× in performance costs 12× in area and power, and such cores are generally surrounded with larger caches to
    keep up with the larger throughput, raising the area (but not so much the
    power) again.
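    (One way to read those figures so that they compose, stated here as an
    assumption rather than something given above: 3× × 4× = 12× in power and
    area, while 1.4× × 1.4× ≈ 2× in performance, which is where "the last
    factor of 2× costs 12×" comes from.)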

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Jan 6 22:43:58 2024
    BGB wrote:


    This is where I had assumed small static scheduled CPUs could have merit.
    OoO costs roughly 3× In Order power and provides 1.4× performance (hand
    waving accuracy). GB, on the other hand, costs roughly 4× and provides
    1.4× performance. So, overall, the last factor of 2× in performance
    costs 12× in area and power and are generally surrounded with larger
    caches to
    keep up with the larger throughput raising the area (but not so much the
    power) again.

    OK.

    I guess the question is, say, the cost/benefit tradeoffs between OoO vs static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar to
    an in-order superscalar, except possibly a little cheaper since it can
    leave out one of the expensive parts of an in-order superscalar...).

    *: Say, designed for a maximum of 2 or 3 instructions/clock, with
    explicit tagging for parallel execution (where the 'V' in 'VLIW'
    seemingly tends to also imply wider execution often with an absence of
    useful things like interlock handling or register forwarding...).

    Also, assuming that one has a "doesn't suck" compiler for it...

    Is there a question here ??

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Sun Jan 7 06:07:21 2024
    On Sat, 06 Jan 2024 22:43:58 +0000, MitchAlsup wrote:
    BGB wrote:

    I guess the question is, say, the cost/benefit tradeoffs between OoO vs
    static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar
    to an in-order superscalar, except possibly a little cheaper since it
    can leave out one of the expensive parts of an in-order
    superscalar...).

    *: Say, designed for a maximum of 2 or 3 instructions/clock, with
    explicit tagging for parallel execution (where the 'V' in 'VLIW'
    seemingly tends to also imply wider execution often with an absence of
    useful things like interlock handling or register forwarding...).

    Also, assuming that one has a "doesn't suck" compiler for it...

    Is there a question here ??

    I think it's clear what the _answer_ is:

    "You just described the Itanium. It failed big time, so your answer
    is no."

    Now, if you don't know the question, but you do have the answer, if it's something as enigmatic as "42", and you only have a vague description of
    the question: "The great question of life, the Universe, and everything",
    then the process of recovering the actual wording of the question can be
    very convoluted, involving pan-dimensional beings disguising themselves
    as white mice.

    However, in this case, I don't think it's that difficult.

    To be, or not to be, that is the question.

    Whether 'tis nobler in the mind to suffer the thermal issues and
    excessive power consumption resulting from the outrageous transistor
    counts of Great Big Out-of-Order microarchitectures,

    or to oppose them with an ISA which directly handles the pipeline in
    VLIW or even RISC fashion, and by opposing them, end them...

    I recall that I derived the following understanding of _your_
    answer to this question some time ago, but I may have misunderstood
    what you were writing:

    (begin my description of what I think your answer is)
    VLIW-style ISAs have failed to serve as a replacement for OoO
    execution.

    But that does not mean we are without hope of finding something
    better. The problem is that the standard textbooks have failed to
    properly represent what OoO is _for_.

    The scoreboard in the Control Data 6600 is just briefly mentioned,
    and then it's noted that it couldn't solve all the hazards related
    to RAW and WAR and so on, and then the Tomasulo came along for the
    IBM System/360 Model 91, and did it _right_.

    That misses the fact that register hazards aren't the only thing
    that OoO execution helps with. It also helps with *cache misses*.

    And the 6600-style scoreboard is adequate to deal with cache misses.

    Therefore, if you want to make a computer that replaces today's
    bloated GBOoO designs, without the transistor bloat, but which
    offers performance that competes with them, what you need to do
    is indeed take care of the register hazards the way RISC
    architectures have done... but then, instead of abolishing OoO
    from your design after you've done that, keep the basic and
    reasonable 6600-style scoreboard so that cache misses don't
    kill your performance.
    (end description)

    I may have gotten it badly wrong, as I pieced it together from
    little things you wrote here and there on various occasions.

    But at least now we have a straw man to point at and debate.
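    As a toy illustration of the scoreboard half of that straw man (purely
    invented code, not taken from any real design), the issue check can be as
    simple as per-register "result in flight" bits; an instruction whose
    operands are ready issues even while an earlier load that missed the
    cache is still outstanding:

    #include <stdbool.h>

    #define NREGS 32

    typedef struct {
        bool busy[NREGS];      /* true while a result is still in flight */
    } scoreboard;

    typedef struct {
        int dst, src1, src2;   /* register numbers, -1 if unused */
    } insn;

    /* Stall only on a RAW hazard (a source still in flight) or a simple
       WAW rule (the destination already has a result in flight). */
    static bool can_issue(const scoreboard *sb, const insn *i)
    {
        if (i->src1 >= 0 && sb->busy[i->src1]) return false;
        if (i->src2 >= 0 && sb->busy[i->src2]) return false;
        if (i->dst  >= 0 && sb->busy[i->dst])  return false;
        return true;
    }

    static void issue(scoreboard *sb, const insn *i)
    {
        if (i->dst >= 0) sb->busy[i->dst] = true;
    }

    /* Called when a result arrives, e.g. a load finally returning after a
       cache miss; younger independent instructions were never blocked. */
    static void complete(scoreboard *sb, int dst)
    {
        if (dst >= 0) sb->busy[dst] = false;
    }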

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to BGB on Sun Jan 7 09:21:47 2024
    On Sun, 07 Jan 2024 01:14:52 -0600, BGB wrote:

    What if the goal isn't "fastest single-thread performance", but instead,
    best performance relative to die area and per watt?...

    If _that_ were the case, we would _already_ be using in-order CPUs, and
    the wasteful nature of out-of-order execution would have precluded its
    adoption entirely.

    As you've pointed out, where that _is_ the goal, things like Cortex A53
    cores are still doing just fine.

    But when it comes even to the humble low-end laptop, Intel found it
    necessary to redesign their Atom processor to be a lightweight OoO
    chip, instead of the in-order design it originally had.

    As the saying goes, nine women working together can't have a baby in
    one month. Most computational problems aren't "embarassingly parallel",
    so they don't scale well enough to avoid the situation we're in today:
    people want their programs to run as fast as the current state of the
    art in technology allows, and to attain that, they need the maximum single-thread performance attainable.

    The path to that which we currently have available involves out-of-order execution.

    I have no quarrel with OoO as a useful tool, but I also acknowledge that,
    as Mitch has pointed out, today's desktop microprocessors have taken it
    to the point of wretched excess.

    Humanity could survive in a world where video games had to be written to
    run acceptably on computers with a clock speed no higher than a single gigahertz!

    And OoO isn't the _only_ wretchedly excessive thing about today's microprocessors. The small feature sizes that allow a single die to
    contain eight complete CPUs with a great big out-of-order design
    are attained by means of chip fabs that cost billions of dollars to
    build. Couldn't we have just stopped at, say, 33nm or something?


    The competitive demands Intel and AMD face - the desires of us as
    consumers - are what prevents this from happening, and I see no hope
    for the world to change to what might be seen as the path of virtue in
    this area.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sun Jan 7 09:30:31 2024
    BGB <cr88192@gmail.com> writes:
    I can also note that I am still using a cellphone running on
    in-order Cortex A53 cores...


    Like, seemingly ARM has gone one direction, moving to primarily OoO
    cores for newer designs, but then a lot of cellphones are seemingly
    like, "Meh, whatever, we will just stick with a 8x Cortex-A53 chip from >MediaTek..."


    But, if OoO were clearly superior, presumably people would have stopped
    using the Cortex-A53 ...

    But, there were still chips being released in 2023 using exclusively A53 cores (and they appear to still be popular in cellphones).

    Say, for example: https://en.wikipedia.org/wiki/Moto_E7
    (Though, this is a model from 2020/2021).

    A Cortex-A53 is cheap, in both area and licensing fees to ARM. And
    the smartphones that use SoCs with these cores usually are cheap, too.
    If they were the same price, people would probably go for a smartphone
    with the Mediatek Dimensity 9300 with 4 OoO Cortex-X4 and 4 OoO
    Cortex-A720 or with an Apple A11 or later (not sure about A9 and A10,
    but A7 and A8 also used only OoO cores), neither of them with any of
    those in-order cores that you get with Qualcomm offerings.

    And if the users do not need the increased performance of the OoO
    cores, why should they pay more to get it?

    So, rather than (V)LIW competing against OoO, maybe it can compete
    against in-order superscalar? ...

    Not in smartphones, where software compatibility is a required
    feature.

    Or, with the higher end of the microcontroller space?...

    Even there, the benefits of a common platform means that the industry
    is consolidating on ARM; e.g., Philips (now NXP) made the Trimedia
    processors (VLIW), but terminated development in 2010. Some users,
    such as WD defecting to RISC-V to avoid the ARM tax, but RISC-V still
    provides a common platform. Are you (or anyone else) able to provide
    a VLIW platform that outcompetes ARM and RISC-V?

    My thinking is not so much that one should have an ISA that mandates
    VLIW, but instead, focuses on avoiding a few of the expensive parts of in-order superscalar (namely the logic for figuring out whether
    instructions can be executed in parallel).

    Apparently that logic is not as expensive as you think.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to BGB on Sun Jan 7 10:30:00 2024
    In article <undj1h$10fej$1@dont-email.me>, cr88192@gmail.com (BGB) wrote:

    Like, seemingly ARM has gone one direction, moving to primarily OoO
    cores for newer designs, but then a lot of cellphones are seemingly
    like, "Meh, whatever, we will just stick with a 8x Cortex-A53 chip
    from MediaTek..."

    But, if OoO were clearly superior, presumably people would have
    stopped using the Cortex-A53 ...

    OoO is currently superior for achieving high performance per clock, but in-order allows better performance per watt. The Cortex-A53 has
    successors, in Cortex-A55, Cortex-A510, and Cortex-A520, which have ISA upgrades, better power efficiency and options for bigger caches.
    Interestingly, they run at lower clock speeds for better performance.

    <https://en.wikipedia.org/wiki/ARM_Cortex-A520#Architecture_comparison>

    However, the -A53 seems to be cheaper to license, so it still gets used.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Sun Jan 7 14:30:53 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Sun, 07 Jan 2024 01:14:52 -0600, BGB wrote:

    What if the goal isn't "fastest single-thread performance", but instead,
    best performance relative to die area and per watt?...

    If _that_ were the case, we would _already_ be using in-order CPUs, and
    the wasteful nature of out-of-order execution would have precluded its adoption entirely.

    Reality check: There are areas where parallelism is embarrassing, or
    at least abundant, such as supercomputing and whatever people are
    using the 192-core AmpereOne, the 128-Core Bergamo, and the upcoming
    288-core Sierra Forest for. And yet Intel switched from the in-order
    Knights Corner to the OoO Knights Landing and eventually replaced
    this line with AVX-512-enhanced mainline Xeons (wide OoO). And they
    also use the OoO Gracemont (or its successor) for Sierra Forest rather
    than building something that has a larger number of in-order cores.

    My guess is that the overhead of a shared-memory interface is so big
    that it does not pay to replace one such interface and a medium to big
    OoO core with, say, two such interfaces and two tiny in-order
    cores, because the in-order cores are slower by more than a factor of
    2. And the fact that the in-order core itself is only 1/12 the size
    of the OoO core (or whatever number) does not really help because the
    core plus the shared-memory interface are not that much smaller.

    And OoO isn't the _only_ wretchedly excessive thing about today's microprocessors. The small feature sizes that allow a single die to
    contain eight complete CPUs with a great big out-of-order design
    are attained by means of chip fabs that cost billions of dollars to
    build. Couldn't we have just stopped at, say, 33nm or something.

    That would be a wretched excess. Intel uses the denser processes to
    reduce its production costs. Admittedly, with increasing wafer
    processing costs of recent processes that may no longer work (or maybe
    the wafter costs we read about just reflect the fact that TSMC now has
    a monopoly on the densest processes.

    The competitive demands Intel and AMD face - the desires of us as
    consumers - are what prevents this from happening, and I see no hope
    for the world to change to what might be seen as the path of virtue in
    this area.

    Nobody forces you to replace your CPU with one with a denser process.
    If you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or
    you can get a Raspi 3, where the SoC is made in 40nm (according to <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Jan 7 17:52:59 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Power has instructions that are not 32 bits in size? Since when?

    Since version 3.1 of the ISA (vulgo Power10), they have the prefixed instructions, which take up two 32-bit words. An example:

    [tkoenig@cfarm120 ~]$ cat add.c
    unsigned long int foo(unsigned long x)
    {
    return x + 0xdeadbeef;
    }
    [tkoenig@cfarm120 ~]$ gcc -c -O3 -mcpu=power10 add.c
    [tkoenig@cfarm120 ~]$ objdump -d add.o

    add.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: ad de 00 06 paddi r3,r3,3735928559
    4: ef be 63 38
    8: 20 00 80 4e blr

    Interesting. Maybe somebody read the long-constant advocacy in this
    group.

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    Ouch. This means that Power with prefixed instructions is the second instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed, while for MIPS with its multiple extravaganzas (apart from the load delay slots, the jump and call
    encoding is problematic) our solution was to just disable all
    optimizations based on this concatenation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andreas Eder@21:1/5 to Scott Lurndal on Sun Jan 7 20:10:06 2024
    On Fr 05 Jan 2024 at 14:25, scott@slp53.sl.home (Scott Lurndal) wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Thu, 04 Jan 2024 19:00:46 +0000, MitchAlsup wrote:

    I would really like MS to go back to windows 7 {last one I liked}.....

    Finally, something we both agree on!

    Really, there has never been an usable Windows release.....

    Unix forever! :-)

    +1

    'Andreas

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sun Jan 7 19:21:20 2024
    Quadibloc wrote:

    On Sat, 06 Jan 2024 22:43:58 +0000, MitchAlsup wrote:
    BGB wrote:

    I guess the question is, say, the cost/benefit tradeoffs between OoO vs
    static-scheduled 'LIW' (granted, 'LIW' (*) is probably fairly similar
    to an in-order superscalar, except possibly a little cheaper since it
    can leave out one of the expensive parts of an in-order
    superscalar...).

    *: Say, designed for a maximum of 2 or 3 instructions/clock, with
    explicit tagging for parallel execution (where the 'V' in 'VLIW'
    seemingly tends to also imply wider execution often with an absence of
    useful things like interlock handling or register forwarding...).

    Also, assuming that one has a "doesn't suck" compiler for it...

    Is there a question here ??

    I think it's clear what the _answer_ is:

    "You just described the Itanium. It failed big time, so your answer
    is no."

    Now, if you don't know the question, but you do have the answer, if it's something as enigmatic as "42", and you only have a vague description of
    the question: "The great question of life, the Universe, and everything", then the process of recovering the actual working of the question can be
    very convoluted, involving pan-dimensional beings disguising themselves
    as white mice.

    However, in this case, I don't think it's that difficult.

    To be, or not to be, that is the question.

    Whether 'tis nobler in the mind to suffer the thermal issues and
    excessive power consumption resulting from the outrageous transistor
    counts of Great Big Out-of-Order microarchitectures,

    or to oppose them with an ISA which directly handles the pipeline in
    VLIW or even RISC fashion, and by opposing them, end them...

    I recall that I derived the following understanding of _your_
    answer to this question some time ago, but I may have misunderstood
    what you were writing:

    (begin my description of what I think your answer is)
    VLIW-style ISAs have failed to serve as a replacement for OoO
    execution.

    But that does not mean we are without hope of finding something
    better. The problem is that the standard textbooks have failed to
    properly represent what OoO is _for_.

    The scoreboard in the Control Data 6600 is just briefly mentioned,
    and then it's noted that it couldn't solve all the hazards related
    to RAW and WAR and so on, and then the Tomasulo came along for the
    IBM System/360 Model 91, and did it _right_.

    Thornton SB for CDC 6600 is 11,000 gates for the whole thing.
    Tomasulo RS for IBM 360/91 is 11,000 gates per entry.

    That misses the fact that register hazards aren't the only thing
    that OoO execution helps with. It also helps with *cache misses*.

    One CAN solve the other hazards with another SB, should one choose.

    And the 6600-style scoreboard is adequate to deal with cache misses.

    Therefore, if you want to make a computer that replaces today's
    bloated GBOoO designs, without the transistor bloat, but which
    offers performance that competes with them, what you need to do
    is indeed take care of the register hazards the way RISC
    architectures have done... but then, instead of abolishing OoO
    from your design after you've done that, keep the basic and
    reasonable 6600-style scoreboard so that cache misses don't
    kill your performance.
    (end description)

    I may have gotten it badly wrong, as I pieced it together from
    little things you wrote here and there on various occasions.

    But at least now we have a straw man to point at and debate.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Jan 7 20:39:37 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    Ouch. This means that Power with prefixed instructions is the second instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed,

    Or by inserting NOPs in the right places; otherwise you lose the
    functionality for Power10.

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
    .LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Mon Jan 8 00:13:49 2024
    On Sun, 07 Jan 2024 14:30:53 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> tried to write:

    The competitive demands Intel and AMD face - the desires of us as
consumers - are what prevents this from happening, and I see no hope for the world to change to what might be seen as the path of virtue in this area.

    Nobody forces you to replace your CPU with one with a denser process. If
    you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
    can get a Raspi 3, where the SoC is made in 40nm (according to <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    What I wrote didn't contradict what you are saying in your response.

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us
    to buy newer and faster microprocessors, by refusing to continue
    issuing security updates for Windows 7, or, for that matter,
    Windows XP, Windows 98, or even Windows 3.1. Then I would be
    disagreeing with you, but I wasn't getting into that part of
    the issue.)

    Instead, what I wrote said that we, as consumers, are so greedy
    for ever faster computers that we are the ones to blame for forcing
    Intel and AMD to resort to techniques that require expensive fabs
    to make the chips, and that require the chips to have enormous
    numbers of transistors for each individual core.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Jan 8 00:23:28 2024
    On Sun, 07 Jan 2024 19:21:20 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    That misses the fact that register hazards aren't the only thing that
    OoO execution helps with. It also helps with *cache misses*.

One CAN solve the other hazards with another SB, should one choose.

    Now that is something I did not know.

    In fact, if I am understanding what you are saying here correctly:

    It is possible to design an out-of-order CPU which addresses all the
    basic types of register hazard, just as those designed using the
    Tomasulo algorithm or those which equivalently use register renaming
    instead, by using a modified form of the scoreboard of the Control
    Data 6600.

    Doing so would be more efficient, as the transistor count would be significantly lower.

...then, of course, my question is why isn't this what everyone is
    doing already?

    I mean, the answer *could* be that:

    Only I, Mitch Alsup, know how this can be done. The world will have
    to await my patent filing to find out how...

    which is, in fact, a fair answer; you deserve to be paid for such
    a valuable invention...

    but if that _isn't_ the answer, then what the answer could possibly
    be that could explain such counter-productive behavior evades me
    completely.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Mon Jan 8 00:29:36 2024
    On Mon, 08 Jan 2024 00:23:28 +0000, Quadibloc wrote:

    but if that _isn't_ the answer, then what the answer could possibly be
    that could explain such counter-productive behavior evades me
    completely.

    Further reflection allowed me to recognize that there _was_ another
    possible answer:

    I wasn't saying that there was any free lunch here. Remember, there is
    simple basic out-of-order execution, and Great Big out-of-order
    execution.

A basic out-of-order functional unit designed around a second 6600-style scoreboard wouldn't actually be all that much different from one that
    uses register renaming. In particular, they would be similar in the
    speedup achieved, and in their transistor counts.

    If _that_ is the answer, then my initial response resulted from me
    having misunderstood what you had written.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Mon Jan 8 00:40:16 2024
    Thomas Koenig wrote:

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
.LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    How did 17 adds become 8 ??

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Jan 8 00:44:23 2024
    Quadibloc wrote:

    On Sun, 07 Jan 2024 14:30:53 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> tried to write:

The competitive demands Intel and AMD face - the desires of us as consumers - are what prevents this from happening, and I see no hope for the world to change to what might be seen as the path of virtue in this area.

    Nobody forces you to replace your CPU with one with a denser process. If
    you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
    can get a Raspi 3, where the SoC is made in 40nm (according to
    <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    What I wrote didn't contradict what you are saying in your response.

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us
    to buy newer and faster microprocessors, by refusing to continue
    issuing security updates for Windows 7, or, for that matter,
    Windows XP, Windows 98, or even Windows 3.1. Then I would be
    disagreeing with you, but I wasn't getting into that part of
    the issue.)

    I am calling Strawman on this::

I am of the opinion that the SW that arrives with a box/laptop should be
the same over the lifetime of the product. I turn all updating of
SW off and remove power when I am not using the device, to prevent
    MS from updating things I DON'T want updated--this includes security
    patches.

    Instead, what I wrote said that we, as consumers, are so greedy
    for ever faster computers that we are the ones to blame for forcing
    Intel and AMD to resort to techniques that require expensive fabs
    to make the chips, and that require the chips to have enormous
    numbers of transistors for each individual core.

    In 07 I bought a W7 machine which worked for 9+years and died of
    a power transistor blowing out. I would still be using that machine
    today if it had not blown up. I reached the end of "chasing performance"
    more than a decade ago.....

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Jan 8 01:12:32 2024
    On Mon, 08 Jan 2024 00:44:23 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us to
    buy newer and faster microprocessors, by refusing to continue issuing
    security updates for Windows 7, or, for that matter, Windows XP,
    Windows 98, or even Windows 3.1. Then I would be disagreeing with you,
    but I wasn't getting into that part of the issue.)

    I am calling Strawman on this::

    I am of the opinion that the SW that arrives with a box/laptop be the
    same over the lifetime of the product. I turn all updating of SW off and remove power at time I am not using the device to prevent MS from
    updating things I DON'T want updated--this includes security patches.

    The fact that I am seeing your posts here means that you _have_ tried connecting a computer to the Internet. Which invalidates the first
    response to that which comes to mind.

Perhaps you use Linux or something. But as Windows users know very well
through sad experience, if you don't keep your computer up-to-date,
patching vulnerabilities in Windows as soon as they are discovered...
    your computer could end up infected within minutes of being connected to
    the Internet.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Jan 8 00:58:24 2024
    Quadibloc wrote:

    On Sun, 07 Jan 2024 19:21:20 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    That misses the fact that register hazards aren't the only thing that
    OoO execution helps with. It also helps with *cache misses*.

One CAN solve the other hazards with another SB, should one choose.

    Now that is something I did not know.

    In fact, if I am understanding what you are saying here correctly:

    It is possible to design an out-of-order CPU which addresses all the
    basic types of register hazard, just as those designed using the
    Tomasulo algorithm or those which equivalently use register renaming
    instead, by using a modified form of the scoreboard of the Control
    Data 6600.

    Yes, and you can include timing such that you can forward results as
operands, too. The only thing a SB mandates you do is to read the RF
after launch (which most GBOoO machines do today anyway).

    Doing so would be more efficient, as the transistor count would be significantly lower.

One has to be careful, as a SB has a quadratic component where
Tomasulo has a (heavy-weight) linear component.

CDC 6600 SB partitioned the registers into 3 files of 8 each,
and had unpipelined (but concurrent) function units. These
led to the small number of instructions waiting for launch.

...then, of course, my question is why isn't this what everyone is
    doing already?

The std textbook (H&P) basically says SB == bad, use Tomasulo.

    I mean, the answer *could* be that:

    Only I, Mitch Alsup, know how this can be done. The world will have
    to await my patent filing to find out how...

I gave it away when Luke asked for it.

    which is, in fact, a fair answer; you deserve to be paid for such
    a valuable invention...

    I did it for fun, actually, because I wanted to really know.

    but if that _isn't_ the answer, then what the answer could possibly
    be that could explain such counter-productive behavior evades me
    completely.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Jan 8 01:18:06 2024
    On Mon, 08 Jan 2024 00:58:24 +0000, MitchAlsup wrote:

    One has to be careful as a SB has a quadratic component where Tomasulo
    has (heavy weight) linear component.

    Ah. Given that OoO as currently used in desktop processors is of the GBOoO variety, that quadratic component would loom large, and thus this is a
big part of the answer to why everyone isn't doing it that way.

    But as you noted, this was a scheme unique to you - I had begun to
    speculate that perhaps register renaming, instead of being Tomasulo
    in disguise, could have been scoreboard-based at its very outset, and
    was doing a literature search for more information to see if that was
    how it went. But no, it wasn't.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Mon Jan 8 02:49:13 2024
    On Mon, 08 Jan 2024 01:12:32 +0000, Quadibloc wrote:
    But as Windows users know very well
    through sad experience is that if you don't keep your computer
    up-to-date,
    to patch vulnerabilities in Windows as soon as they are discovered...
    your computer could end up infected within minutes of being connected to
    the Internet.

    I was working in a call centre when MS 08-067 struck.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to BGB on Sun Jan 7 23:41:53 2024
    On Sat, 6 Jan 2024 11:48:40 -0600, BGB <cr88192@gmail.com> wrote:

    Windows merits:
    More software support;
    Has nearly all of the games;
    No endless fights with trying to get the GPU and sound hardware working;
    Much less needing to fight with hardware driver issues in general;

    - DLLs can have private heaps

    ...

    Windows demerits:

    - essentially no POSIX compliance. all the function is there but with
    different APIs

    - possible multiple instances of a DLL's code in memory
    [much less likely with 64-bit, but still possible]


    Linux merits:
    You can mount nearly anything anywhere;
Can do low-level HDD copies, have more freedom for how to partition and format drives, more available filesystems, ...

    - only one instance of any DLL's code in memory


    Linux demerits:

    - more difficult to give a DLL a private heap

    - mmap/mprotect/madvise are a crappy 1-button interface

- no easy way to monitor VMM pages for writes (e.g., for GC); one common workaround is sketched below


    for contrast see
    https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/
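
For what it's worth, the usual Linux workaround for the write-monitoring
point above is to write-protect the pages and catch the fault. A rough
sketch follows (names like tracked_base and on_write are invented here,
and calling mprotect() from a signal handler is the customary but not
strictly portable trick); Windows exposes the same facility directly via
VirtualAlloc(MEM_WRITE_WATCH) and GetWriteWatch().

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE  4096
    #define NPAGE 1024

    static char *tracked_base;            /* start of the watched region */
    static volatile char dirty[NPAGE];    /* one flag per watched page   */

    static void on_write(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t a = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1);
        uintptr_t b = (uintptr_t)tracked_base;
        if (a >= b && a < b + (uintptr_t)NPAGE * PAGE) {
            dirty[(a - b) / PAGE] = 1;     /* remember the dirty page */
            mprotect((void *)a, PAGE, PROT_READ | PROT_WRITE);
        }                                  /* a real handler re-raises otherwise */
    }

    static void start_watching(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_write;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        /* read-only mapping: the first write to each page faults once */
        tracked_base = mmap(NULL, (size_t)NPAGE * PAGE, PROT_READ,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }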



    Though, in a Windows style environment, it is generally preferable to
have a small number of comparatively large files, rather than a large number of
    small files.

    Depends on the cache configuration:

    Workstations default to what essentially is a (small) private cache
    per process: if a second process opens the same file, it gets its own
    cache copy. Even if lots of memory is available, once a process fills
    up its own little cache, it starts to thrash.

    Servers default to a single combined cache for all processes.

    It is possible to change the cache sizes, to run workstations with a
    single combined cache, or to run servers with per process private
caches ... in each case you just have to know what to diddle in the
    registry.



    General coding experience is not that much different either way.
    If one sticks to mainstream languages and writes code in a portable way,
    they can use mostly similar code on either (apart from code dealing with
    the parts that differ).

    ...

    The problem is that you are quite limited in what you can do without
using Windows' own APIs. Although it can be (and has been) done, it is difficult for a simple abstraction to paper over the differences
    between POSIX and Windows.
    [Asynch IO in particular is completely different.]
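
To make the point concrete, here is a hedged sketch (not from George's
post; map_ro is an invented name, and error handling is omitted) of how
even something as small as "map a file read-only" already needs two code
paths, before asynch I/O enters the picture at all:

    /* Two code paths for one tiny operation. */
    #ifdef _WIN32
    #include <windows.h>
    void *map_ro(const char *path, size_t *len)
    {
        HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        LARGE_INTEGER sz; GetFileSizeEx(f, &sz);
        HANDLE m = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
        *len = (size_t)sz.QuadPart;
        return MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    }
    #else
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    void *map_ro(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        struct stat st; fstat(fd, &st);
        *len = (size_t)st.st_size;
        void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        return p;
    }
    #endif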


    YMMV.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Jan 8 09:59:22 2024
    BGB <cr88192@gmail.com> writes:
    On 1/7/2024 3:30 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    So, rather than (V)LIW competing against OoO, maybe it can compete
    against in-order superscalar? ...

    Not in smartphones, where software compatibility is a required
    feature.


In smartphones, the program is typically being AOT'ed from a VM (such as Dalvik), rather than distributing binaries as native ARM code.

    From the POV of a Dalvik style VM, it shouldn't really matter that much.

    If all programs used just Dalvik, yes, you would "just" need to write
a Dalvik implementation for your VLIW. But the reality is that there are
enough programs that are written or have components distributed as
native code to make your non-ARM architecture uncompetitive, even with
a working binary translator.

Even there, the benefits of a common platform mean that the industry
is consolidating on ARM; e.g., Philips (now NXP) made the Trimedia
processors (VLIW), but terminated development in 2010. Some users,
such as WD, are defecting to RISC-V to avoid the ARM tax, but RISC-V still
    provides a common platform. Are you (or anyone else) able to provide
    a VLIW platform that outcompetes ARM and RISC-V?


    Trimedia (and the TMS320C6x) line differ partly in that they were true
    VLIW, rather than "LIW". So, in this case, I was imagining something
    more similar to the ESP32 (LX6) or Qualcomm Hexagon or similar.

    If even VLIW could not compete with ARM in a certain embedded niche,
    why should LIW?

    But, if RISC-V is run with similar restrictions on the pipeline, for
    some of the programs tested (such as Doom), it seems to require
    executing around twice as many instructions for a similar amount of work
    (*).

The design philosophy of RISC-V favours having simple instructions and
    combining them in the decoder over providing combined instructions, so
    one would expect more executed instructions for RV64G(C) than for ARM
    A64, which favours a fixed 32-bit format with instructions that do as
    much as fits in 32 bits (i.e., precombined instructions). But if
    combining in the decoder works, that does not mean that the programs
    take longer to execute with a similarly capable back-end.

    Though, this is not true of Dhrystone, where seemingly RISC-V executes
    fewer instructions.

    Than what?

    Here's the instruction counts I get for gforth-fast onebench.fs:

    2244358492 AMD64
    1897389481 ARM A64
    2170142765 rv64gc

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Mon Jan 8 10:21:21 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Sun, 07 Jan 2024 14:30:53 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> tried to write:

The competitive demands Intel and AMD face - the desires of us as consumers - are what prevents this from happening, and I see no hope for the world to change to what might be seen as the path of virtue in this area.

    Nobody forces you to replace your CPU with one with a denser process. If
    you want to use a 32nm CPU, get, e.g., an Intel Sandy Bridge. Or you
    can get a Raspi 3, where the SoC is made in 40nm (according to
    <https://wikimovel.com/index.php?title=Broadcom_BCM2837>), and which
    uses in-order processing as well.

    What I wrote didn't contradict what you are saying in your response.

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us
    to buy newer and faster microprocessors, by refusing to continue
    issuing security updates for Windows 7, or, for that matter,
    Windows XP, Windows 98, or even Windows 3.1.

    Windows 10 works on pretty old hardware. However, for Windows 11
tricks are required to make it run on anything but relatively recent hardware.

    As for "us", speak for yourself. If Microsoft does not support the
    hardware I own on any supported Windows, I certainly won't buy new
    hardware for it. It's only the game operating system for me.

    Instead, what I wrote said that we, as consumers, are so greedy
    for ever faster computers that we are the ones to blame for forcing
    Intel and AMD to resort to techniques that require expensive fabs
    to make the chips, and that require the chips to have enormous
    numbers of transistors for each individual core.

    There is certainly something to that, because the highest-performing
    CPUs are bought at a big premium compared to slightly slower ones.
    And in particular, you can buy cheap CPUs (like the Ryzen 5600G) or
    cheap systems like the GIGABYTE Brix GB-BMCE-4500C. But guess what,
    these also use TSMC 7nm or Intel 7 processes, so going for these
    processes does not seem that excessive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Mon Jan 8 14:31:00 2024
    mitchalsup@aol.com (MitchAlsup) writes:
    Thomas Koenig wrote:

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
.LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    How did 17 adds become 8 ??

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    It didn't. Thomas only showed the first cache line.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Mon Jan 8 14:39:26 2024
    BGB <cr88192@gmail.com> writes:
    On 1/7/2024 10:41 PM, George Neuner wrote:
    On Sat, 6 Jan 2024 11:48:40 -0600, BGB <cr88192@gmail.com> wrote:

    Windows merits:
    More software support;
    Has nearly all of the games;
No endless fights with trying to get the GPU and sound hardware working;
Much less needing to fight with hardware driver issues in general;

    - DLLs can have private heaps


    Pros/cons it seems. I would have considered this a con.

    Indeed. A huge con.

    FWIW, it's not that difficult to implement a private heap in unix
    if needed (e.g. a pool allocator built on brk() or mmap()).
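
As a rough illustration of the sort of thing Scott means (nothing he
posted; priv_heap_create/priv_alloc are made-up names, and this is only
a bump allocator on top of mmap()):

    #include <stddef.h>
    #include <sys/mman.h>

    struct priv_heap { char *base; size_t size; size_t used; };

    /* Reserve a private arena for one library with an anonymous mapping. */
    static int priv_heap_create(struct priv_heap *h, size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return -1;
        h->base = p; h->size = size; h->used = 0;
        return 0;
    }

    /* Bump allocation; a real pool would add a free list and growth. */
    static void *priv_alloc(struct priv_heap *h, size_t n)
    {
        n = (n + 15) & ~(size_t)15;          /* keep 16-byte alignment */
        if (n > h->size - h->used)
            return NULL;
        void *p = h->base + h->used;
        h->used += n;
        return p;
    }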

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Mon Jan 8 18:54:51 2024
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Fortunately, the assembler will do this for you:

    [tkoenig@cfarm120 ~]$ cat foo.s
    .file "add.c"
    .machine power10
    .abiversion 2
    .section ".text"
    .align 2
    .p2align 4,,15
    .globl foo
    .type foo, @function
    foo:
    ..LFB0:
    .cfi_startproc
    .localentry foo,1
    addi 3,3,0
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    paddi 3,3,3735928559
    blr
    .long 0
    .byte 0,0,0,0,0,0,0,0
    .cfi_endproc
    [tkoenig@cfarm120 ~]$ gcc -c foo.s
    [tkoenig@cfarm120 ~]$ objdump -d foo.o

    foo.o: file format elf64-powerpcle


    Disassembly of section .text:

    0000000000000000 <foo>:
    0: 00 00 63 38 addi r3,r3,0
    4: ad de 00 06 paddi r3,r3,3735928559
    8: ef be 63 38
    c: ad de 00 06 paddi r3,r3,3735928559
    10: ef be 63 38
    14: ad de 00 06 paddi r3,r3,3735928559
    18: ef be 63 38
    1c: ad de 00 06 paddi r3,r3,3735928559
    20: ef be 63 38
    24: ad de 00 06 paddi r3,r3,3735928559
    28: ef be 63 38
    2c: ad de 00 06 paddi r3,r3,3735928559
    30: ef be 63 38
    34: ad de 00 06 paddi r3,r3,3735928559
    38: ef be 63 38
    3c: 00 00 00 60 nop
    40: ad de 00 06 paddi r3,r3,3735928559
    44: ef be 63 38

    How did 17 adds become 8 ??

    I didn't paste the rest because I felt it was irrelevant to the
    main point illustrated: The assembler will insert nops as
    required.

    (You might also note the lack of the final BLR).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Mon Jan 8 21:00:05 2024
    BGB wrote:

    On 1/8/2024 3:59 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:

    If even VLIW could not compete with ARM in a certain embedded niche,
    why should LIW?


    Because:
    True VLIW relies heavily on being able to extract a good amount of ILP,
    but falls on its face if not enough ILP is available;

    Note:: GBOoO machines can extract parallelism with instructions hundreds
of instructions apart, whereas VLIW compilers cannot. Most of this
parallelism is causal, not absolute (2 potentially aliasing addresses did not actually alias this iteration).

    A vaguely RISC-like LIW design (similar to ESP32 or similar), can still
    be performance competitive even with fairly meager ILP (where
    effectively it functions like a normal RISC just with explicit tagging
    rather than a superscalar fetch).

    But, if RISC-V is run with similar restrictions on the pipeline, for
    some of the programs tested (such as Doom), it seems to require
    executing around twice as many instructions for a similar amount of work >>> (*).

    The design philosphy of RISC-V favours having simple instructions and
    combining them in the decoder over providing combined instructions, so
    one would expect more executed instructions for RV64G(C) than for ARM
    A64, which favours a fixed 32-bit format with instructions that do as
    much as fits in 32 bits (i.e., precombined instructions). But if
    combining in the decoder works, that does not mean that the programs
    take longer to execute with a similarly capable back-end.


    Combining stuff in the decoder is expensive though...

    Only in power and area.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Mon Jan 8 21:22:43 2024
    I have made another change to the instruction formats, despite
    feeling that they are now largely finished, so that I'm ready
    to start defining the opcodes for all the instructions.

Earlier, since it was noted that my 15-bit short instructions
would be difficult for a compiler to work with, given the
    restriction that the source and destination registers belong to the
    same group of eight registers within a bank of 32 registers, I
    replaced them with 17-bit short instructions, only available within
    blocks of variable-length instructions.

    I've brought back the 15-bit short instructions, but now only within
    an alternate or supplementary set of 32-bit instructions. This way,
    the main benefit of getting rid of the 15-bit instructions from my
    perspective - avoiding any address mode restrictions on the basic
    load-store memory-reference instructions - is retained.

I felt that, as the restriction on the 15-bit instructions was a good
fit to the ISA having VLIW capabilities - implying that under some
circumstances a coding style suited to dealing with an exposed pipeline
would be used - they were still useful. Having short instructions
available within code composed of 32-bit instructions, rather than code
with variable-length instructions, would also promote compact code, and
would complement the earlier feature of composed instructions, which makes
instructions longer than 32 bits available without going to
variable-length instructions.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Jan 9 00:38:35 2024
    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within an alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths
    of wretched excess to be plunged into, I have now added two additional alternate sets of 32-bit instructions, for a total of three.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to quadibloc@servername.invalid on Mon Jan 8 21:55:30 2024
    On Mon, 8 Jan 2024 01:12:32 -0000 (UTC), Quadibloc <quadibloc@servername.invalid> wrote:

    On Mon, 08 Jan 2024 00:44:23 +0000, MitchAlsup wrote:
    Quadibloc wrote:

    I am not saying that Intel and AMD are forcing us to buy newer and
    faster microprocessors. (I could say that *Microsoft* is forcing us to
    buy newer and faster microprocessors, by refusing to continue issuing
    security updates for Windows 7, or, for that matter, Windows XP,
    Windows 98, or even Windows 3.1. Then I would be disagreeing with you,
    but I wasn't getting into that part of the issue.)

    I am calling Strawman on this::

    I am of the opinion that the SW that arrives with a box/laptop be the
    same over the lifetime of the product. I turn all updating of SW off and
    remove power at time I am not using the device to prevent MS from
    updating things I DON'T want updated--this includes security patches.

The fact that I am seeing your posts here means that you _have_ tried connecting a computer to the Internet. Which invalidates the first
    response to that which comes to mind.

Perhaps you use Linux or something. But as Windows users know very well through sad experience is that if you don't keep your computer up-to-date,
    to patch vulnerabilities in Windows as soon as they are discovered...
    your computer could end up infected within minutes of being connected to
    the Internet.

    John Savard

Worse than that ... if you don't keep Windows up to date, sooner or
    later you find some application software won't update, or won't work
    after it does update. And new software that won't install, or worse
    installs but won't run, because it depends on some feature introduced
    by a "minor" update.

    And if you do keep Windows up to date, you're likely to find devices
    that stop working.


    None of this requires an OS major *upgrade* - just an update.


    Recall the fun with NT4 SP1? How about SP3 or SP6a?
    2K with SP2?
    XP with SP1 and SP3?
    Win7 following the April 2015 service stack update?
    Win10 "versions" 1709 and 2004?

    [Didn't use Win8.x and not likely to touch Win11.]

    YMMV,
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jan 9 06:50:00 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.

    Ouch. This means that Power with prefixed instructions is the second
    instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed,

Or by inserting NOPs in the right places; otherwise you lose the functionality for Power10.

    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be. And the benefit we get
    from code-block copying and everything that builds on it far exceeds
    what the prefix instructions are likely to buy. E.g., on Power 10
    (numbers are times in seconds):

            sieve  bubble  matrix    fib    fft
            0.075   0.099   0.042  0.110  0.032   with code-block copying
            0.181   0.184   0.123  0.230  0.119   without code-block copying

    Fortunately, the assembler will do this for you:

    It does not, because we copy (binary) machine-code blocks.

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    Yes, we copy machine code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Tue Jan 9 07:01:12 2024
    BGB <cr88192@gmail.com> writes:
    On 1/7/2024 3:21 AM, Quadibloc wrote:
    But when it comes even to the humble low-end laptop, Intel found it
    necessary to redesign their Atom processor to be a lightweight OoO
    chip, instead of the in-order design it originally had.


    Though, to be fair:
    Without OoO, x86 performance is effectively dog-crap.

    Without OoO, performance is much lower on all architectures; e.g.,
    from our LaTeX benchmark:

    Alpha:
    i 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5) 8.1
    o Compaq XP1000 21264 500MHz 4M L2 (a7) 5.5

    AMD64:
    i Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
    o AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
    o Celeron J1900 (Silvermont) 2416MHz (Shuttle XS35V4) Ubuntu16.10 1.052
    o Xeon W-1370P (=Core i7-11700K), 5200MHz, Debian 11 (64-bit) 0.175

    ARM A64:
    i Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105
    o Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
    o Apple M1 Firestorm 3000MHz Asahi Linux Debian pre12 0.27

    "i" stands for in-order, and I present the best in-order result for
    the respective architecture. "o" stands for OoO, and I present both
    results with OoO cores with width comparable to the fastest in-order
    core on the same architecture, and the fastest core available.

    The 21164 and 21264 both are 4-wide. The Atom 330, E-450, and
    Silvermont are all 2-wide. The Cortex-A55 and Cortex-A73 are both
    2-wide.

    For many other ISA's, like 64-bit ARM, the performance holds up a lot
    better, and the up-front performance gains from in-order to OoO seems to
    be comparably smaller.

    Really? If we believe these results, the Cortex-A55 and the Intel
    Atom 330 show exactly the same performance/MHz (caveat: the Debian 11
    version of LaTeX, especially with the recommended extensions, probably
    does more work). The Cortex-A73's speedup over the A55 is slightly
smaller than that of the E-450 or Silvermont over the Atom 330, but
    then the A73 is a slightly older architecture than the A55 (and
    manufactured in a less advanced process), while the E-450 and
    Silvermont are younger (and manufactured in a more advanced process)
    than the Atom 330.

    Since, throw a crappy codegen at an x86, and it will happily accept it
    and run at nearly the same speed as the better codegen;

    What makes you think so?

    but throw it at
    an A53, and one find that it seemingly performs 3x-5x worse than the
    code that GCC produces

    I have seen speed differences by a factor of 3 or more between gcc -O0
    and gcc -O on various IA-32 and AMD64 implementations, as well as on
    other architectures.

I have noticed, though, that Intel and AMD engineers have worked over
    the years to get rid of some of the performance kinks that exist in
    older implementations and that are more often seen in implementations
    of other architectures. One example that I can think of is the
    performance of unaligned accesses. But code that keeps variables in
    memory rather than registers still is slow; yes, zero-cycle
    store-to-load forwarding helps, but even a Golden Cove can only
    perform IIRC 2 loads and 2 stores per cycle, whereas it can perform
    at least 10 (architectural) register reads and 5 register writes per
    cycle.
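
A tiny example of the kind of code where this shows up (my illustration,
not anything Anton posted; exact code generation of course varies by
compiler version):

    /* At -O0 gcc typically keeps i and sum in stack slots, so each
       iteration does several loads and stores; at -O1 and above they
       live in registers and the loop body is essentially one load
       plus one add. */
    long sum_array(const long *a, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }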

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Tue Jan 9 07:56:21 2024
    BGB <cr88192@gmail.com> writes:
    On 1/8/2024 3:59 AM, Anton Ertl wrote:
    If all programs used just Dalvik, yes, you would "just" need to write
    a Dalvik implementation for your VLIW. But reality is, that there are
    enough programs that are written or have components distributed as
    native code to make your non-ARM architecture uncompetetive, even with
    a working binary translator.


    Possibly so.

But, there were things like Atom-based Android devices, and stuff still worked there as well, so...

    Yes, and at one point we got a report from someone who used such a
    tablet, and saw ARM code when he asked Gforth to show the code for a
    primitive. It turned out that, despite Android having fat binaries
    (which include code for multiple architectures) or somesuch, and us
    building them, for some reason the ARM code was run under emulation
    rather than the native code for Intel CPU. The emulation apparently
    provided the functionality just fine, but I doubt that the performance
    was good in the case of Gforth (lots of indirect branches, a worst
    case for binary translators).

    In any case, Intel found that they could not compete (with a profit)
    in that area and stopped developing new SoCs for smartphones and
    tablets. Which supports my claim.

    Trimedia (and the TMS320C6x) line differ partly in that they were true
    VLIW, rather than "LIW". So, in this case, I was imagining something
    more similar to the ESP32 (LX6) or Qualcomm Hexagon or similar.

    If even VLIW could not compete with ARM in a certain embedded niche,
    why should LIW?


    Because:
    True VLIW relies heavily on being able to extract a good amount of ILP,
    but falls on its face if not enough ILP is available;
    A vaguely RISC-like LIW design (similar to ESP32 or similar), can still
    be performance competitive even with fairly meager ILP (where
    effectively it functions like a normal RISC just with explicit tagging
    rather than a superscalar fetch).

    Ok, so in embedded systems there are too few widely-used application
    scenarios with wide ILP to make VLIW development profitable. That
    would not surprise me. I guess, that just like SIMD was the form of
    explicit parallelism that provided a large part of the benefits in the
    niche where IA-64 shone, but with less cost, the same happened with
    TriMedia.

    Still, if VLIW could not compete, why should LIW? The benefit of not
    having to check for register dependencies is small for 2-wide CPUs.

    It seems to me that the thing that sells the ESP32 and ESP32-S2/S3 was
    not the architecture of their core but the SoCs they are in, and
    especially the Wi-Fi capability.

    The design philosphy of RISC-V favours having simple instructions and
    combining them in the decoder over providing combined instructions, so
    one would expect more executed instructions for RV64G(C) than for ARM
    A64, which favours a fixed 32-bit format with instructions that do as
    much as fits in 32 bits (i.e., precombined instructions). But if
    combining in the decoder works, that does not mean that the programs
    take longer to execute with a similarly capable back-end.


    Combining stuff in the decoder is expensive though...

    Apparently cheap enough that the RISC-V people decided that that is
    preferable to having more instructions or more addressing modes.

But, what if one can have something that is at least a little more performance competitive, but also "free and open" like RISC-V?

Performance is a property of the implementation, not the architecture. Especially these days. I expect that one could even make a performance-competitive VAX these days. Not that RISC-V poses the
    same kind of hurdles to the implementor as VAX.

    But, still kinda "glass cannon" performance on the A53, it seems to
    behave like it does something like:
    Look at two instructions;
    Can we run these in parallel?
    If yes, do so.
    If no, execute each sequentially.
    With full latency penalties if you try to load something and then
    immediately do arithmetic on it, ...

    Sure, that's what you get with an in-order implementation. You can
    twist the pipeline like the i486 and Pentium and especially the
    Bonnell (Atom) have done: On the Bonnell you can load a value and have
    a zero-cycle latency to the computation. But if you make a
    computation and use the result as address in a load, that costs IIRC 4
    cycles on the Bonnell.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Tue Jan 9 09:25:10 2024
    On Tue, 09 Jan 2024 00:38:35 +0000, Quadibloc wrote:

    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within an
    alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths of wretched excess to be plunged into, I have now added two additional
    alternate sets of 32-bit instructions, for a total of three.

    I have now done something more important: after showing the prefix
    bits which make for Composed Instructions, I now show the formats
    of those instructions themselves on the page

    http://www.quadibloc.com/arch/cw010201.htm

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Tue Jan 9 08:47:51 2024
    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    There is a restriction that the prefixed instructions cannot
    cross a 64-byte boundary.
    Ouch. This means that Power with prefixed instructions is the second
    instruction set (after MIPS with its architectural delayed loads)
    where concatenating instruction blocks between two labels may result
    in invalid code; on all other (~10) instruction sets I looked at this
    works fine, including IA-64. Fortunately, for Power that's easy to
    fix by compiling with -mno-prefixed,
    Or by inserting NOPs in the right places; otherwise you lose the
    functionality for Power10.

    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be. And the benefit we get
    from code-block copying and everything that builds on it far exceeds
    what the prefix instructions are likely to buy. E.g., on Power 10
    (numbers are times in seconds):

    sieve bubble matrix fib fft
    0.075 0.099 0.042 0.110 0.032 with code-block copying
    0.181 0.184 0.123 0.230 0.119 without code-block copying

    Fortunately, the assembler will do this for you:

    It does not, because we copy (binary) machine-code blocks.

    So, unless you prefer to write direct machine code, this should
    not be an issue.

    Yes, we copy machine code.

    - anton

    How about for POWER10 prefixed instructions always emit them as

    prefix
    inst
    nop

Then, when you copy the code block, check the 64B boundary.
If the prefix and inst cross it, then move the nop up and the prefix,inst pair down:

    nop
    prefix
    inst

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Tue Jan 9 16:37:16 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    [Code-block copying]
    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be.

    How about for POWER10 prefixed instructions always emit them as

    prefix
    inst
    nop

    Then when you copy the code block check the 64B boundary.
    If the prefix and inst cross it then move the nop up and prefix,inst down

    nop
    prefix
    inst

    As mentioned, the code blocks are opaque to the copying technique; the
    program that copies knows nothing about the instructions in the code
    block, and in particular it would not know whether it contains a
    Power3.1 prefix instruction and where. It also does not know whether
    it ends in a MIPS load instruction (another problematic case).

    Fortunately it is easy to avoid the prefix instructions altogether, so
    that's what we have done. The MIPS case is harder, and MIPS also
    causes other trouble, so we just disabled code-copying there. Maybe
    for more mainstream architectures we would have gone to greater
    lengths, but they no longer are mainstream.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Tue Jan 9 19:19:46 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    [Code-block copying]
    The instruction blocks are opaque for this technique, so there is no
    way to know where "the right places" would be.

    How about for POWER10 prefixed instructions always emit them as

    prefix
    inst
    nop

    Then when you copy the code block check the 64B boundary.
    If the prefix and inst cross it then move the nop up and prefix,inst down

    nop
    prefix
    inst

    As mentioned, the code blocks are opaque to the copying technique; the program that copies knows nothing about the instructions in the code
    block, and in particular it would not know whether it contains a
    Power3.1 prefix instruction and where.

    The difficulty of recognizing a Power Prefix instruction is low: It
    has major opcode 1.

    However, changing the position of instructions requires handling
    relocations in branches, which is probably not what you want to do.

    I have to say that your application is the first one I ever
    heard about that just pastes binary blobs of executables
    together. How do you manage branches which exceed the normal
    range (or is this something that cannot happen)?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Jan 9 22:26:24 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The difficulty of recognizing a Power Prefix instruction is low: It
    has major opcode 1.

    The difficulty of using -mno-prefixed is lower:-)

    However, changing the position of instructions requires handling
    relocations in branches, which is probably not what you want to do.

    If we wanted to cater to prefixed instructions, the way to go would be
    to insert enough noops before a code block containing a prefixed
    instruction that none of the prefixed instructions in the code block
    would violate the 64-byte-boundary restriction; at worst this means
    inserting as many noops as needed to have it aligned in the same way
    as in its original place.
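
In other words, the padding only has to reproduce the block's original
position within a 64-byte line; a sketch of that calculation (my
illustration, assuming 4-byte nops and 4-byte-aligned addresses):

    #include <stddef.h>
    #include <stdint.h>

    /* Number of 4-byte nops to emit before the copied block so that it
       starts at the same offset within a 64-byte line as the original. */
    static size_t pad_words(uintptr_t orig_start, uintptr_t dest_start)
    {
        size_t orig_off = (orig_start & 63) / 4;   /* word offset in line */
        size_t dest_off = (dest_start & 63) / 4;
        return (orig_off - dest_off) & 15;         /* 0..15 nops          */
    }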

    Concerning relocation, we copy only relocatable blocks. That is
    checked by compiling the same source code for the block twice, in two functions, with one function having padding between the blocks. If
    the resulting code blocks contain the same bytes, they are relocatable
    and can be used for this technique. If not, one has to fall back to
    jumping to the original code block for this piece of code.
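
The check itself then reduces to a byte comparison of the two compiled
copies; a minimal sketch (block1, block2 and len stand for whatever the
build machinery located, not Gforth's actual names):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* If the same source fragment compiled at two different offsets
       produced identical bytes, nothing in it depends on its absolute
       position, so the block may be copied elsewhere. */
    static bool block_is_relocatable(const unsigned char *block1,
                                     const unsigned char *block2,
                                     size_t len)
    {
        return memcmp(block1, block2, len) == 0;
    }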

    I have to say that your application is the first one I ever
    heard about that just pastes binary blobs of executables
    together. How do you manage branches which exceed the normal
    range (or is this something that cannot happen)?

    A code block may have internal branches, but otherwise all control
    flow is performed through indirect branches. The branch targets, just
like any other virtual-machine-level immediate data, are accessed
    through a VM instruction pointer.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 04:12:44 2024
    On Tue, 09 Jan 2024 09:25:10 +0000, Quadibloc wrote:

    On Tue, 09 Jan 2024 00:38:35 +0000, Quadibloc wrote:

    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within
    an alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths of
    wretched excess to be plunged into, I have now added two additional
    alternate sets of 32-bit instructions, for a total of three.

    I have now done something more important: after showing the prefix bits
    which make for Composed Instructions, I now show the formats of those instructions themselves on the page

    http://www.quadibloc.com/arch/cw010201.htm

    And now I'm getting even more serious about completing the description
    of instruction formats, so as to move on to listing the opcodes. On the
    page

    http://www.quadibloc.com/arch/cw01.htm

    I have now shown the layouts of the 32-bit forms of the short vector and
    long vector instructions.

    To find opcode space for them, I've had to put them in the third alternate
    set of 32-bit instructions, so a block header is required to access them.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 05:08:19 2024
    On Wed, 10 Jan 2024 04:43:29 +0000, Quadibloc wrote:

    If the first bit of an instruction prefix is 1, it will be a leftward
    decoded prefix. Leftward decoded prefixes allow opcode space to be
    shared between prefixes that make different kinds of modifications to an instruction depending on what kind of instruction it is.

    One way of trying to make it clear what a leftward-decoded prefix
    is about is to describe how it would look in the book describing the processor's instruction set.

    Rightward-decoded prefixes could be mostly described in their own
    section of the manual. Either they do the same thing to all
    instructions, or they create their own entirely new instruction set.

    On the other hand, in the section on leftward-decoded prefixes, all
    one could really give is the bit patterns that define a 16-bit prefix,
    a 32-bit prefix, a 48-bit prefix, and so on (should prefixes longer than
    48 bits ever be desired!)

    Instead, under the description of *each instruction*, there would be
    a section saying "if a 16-bit prefix is applied to this instruction,
    it will have these fields in this order, and their functions will be..."
    and the same for any other prefix length that applies to the instruction.

    That way, the complexity of the instruction set doesn't have to be
    duplicated in additional bits in the prefixes themselves: the instruction prefixes the prefix before the now-defined prefix prefixes the instruction.

    Exactly in what way is instruction decoding supposed to be "simple"
    in Concertina II, you might ask. In this way: decoding is strictly
    linear. Not linear in the sense of a straight-line; no, the path of
    decoding may be truly labyrinthine. But when you fetch a 256-bit block,
    you start by checking for a header, and then you do exactly what it
    tells you to do... at each point, the decoder is told what to do, and
    where to look next.

    No backtracking, no speculation.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 04:43:29 2024
    On Tue, 09 Jan 2024 00:38:35 +0000, Quadibloc wrote:

    On Mon, 08 Jan 2024 21:22:43 +0000, Quadibloc wrote:

    I've brought back the 15-bit short instructions, but now only within an
    alternate or supplementary set of 32-bit instructions.

    Not being able to restrain myself when there are yet further depths of wretched excess to be plunged into, I have now added two additional
    alternate sets of 32-bit instructions, for a total of three.

    As I've noted, the third alternate set of 32-bit instructions has
    turned out to be quite useful, as it finally provided the opcode
    space needed for short vector and long vector instructions!

    I also added a mechanism whereby these additional sets of 32-bit
    instructions may be used from within blocks of variable-length
    instructions. A set of fourteen *convert* bits indicate, when
    set to 1, that the corresponding 16-bit part of the block is the
    start of a 32-bit instruction - and the *prefix* bits of the
    portion of the header that made the block a block of variable-length instructions which correspond to that part of the block now indicate
    which 32-bit instruction set the instruction belongs to, instead
    of having their usual function.

    Having the convert bit equal to 1 and the prefix bits equal to 00
    would be a redundant way of indicating a regular 32-bit instruction,
    already indicated by the prefix bits equal to 10 when the convert
    bit is not used.

    Originally, I noted that this _could_ be used instead for an extra
    set of 32-bit instructions, unique to blocks of variable-length
    instructions. However, I also thought that this use would be kind of extravagant and silly.

    Now I've come up with a way to use this bit combination that is instead
    more specifically relevant to blocks of variable-length instructions.

    When the convert bit is 1, and the prefix bits are 00, let that indicate
    that the 16 bits referenced are the start of an _instruction prefix_.

    The instruction being prefixed will have to have its own prefix bits set
    to 11 all the way through, including at its first 16 bits, so that no
    attempt will be made to decode it without taking the instruction prefix
    into account.
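
    A small C sketch of that rule table as read from the description above;
    the names are invented for illustration, and only the two-bit prefix
    field values are taken from the text.

      enum Kind { PREFIX_BITS_USUAL, ALT_32BIT_INSN, INSN_PREFIX };

      /* convert = 0          -> the prefix bits keep their usual function
         convert = 1, p != 00 -> start of a 32-bit instruction from an alternate set
         convert = 1, p == 00 -> start of an instruction prefix                     */
      static enum Kind classify(int convert, unsigned prefix2 /* two bits */) {
          if (!convert)
              return PREFIX_BITS_USUAL;
          if (prefix2 == 0)
              return INSN_PREFIX;
          return ALT_32BIT_INSN;        /* prefix2 selects which alternate set */
      }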

    Because I want decoding to be extremely straightforward in Concertina II,
    aside from the complexity caused by the vast number of instruction formats,
    I have realized that I am going to need to define two general categories
    of instruction prefixes.

    If the first bit of an instruction prefix is 0, it will be a rightward
    decoded prefix. Such a prefix can have functions like: selecting a set
    of additional instructions completely unrelated to anything existing in
    the ISA, or just doing something extremely simple, like adding opcode bits
    to whatever instruction it follows.

    If the first bit of an instruction prefix is 1, it will be a leftward
    decoded prefix. Leftward decoded prefixes allow opcode space to be shared between prefixes that make different kinds of modifications to an
    instruction depending on what kind of instruction it is. Such prefixes
    are useful for dealing with the cases where I had to severely limit the addressing modes of a kind of instruction to fit it into 32 bits; instead
    of having bits in the prefix to indicate which of a large number of cases
    of this the prefix addresses, it can be indicated by the nature of the instruction being modified.
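
    As a one-line illustration of the split (assuming "first bit" means the
    most significant bit of the prefix's first 16 bits, which is a guess):

      #include <stdint.h>

      enum Dir { RIGHTWARD_DECODED, LEFTWARD_DECODED };

      static enum Dir prefix_direction(uint16_t first_halfword) {
          return (first_halfword & 0x8000u) ? LEFTWARD_DECODED : RIGHTWARD_DECODED;
      }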

    Both leftward-decoded and rightward-decoded prefixes may be longer than
    16 bits (in the case of the leftward decoded kind, this has to be
    decoded within the prefix before leftward decoding starts) and may act
    on instructions of lengths longer than 32 bits. But they may not act
    on 17-bit instructions, since the prefix field corresponding to the
    start of a prefixed instruction is forced to be 11, and thus is not
    available to indicate the presence of a 17-bit instruction (as well
    as its first bit).

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 07:48:57 2024
    On Wed, 10 Jan 2024 05:08:19 +0000, Quadibloc wrote:

    the
    instruction prefixes the prefix before the now-defined prefix prefixes
    the instruction.

    Which may make you think of this famous work of art:

    https://www.artchive.com/artwork/drawing-hands-maurits-cornelis-escher-1948/

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 14:56:17 2024
    On Wed, 10 Jan 2024 04:43:29 +0000, Quadibloc wrote:

    When the convert bit is 1, and the prefix bits are 00, let that indicate
    that the 16 bits referenced are the start of an _instruction prefix_.

    The instruction being prefixed will have to have its own prefix bits set
    to 11 all the way through, including at its first 16 bits, so that no
    attempt will be made to decode it without taking the instruction prefix
    into account.

    I have decided to indeed add instruction prefixes for use with
    variable-length instructions to the instruction set, but *not* to
    require the use of the additional header with the "convert" bit
    for them.

    Instead, instruction prefixes will use some of the unused space at the
    end of the opcode space of the 17-bit short instructions.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Jan 10 17:02:20 2024
    On Wed, 10 Jan 2024 14:56:17 +0000, Quadibloc wrote:

    I have decided to indeed add instruction prefixes for use with variable-length instructions to the instruction set, but *not* to
    require the use of the additional header with the "convert" bit
    for them.

    Instead, instruction prefixes will use some of the unused space at the
    end of the opcode space of the 17-bit short instructions.

    At least, in a minor, token victory for sanity, I decided that instruction prefixes longer than 16 bits (17 bits? 13 bits?) will not be entertained.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 16:57:38 2024
    I do not see any hope for ISA excellence.
    Why? MY 66000 exists, and it is excellent.
    Though, the real proof would be if it can be implemented effectively on
    a typical Spartan or Artix class FPGA and also deliver on some of the other claims while doing so (and at a decent clock speed).

    History has shown (RISC-vs-CISC being a prime example) that changes to
    the underlying technology affect which ISA performs best.
    I have the impression that My 66000 is probably not best suited for
    an FPGA implementation.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 17:17:10 2024
    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline

    Sorry, what's "Calk"?

    Oh, and what's "BR" (oh, wait, do you mean that the two "Label"s don't
    have to be the same, so you're talking about calling Label1 and setting
    the return address to Label2? Right, yes, that must be it, sorry for
    being dense).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Jan 10 22:32:50 2024
    Stefan Monnier wrote:

    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline

    Sorry, what's "Calk"?

    A calculation instruction {ADD, AND, ...}

    Oh, and what's "BR" (oh, wait, do you mean that the two "Label"s don't
    have to be the same, so you're talking about calling Label1 and setting
    the return address to Label2? Right, yes, that must be it, sorry for
    being dense).

    Yes, call somewhere and change the return address to that of the BR.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 18:10:12 2024
    In time, something surely will happen to change matters, and new
    computer architectures will rise up to prominence. Right now, though,
    signs of movement away from x86 to something else are few.

    Really? AFAIK x86 is mostly popular for "personal computers", but the
    21st century has moved back to "mainframes" (farms of servers, where x86
    is still common, but ARM is a serious competitor), accessed from
    "weak" devices (smartphones and tablets, mostly using ARM), via network
    devices (using a variety of dedicated hardware, where x86 doesn't seem particularly popular).

    The x86 is not about to disappear, but I think there is a clear movement
    away from it.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jan 11 08:02:18 2024
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    History has shown (RISC-vs-CISC being a prime example) that changes to
    the underlying technology affect which ISA performs best.

    Has it? RISCs were first (in general-purpose computing) in the 1980s
    with pipelining, first with in-order superscalar implementations in
    the early 1990s (SuperSPARC, 88110, 21064), but OOO was introduced in
    IA-32 (with the Pentium Pro) one day earlier than in HPPA (PA-8000).

    The underlying technology may have made the architectural advantages
    of RISCs smaller, but I think that economic and project management
    aspects caused IA-32 to gain the performance advantage that they have
    had between about 2000-2020. The economic advantage was that there
    was more revenue in IA-32 than especially in the split landscape of
    the RISCs.

    Concerning project management, the early RISC implementations could be
    designed by small teams quickly. Designing superscalar and OoO
    implementations meant that teams had to become larger, and the
    projects longer, and project delays showed the teething problems that
    some of the teams had.

    And given these problems, several of the companies were just too happy to
    jump ship when the supposed saviour in the form of IA-64 appeared.

    The funny thing is that HP, MIPS and DEC already had OoO CPUs with
    SIMD instructions (the technologies that outcompeted IA-64). Maybe if
    Intel had made a cleaned-up 64-bit i960 instead of IA-64 as the one architecture to rule them all, and then did an OoO implementation of
    that, we all would be using i960-64 nowadays; but I guess that IA-64
    had the better roadmaps, and that it was easier to convince people in
    HP, MIPS and DEC with IA-64 with its promising new features, while
    just another RISC would have caused the reaction "we already have a
    fine 64-bit RISC, why switch to i960-64"? OTOH, switching to another
    RISC worked for Motorola.

    The irony is that the i960 team was redirected in 1990 to design the
    P6 (Pentium Pro).

    Anyway, back to the original question: Their economic model caused ARM
    T32 (and later A64) to become *the* smartphone architecture, and the
    demands of smartphone apps caused economic pressure for higher
    performance at low power, and despite Intel's attempts to break into
    that market, they failed to make substantial inroads and eventually
    gave up; this may have been due to network effects, but they also seem
    to have problems reaching comparable performance at mobile power
    points.

    And if you compare the performance of the Apple Firestorm (ARM A64) to
    Intel and AMD P-cores at comparable power points, Firestorm looks
    pretty good. Even comparing to CPUs with a desktop power budget, the
    Firestorm does not fall far behind <https://images.anandtech.com/doci/16192/spec2006_A14_575px.png> (from <https://www.anandtech.com/show/16192/the-iphone-12-review/2>; I think
    Andrei Frumusanu provided other measurements where the distance was
    even smaller; ah, there we are: <https://images.anandtech.com/doci/16983/SPECint-energy_575px.png>
    from <https://www.anandtech.com/print/16983/the-apple-a15-soc-performance-review-faster-more-efficient>;
    note how he does not give Joule results for the AMD64
    implementations). Maybe the people who designed the Firestorm are
    more capable than those who designed Skylake and Zen3, or maybe the
    advantages of the ARM A64 architecture allow them to provide more
    performance than AMD64 at the same power.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Thu Jan 11 19:42:11 2024
    BGB <cr88192@gmail.com> writes:
    On 1/11/2024 2:02 AM, Anton Ertl wrote:


    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline. It is also comparably infrequent, as "*ptr++" isn't
    used *that* often.

    There are certainly far more use cases for post increment than
    *ptr++ (which is actually a fairly common construct in library code
    like the string functions or a naive memcpy).

    ARM64, for example uses it to push the frame pointer and link
    register in a function prologue.

    401008: a9b87bfd stp x29, x30, [sp,#-128]!

    Other examples (from the compiled elf for the coremark program):

    40edf0: a9c4382d ldp x13, x14, [x1,#64]!
    40ee9c: f85f8c23 ldr x3, [x1,#-8]!
    40eea0: f81f8cc3 str x3, [x6,#-8]!
    40eea8: b85fcc23 ldr w3, [x1,#-4]!
    40eeac: b81fccc3 str w3, [x6,#-4]!
    40eeb4: 785fec23 ldrh w3, [x1,#-2]!
    40eeb8: 781fecc3 strh w3, [x6,#-2]!
    40eedc: f85f8c23 ldr x3, [x1,#-8]!
    40eee0: f81f8cc3 str x3, [x6,#-8]!
    40eee8: b85fcc23 ldr w3, [x1,#-4]!
    40eeec: b81fccc3 str w3, [x6,#-4]!
    40eef4: 785fec23 ldrh w3, [x1,#-2]!
    40eef8: 781fecc3 strh w3, [x6,#-2]!
    40ef00: 385ffc23 ldrb w3, [x1,#-1]!
    40ef04: 381ffcc3 strb w3, [x6,#-1]!
    40ef18: a9fc2027 ldp x7, x8, [x1,#-64]!
    40ef28: a9bc20c7 stp x7, x8, [x6,#-64]!
    40ef8c: a9fc382d ldp x13, x14, [x1,#-64]!
    40efa8: a9bc38cd stp x13, x14, [x6,#-64]!
    40efac: a9fc382d ldp x13, x14, [x1,#-64]!
    40efc4: a9bc38cd stp x13, x14, [x6,#-64]!
    40f110: a9c3382d ldp x13, x14, [x1,#48]!
    40f12c: a98438cd stp x13, x14, [x6,#64]!
    40f130: a9c4382d ldp x13, x14, [x1,#64]!
    40f234: a9841d07 stp x7, x7, [x8,#64]!
    40f350: a9bd7bfd stp x29, x30, [sp,#-48]!
    40f410: a9bb7bfd stp x29, x30, [sp,#-80]!
    40f4f8: a9b97bfd stp x29, x30, [sp,#-112]!
    40f748: a9bd7bfd stp x29, x30, [sp,#-48]!
    40f778: a9bb7bfd stp x29, x30, [sp,#-80]!
    40f8c8: b8404cc5 ldr w5, [x6,#4]!
    40f918: b85fcc22 ldr w2, [x1,#-4]!
    40f940: a9bb7bfd stp x29, x30, [sp,#-80]!
    40fa50: a9ba7bfd stp x29, x30, [sp,#-96]!
    40fbe0: b85fcc43 ldr w3, [x2,#-4]!
    40fbe4: b85fcc24 ldr w4, [x1,#-4]!
    40fc10: a9bb7bfd stp x29, x30, [sp,#-80]!
    40fd20: b85fcc81 ldr w1, [x4,#-4]!
    40fdd8: a9ba7bfd stp x29, x30, [sp,#-96]!
    40ff20: a9ba7bfd stp x29, x30, [sp,#-96]!
    40ff6c: b81f8c14 str w20, [x0,#-8]!
    410050: a9bb7bfd stp x29, x30, [sp,#-80]!
    4101e8: b85fcc20 ldr w0, [x1,#-4]!
    410248: a9b87bfd stp x29, x30, [sp,#-128]!
    4103c0: f8410c60 ldr x0, [x3,#16]!
    410510: f8410c60 ldr x0, [x3,#16]!
    410644: f8410ec0 ldr x0, [x22,#16]!
    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Terje Mathisen on Fri Jan 12 04:09:30 2024
    On Mon, 13 Nov 2023 16:10:20 +0100, Terje Mathisen wrote:
    MitchAlsup wrote:
    Chris M. Thomasson wrote:

    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.

    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
    container. Only aligned containers possess ATOMIC-smelling properties.

    This is so obviously correct that you should not have needed to mention
    it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
    updates is something that should only ever be done for testing purposes.

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a
    single-byte operand, to avoid the issue.

    However, qualifications are needed to make the statement "obviously
    correct". Basically, one should never attempt an atomic operation on
    an unaligned value in memory... on a machine that does paging. Because
    the unaligned value _might_ cross a page boundary.

    Otherwise, there's no problem. And a computer certainly _could_ be
    aware that precautions are needed for atomic instructions, and
    proceed with their execution only after all the memory pages involved
    were brought into memory, and locked there. That would still mean
    the computer would be slowed unnecessarily, but error-free operation
    can be guaranteed.

    So if someone wanted, they could design a computer which didn't mind
    atomic operations on unaligned values all that much.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Fri Jan 12 08:11:06 2024
    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in
    these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these
    concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline.

    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and instructions that read four registers.

    It is also comparably infrequent, as "*ptr++" isn't
    used *that* often.

    Out of 192106 instructions in /bin/bash (in Debian 11), there are 1829 instructions with pre-increment "]!"; most of them are stp (store
    pair) instructions, and the increment is usually negative and often
    smaller than the size of the two registers, the address register is
    usually sp. So the usual use seems to be for saving caller-saved or callee-saved registers.

    Out of these 195815 instructions, 3002 use post-increment "],"; most
    of them are ldp (load pair) instructions, and the increment is usually positive, and the address register is usually sp (2688 cases). So
    most of these cases seem to be due to loading caller-saved registers
    after the call or callee-saved registers before the return.

    Overall, there are 25197 loads and stores that use sp as address
    register, out of 61387 loads and stores.

    [a76:~:536] objdump -d /bin/bash|grep "^ "|wc -l
    192106
    [a76:~:537] objdump -d /bin/bash|grep ']!'|wc -l
    1829
    [a76:~:538] objdump -d /bin/bash|grep '[[][a-z].*],'|wc -l
    3002
    [a76:~:539] objdump -d /bin/bash|grep 'sp],'|wc -l
    2688
    [a76:~:540] objdump -d /bin/bash|grep '[[]sp'|wc -l
    25197
    [a76:~:541] objdump -d /bin/bash|grep '[[][a-z]'|wc -l
    61387

    ALU status flags:
    The flags themselves are fairly rarely used in practice,

    Conditional branches tend to be quite frequent.

    but the cost of
    keeping these sorts of flags consistent in the pipeline is not so cheap.

    Intel uses as many physical flags registers as physical integer
    registers (280 each on Tigerlake and Golden Cove), ARM somewhat less
    than the integer registers. And the register renamer needs to keep
    track of them separately (for AMD64 it needs to keep track of C, O,
    and NZP separately; I expect that A64 is better in this respect).
    Yes, not cheap, but obviously manageable.

    And, possibly the cost difference between a 1-bit status flag and, say,
    4 or 5 flag bits, isn't that large. In either case, may make sense to
    limit which instructions may update flags (unlike x86)

    Actually, updating all flags in every instruction would make the
    implementation easier, too: every flag-using instruction would only be
    able to use the result of the previous instruction, no need to store
    flags longer. However, I guess it might be harder to program with
    such a model.

    My take is that GPRs should have additional carry and overflow flags
    (which are not stored and loaded with the usual store and load
    instructions); they have the information of the N and Z flags already.
    This makes tracking the flags easy, and also allows programs to deal
    with multiple live carry flags, as needed for multi-precision
    multiplication.
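
    A sketch of why several live carries help, in plain C. This is generic
    bignum code, not anything specific to a particular ISA; the __int128
    multiply is a common compiler extension standing in for a widening
    multiply instruction.

      #include <stdint.h>

      /* Accumulate one 64x64->128 partial product into a 3-limb accumulator.
         Two independent carries (c0 out of limb 0, c1 out of limb 1) are live
         at the same time, which a single architectural carry flag cannot hold. */
      static void acc_partial(uint64_t a, uint64_t b, uint64_t acc[3]) {
          unsigned __int128 p = (unsigned __int128)a * b;
          uint64_t lo = (uint64_t)p, hi = (uint64_t)(p >> 64);
          uint64_t c0, c1;
          acc[0] += lo;   c0 = acc[0] < lo;       /* carry out of limb 0 */
          acc[1] += hi;   c1 = acc[1] < hi;       /* carry out of limb 1 */
          acc[1] += c0;   c1 |= acc[1] < c0;      /* fold the limb-0 carry in */
          acc[2] += c1;
      }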

    and possibly only
    allow them in "lane 1" or whatever the equivalent is (the secondary ALUs
    only doing non-flags-updating forms).

    That's an implementation issue. Do ARM A64 implementations have such restrictions?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Fri Jan 12 15:14:00 2024
    On Fri, 12 Jan 2024 08:11:06 +0000, Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    As even the PDP-8 had auto-increment addressing, certainly
    its cost must be bearable, if cost is thought of as the
    number of transistors required to implement it. Although
    that feature seems odd for a RISC design even to me.

    As for ALU status flag bits, I think they're a feature that
    should be kept. But one concern with them relates to the
    same reason that some early RISC architectures had branch
    delay slots.

    So a common way in which this concern is mitigated is for
    RISC architectures to include, in instructions that can
    affect the condition codes, a bit that controls whether or
    not they do so. That way, other operate instructions can
    be placed between an instruction that sets the condition
    codes and the branch instruction that tests them.

    The PowerPC architecture went further, also perhaps
    addressing another concern with ALU status bits, by
    having multiple sets of condition codes, so that the
    condition codes would behave more like registers,
    rather than being a unique resource.

    To my mind, ALU status bits are at least essential
    for things like add-with-carry for multiple-precision
    arithmetic. Otherwise, one would need multiple
    awkward instructions to perform the same function.

    And since RISC typically has only load and store
    memory-reference instructions, thus limiting each
    instruction to one basic action, a design that
    apparently forces operate instructions to be
    combined with conditional branch instructions
    seems to be the opposite of RISC. I presume that
    they _don't_ solve the problem by including a
    conditional skip in operate instructions, that
    can skip over a jump instruction that follows
    them (sort of like a PDP-8!)... clearly, I'll
    need to take another look at the MIPS and/or
    the Alpha to see what it is that they _are_
    doing, to understand how it fits into the
    RISC philosophy.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Jan 12 15:19:11 2024
    On Fri, 12 Jan 2024 15:14:00 +0000, Quadibloc wrote:

    clearly, I'll
    need to take another look at the MIPS and/or
    the Alpha to see what it is that they _are_
    doing, to understand how it fits into the
    RISC philosophy.

    Oh, silly me. I remembered shortly after: what
    is combined with a branch to make a conditional
    branch is not an operate instruction, but a
    test of the contents of a specified register.

    That's clearly basic enough to fit with RISC,
    but if the carry out from an operation is what
    you want to test, then awkwardness ensues.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jan 12 15:53:37 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline.

    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Fri Jan 12 17:08:14 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    Oh, silly me. I remembered shortly after: what
    is combined with a branch to make a conditional
    branch is not an operate instruction, but a
    test of the contents of a specified register.

    MIPS, Alpha and RISC-V all have slightly different answers here:

    RISC-V has compare-and-branch instructions, and also compare
    instructions slt and sltu that produce 0 or 1.

    Alpha has compare instructions cmpeq cmple cmplt (slt on RISC-V)
    cmpule cmpult (sltu on RISC-V) that produce 0 or 1 and branch
    instructions that compare with 0.

    MIPS has slt and sltu, and a compare-equal-and-branch instruction.

    That's clearly basic enough to fit with RISC,
    but if the carry out from an operation is what
    you want to test, then awkwardness ensues.

    carry = sum<operand1

    That's one sltu/cmpult instruction.

    However, an add with carry-in carry-out is five instructions on these architectures.
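
    Spelled out in C, with the comments counting the instructions a
    MIPS/Alpha/RISC-V style compiler would emit; a sketch, assuming the
    operands are already in registers, using the same sum<operand1 trick.

      #include <stdint.h>

      /* 64-bit add with carry-in and carry-out, flag-free style */
      static uint64_t adc64(uint64_t a, uint64_t b, uint64_t cin, uint64_t *cout) {
          uint64_t t   = a + b;        /* 1: add                         */
          uint64_t c1  = t < a;        /* 2: sltu / cmpult               */
          uint64_t sum = t + cin;      /* 3: add                         */
          uint64_t c2  = sum < t;      /* 4: sltu / cmpult               */
          *cout = c1 | c2;             /* 5: or (at most one can be set) */
          return sum;
      }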

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 12 18:11:17 2024
    Quadibloc wrote:

    On Mon, 13 Nov 2023 16:10:20 +0100, Terje Mathisen wrote:
    MitchAlsup wrote:
    Chris M. Thomasson wrote:

    Think of LL/SC... If one did not honor the reservation granule....
    well... Shit.. False sharing on a reservation granule can cause live
    lock and damage forward progress wrt some LL/SC setups.

    One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
    container. Only aligned containers possess ATOMIC-smelling properties.

    This is so obviously correct that you should not have needed to mention
    it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
    updates is something that should only ever be done for testing purposes.

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a single-byte operand, to avoid the issue.

    However, qualifications are needed to make the statement "obviously
    correct". Basically, one should never attempt an atomic operation on
    an unaligned value in memory... on a machine that does paging. Because
    the unaligned value _might_ cross a page boundary.

    Even crossing a line boundary exposes interested 3rd parties to
    intermediate state. Consider a line-spanning access to 0x1234567E while at
    the same time there is an access to 0x12345680. The first half of the
    access to 0x1234567E takes place in parallel with the access to 0x12345680
    and then the second half of the access to 0x1234567E is performed. This
    is not ATOMIC in any sense of the word.
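
    The arithmetic behind that example, assuming a 64-byte line (the line
    size is an assumption; the addresses are the ones above):

      #include <stdint.h>
      #include <stdio.h>

      static int straddles(uint64_t addr, unsigned size, unsigned line) {
          return (addr / line) != ((addr + size - 1) / line);
      }

      int main(void) {
          /* an 8-byte access at 0x1234567E touches two 64-byte lines,
             so another agent can slip in between the two halves */
          printf("%d\n", straddles(0x1234567EULL, 8, 64));   /* prints 1 */
          printf("%d\n", straddles(0x12345680ULL, 8, 64));   /* prints 0 */
          return 0;
      }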

    Otherwise, there's no problem. And a computer certainly _could_ be
    aware that precautions are needed for atomic instructions, and
    proceed with their execution only after all the memory pages involved
    were brought into memory, and locked there. That would still mean
    the computer would be slowed unnecessarily, but error-free operation
    can be guaranteed.

    So if someone wanted, they could design a computer which didn't mind
    atomic operations on unaligned values all that much.

    Adding lots of complexity for things that should never happen.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Fri Jan 12 17:57:17 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    As even the PDP-8 had auto-increment addressing, certainly
    its cost must be bearable, if cost is thought of as the
    number of transistors required to implement it.

    What may be cheap in a 10CPI=0.1IPC implementation may be expensive in
    an implementation that tries to support up to 10IPC (as in Cortex-X4).
    In particular, widely ported register files are expensive. But
    apparently ARM, Apple, and others have found ways to implement
    auto-increment at bearable cost.

    Although
    that feature seens odd for a RISC design even to me.

    Not particularly: ARM A32, HPPA, Power, and ARM A64 have
    auto-increment.

    To my mind, ALU status bits are at least essential
    for things like add-with-carry for multiple-precision
    arithmetic.

    It's not essential to have them separate from the rest of the results
    of the computation. And if you attach carry and overflow bits to the
    register that contains the rest of the results of the computation,
    that has advantages: Tasks like multi-precision multiplication that
    benefit from having several carry bits become easier to write. And
    you can easier deal with these bits in compilers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Fri Jan 12 18:19:58 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >>instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    Yes, I think there are crypto instructions or somesuch that handle a
    lot of registers in one instruction, and I guess that they take
    multiple cycles, so accessing many registers can be distributed across
    multiple cycles.

    What I had in mind were store-pair with [reg+reg] addressing (4
    reads), and load-pair with auto-increment (three writes), and I expect
    these instructions to be fast, as in: if the microarchitecture allows
    two stores per cycle, I expect to be able to do two store-pair with
    [reg+reg], and likewise for load-pair on a microarchitecture that
    allows three loads per cycle.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Jan 12 18:24:39 2024
    Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    Though, as I see it, 64-bit ARM still has a few concerning features in these areas:
    Auto-increment addressing;
    ALU status-flag bits;
    ...

    Both are features that the architects of MIPS (and its descendents,
    including Alpha and RISC-V) considered so concerning that they do not
    feature them in their architecture. The architects of A64 knew these concerns, yet decided to include these features, so they obviously
    were sure that they could implement these features efficiently at
    bearable cost.

    Auto-increment:
    One has an operation that either needs to do something weird with the register ports, or, more likely, needs to be decoded as two operations
    in the pipeline.

    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and instructions that read four registers.

    It is also comparably infrequent, as "*ptr++" isn't
    used *that* often.

    Out of 192106 instructions in /bin/bash (in Debian 11), there are 1829

    1%

    instructions with pre-increment "]!"; most of them are stp (store
    pair) instructions, and the increment is usually negative and often
    smaller than the size of the two registers, the address register is
    usually sp. So the usual use seems to be for saving caller-saved or callee-saved registers.

    Out of these 195815 instructions, 3002 use post-increment "],"; most

    1½%

    of them are ldp (load pair) instructions, and the increment is usually positive, and the address register is usually sp (2688 cases). So
    most of these cases seem to be due to loading caller-saved registers
    after the call or callee-saved registers before the return.

    Overall, there are 25197 loads and stores that use sp as address
    register, out of 61387 loads and stores.

    So we have a use case where the designers were willing to <ahem>
    burden their ISA with pre/post-increment/decrement for a 2½% use
    while simultaneously screwing with their register porting, and
    creating an artificial data-dependency on future access to the
    stack (sp).

    I will gently suggest one can do better.....

    [a76:~:536] objdump -d /bin/bash|grep "^ "|wc -l
    192106
    [a76:~:537] objdump -d /bin/bash|grep ']!'|wc -l
    1829
    [a76:~:538] objdump -d /bin/bash|grep '[[][a-z].*],'|wc -l
    3002
    [a76:~:539] objdump -d /bin/bash|grep 'sp],'|wc -l
    2688
    [a76:~:540] objdump -d /bin/bash|grep '[[]sp'|wc -l
    25197
    [a76:~:541] objdump -d /bin/bash|grep '[[][a-z]'|wc -l
    61387

    ALU status flags:
    The flags themselves are fairly rarely used in practice,

    Conditional branches tend to be quite frequent.

    15%-ish with 3% BR and 2% CALL/RET and 1% JMP (switches)

    but the cost of
    keeping these sorts of flags consistent in the pipeline is not so cheap.

    SPARC showed the way. If you are going to have CCs, have instructions that
    do not set the CCs distinct from those that do set CC.

    x86 showed what not to do:: C-O-ZAPS must be tracked separately in the
    pipeline for efficient use of CC and conditional branching.

    Intel uses as many physical flags registers as physical integer
    registers (280 each on Tigerlake and Golden Cove), ARM somewhat less
    than the integer registers. And the register renamer needs to keep
    track of them separately (for AMD64 it needs to keep track of C, O,
    and NZP separately; I expect that A64 is better in this respect).
    Yes, not cheap, but obviously manageable.

    In AMD's versions there is more reservation station logic used to track
    CC as 3 independent containers than is used to track the operand
    registers themselves.

    And, possibly the cost difference between a 1-bit status flag and, say,
    4 or 5 flag bits, isn't that large. In either case, may make sense to
    limit which instructions may update flags (unlike x86)

    Actually, updating all flags in every instruction would make the implementation easier, too: every flag-using instruction would only be
    able to use the result of the previous instruction, no need to store
    flags longer. However, I guess it might be harder to program with
    such a model.

    This is what SPARC showed.

    My take is that GPRs should have additional carry and overflow flags
    (which are not stored and loaded with the usual store and load
    instructions); they have the information of the N and Z flags already.
    This makes tracking the flags easy, and also allows programs to deal
    with multiple live carry flags, as needed for multi-precision
    multiplication.

    Completely unnecessary and insufficient at the same time.

    and possibly only
    allow them in "lane 1" or whatever the equivalent is (the secondary ALUs >>only doing non-flags-updating forms).

    That's an implementation issue. Do ARM A64 implementations have such restrictions?

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Jan 12 18:55:38 2024
    Quadibloc wrote:



    As for ALU status flag bits, I think they're a feature that
    should be kept. But one concern with them relates to the
    same reason that some early RISC architectures had branch
    delay slots.

    So a common way in which this concern is mitigated is for
    RISC architectures to include, in instructions that can
    affect the condition codes, a bit that controls whether or
    not they do so. That way, other operate instructions can
    be placed between an instruction that sets the condition
    codes and the branch instruction that tests them.

    Having condition codes in GPRs gets you as many codes as
    you can every use and for free !! You make CMP instructions
    return a bit-vector to a GPR and you have branch on bit
    instructions--and you still don't need the cruft of CCs
    in your ISA.
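
    A rough C model of that style. The predicate bit assignments below are
    made up for illustration; My 66000's actual CMP result layout may differ.

      #include <stdint.h>

      enum { BIT_EQ = 0, BIT_LT = 1, BIT_LTU = 2, BIT_GT = 3 };  /* invented layout */

      /* CMP: one instruction; the result is an ordinary GPR full of predicates */
      static uint64_t cmp_bits(int64_t a, int64_t b) {
          uint64_t r = 0;
          r |= (uint64_t)(a == b)                    << BIT_EQ;
          r |= (uint64_t)(a <  b)                    << BIT_LT;
          r |= (uint64_t)((uint64_t)a < (uint64_t)b) << BIT_LTU;
          r |= (uint64_t)(a >  b)                    << BIT_GT;
          return r;
      }

      /* BBit Rt,label then amounts to: if (r & (1ull << BIT_LT)) goto label;
         and any number of such results can be live in GPRs at once. */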

    The PowerPC architecture went further, also perhaps
    addressing another concern with ALU status bits, by
    having multiple sets of condition codes, so that the
    condition codes would behave more like registers,
    rather than being a unique resource.

    CCs in GPRs means you have as many CC sets as your compiler
    can use.

    To my mind, ALU status bits are at least essential
    for things like add-with-carry for multiple-precision
    arithmetic. Otherwise, one would need multiple
    awkward instructions to perform the same function.

    Or a CARRY instruction-modifier.

    And since RISC typically has only load and store
    memory-reference instructions, thus limiting each
    instruction to one basic action, a design that
    apparently forces operate instructions to be
    combined with conditional branch instructions
    seems to be the opposite of RISC. I presume that
    they _don't_ solve the problem by including a
    conditional skip in operate instructions, that
    can skip over a jump instruction that follows
    them (sort of like a PDP-8!)... clearly, I'll
    need to take another look at the MIPS and/or
    the Alpha to see what it is that they _are_
    doing, to understand how it fits into the
    RISC philosophy.

    Many RISCs use the CMP-BC style as 1 instruction.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Jan 12 19:50:14 2024
    Scott Lurndal wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >>>>instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    Yes, I think there are crypto instructions or somesuch that handle a
    lot of registers in one instruction, and I guess that they take
    multiple cycles, so accessing many registers can be distributed across multiple cycles.

    These aren't crypto. They are intended to allow atomic
    64-byte transactions initiated by the cpu, generally to
    on-chip coprocessors (it's called FEAT_LS64).

    They use eight consecutive registers.

    Only consecutive before renaming.......

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jan 12 19:45:36 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Yes. A64 seems to be designed to "do something weird with the
    register ports"; it has instructions that write three registers and >>>instructions that read four registers.

    It has instructions that read or write 8 registers (64 bytes).

    Yes, I think there are crypto instructions or somesuch that handle a
    lot of registers in one instruction, and I guess that they take
    multiple cycles, so accessing many registers can be distributed across multiple cycles.

    These aren't crypto. They are intended to allow atomic
    64-byte transactions initiated by the cpu, generally to
    on-chip coprocessors (it's called FEAT_LS64).

    They use eight consecutive registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Fri Jan 12 22:01:19 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    But
    apparently ARM, Apple, and others have found ways to implement
    auto-increment at bearable cost.

    It is certainly possible to crack such an instruction into two
    micro-ops, liker POWER does with ldu and friends. If you have a
    mechanism for cracking into micro-ops, that cost certainly looks
    bearable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sat Jan 13 04:38:22 2024
    On Tue, 19 Dec 2023 17:47:25 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 07:22:10 +0000, Quadibloc wrote:

    On Tue, 19 Dec 2023 03:36:06 +0000, Quadibloc wrote:

    I changed where, in the opcode space, the supplementary
    memory-reference instructions were located. This allowed me to have a
    few more bits available for them.

    I've moved them again, making even more space available... because in my
    last change, I made the mistake of using the opcode space that I was
    already using for block headers. I couldn't reduce the amount of
    information in a block header by two bits, by using a combination of ten
    bits instead of eight to indicate a block header, so I had to do my
    rearranging in this place instead.

    And now, with what I've learned from this experience, I've made further changes. I've increased the length of the opcode field in the supplementary memory-reference instructions that were moved to be among the other memory-reference instructions, so as to have enough for the different
    sizes of the various types to be supported.

    But in addition, I have now engaged in what some may see as an act of
    pure evil.

    Once again there are supplementary memory-reference instructions among
    the operate instructions as well. *These*, however, provide, for the conventional integer and floating-point types, CISC-style memory-to-
    register operate instructions! So even within the basic 32-bit instruction set, although _these_ instructions are highly restricted in register use
    and addressing modes, the pretense of being a load-store architecture
    has been dropped!

    I have made further changes to both of these types of supplementary memory-reference instructions.

    The ones that provide load-store memory-reference instructions have had
    the restrictions on their addressing modes slightly relaxed by means of shrinking the opcode field by one bit.

    The ones that provide memory to register operate instructions of the most common types have had the restrictions on their addressing modes slightly relaxed by restricting them to aligned operands.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jan 16 16:46:13 2024
    MitchAlsup [2024-01-10 22:32:50] wrote:
    Stefan Monnier wrote:
    MitchAlsup [2024-01-10 22:32:50] wrote:
    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline
    Sorry, what's "Calk"?
    A calculation instruction {ADD, AND, ...}

    Hmmm, what's the benefit of co-issuing an ST with a Calk?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Jan 17 00:18:37 2024
    Stefan Monnier wrote:

    MitchAlsup [2024-01-10 22:32:50] wrote:
    Stefan Monnier wrote:
    MitchAlsup [2024-01-10 22:32:50] wrote:
    The idioms recognized in My 66150 core:
    CMP Rt,--,-- ; BBit Rt,label
    Calk Rd,--,-- ; BCnd Rd,label
    LD Rd,[--] ; BCnd Rd,label
    ST Rd,[--] ; Calk --,--,--
    CALL Label ; BR Label
    These all CoIssue (both instruction pass through the pipeline
    Sorry, what's "Calk"?
    A calculation instruction {ADD, AND, ...}

    Hmmm, what's the benefit of co-issuing an ST with a Calk?

    The opportunity is that there are register file ports available.

    CoIssuing ST with Calk allows a 1-wide machine to perform 2 inst
    in a single beat down the pipeline.

    Not all STs can CoIssue with all Calks {this is a register counting
    problem.}
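
    A toy illustration of that register-counting problem; the port numbers
    below are invented, not My 66150's actual register-file configuration.

      /* Can two instructions share one issue beat without exceeding the
         register file's read/write ports?                                  */
      struct need { int reads, writes; };

      static int can_coissue(struct need a, struct need b,
                             int read_ports, int write_ports) {
          return a.reads  + b.reads  <= read_ports &&
                 a.writes + b.writes <= write_ports;
      }

      /* e.g. ST Rd,[Rb+disp] needs 2 reads, 0 writes; ADD Rd,Rs1,Rs2 needs
         2 reads, 1 write: together they fit a 4-read, 1-write file, but an
         ST with [Rb+Ri] indexing (3 reads) plus that ADD would not. */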


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Paul A. Clayton on Sat Jan 20 18:41:35 2024
    On Fri, 19 Jan 2024 16:16:07 -0500, Paul A. Clayton wrote:

    RISC-V has enough inelegance that considering it a model of
    perfection implies, in my opinion, significant noobiness (or
    perhaps what I might consider poor taste).

    I feel assured that my efforts with regard to Concertina II
    are so obscure that there is no real danger that those who
    see RISC-V as a model of perfection _compared to it_ will
    wield considerable influence in its favor...

    Courage often is doing the right thing even when it is (by all
    rational examination) pointless.

    The duty of the courageous soldier is to work towards
    achieving victory. Choose when to fight; avoid wasting
    energy, resources, and men in losing battles.

    Yes, the common soldier is a tool in the hands of his
    general, but when, as an individual, one is one's own
    general, then one bears the same responsibilities as
    a real general.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Quadibloc on Thu Feb 15 20:23:30 2024
    On Fri, 12 Jan 2024 04:09:30 -0000 (UTC), Quadibloc wrote:

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a single-byte operand, to avoid the issue.

    That assumes it doesn’t create a new issue. Like some 16-bit architecture
    I remember from the 1980s, that could not do single-byte bus cycles. So
    writing a byte involved a read-modify-write sequence of bus operations.
    Try doing that atomically ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Feb 15 20:58:55 2024
    Lawrence D'Oliveiro wrote:

    On Fri, 12 Jan 2024 04:09:30 -0000 (UTC), Quadibloc wrote:

    While older machines used an "exchange" instruction for something
    atomic, the IBM 360 had the "Test and Set" instruction which had a
    single-byte operand, to avoid the issue.

    That assumes it doesn’t create a new issue. Like some 16-bit architecture
    I remember from the 1980s, that could not do single-byte bus cycles. So writing a byte involved a read-modify-write sequence of bus operations.
    Try doing that atomically ...

    Do not allow the arbiter to allow any other access between the Rd and the Wt.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Fri Feb 16 15:20:40 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 1/10/2024 7:03 PM, Chris M. Thomasson wrote:
    On 11/13/2023 7:10 AM, Terje Mathisen wrote:

    Actually, it was experimented with wrt artificially triggering a bus
    lock on Intel via unaligned access and dummy LOCK RMW (iirc) to implement a user space RCU wrt remote memory barriers. Dave Dice comes
    to mind. I am having trouble trying to find the god damn paper! I know I
    read it before.


    I need to point out that that unaligned access that would trigger an
    actual bus lock is when the access straddled a l2 cache line wrt the
    LOCK'ed RMW.

    You don't actually _need_ an unaligned access to trigger an actual
    bus lock - if you can arrange for sufficient contention to a single
    line, the processor may eventually grab the bus lock to make forward
    progress.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Fri Feb 16 21:39:08 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 2/16/2024 1:05 PM, Chris M. Thomasson wrote:
    On 2/16/2024 7:20 AM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 1/10/2024 7:03 PM, Chris M. Thomasson wrote:
    On 11/13/2023 7:10 AM, Terje Mathisen wrote:

    Actually, it was experimented with wrt artificially triggering a bus
    lock on Intel via unaligned access and dummy LOCK RMW (iirc) to
    implement a user space RCU wrt remote memory barriers. Dave Dice comes
    to mind. I am having trouble trying to find the god damn paper! I know I
    read it before.


    I need to point out that that unaligned access that would trigger an
    actual bus lock is when the access straddled a l2 cache line wrt the
    LOCK'ed RMW.

    You don't actually _need_ an unaligned access to trigger an actual
    bus lock - if you can arrange for sufficient contention to a single
    line, the processor may eventually grab the bus lock to make forward
    progress.

    True, but I think a LOCK'ed RMW on unaligned memory that straddles a
    cache line triggers one right off the bat? There was something called
    QPI that abused this to get remote memory barriers. I got a response a
    while back from my friend Dmitry Vyukov that we both read the paper but
    it seems to have been taken down. Dave Dice comes to mind.

    I remember the Q in QPI was for quiescence.

    QuickPath Interconnect. Like HT (Hyper Transport) but from Intel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)