• "Mini" tags to reduce the number of op codes

    From Stephen Fuld@21:1/5 to All on Wed Apr 3 09:43:44 2024
    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it
    has several features that are “friendly” to the idea. Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but perform
    the appropriate conversion on the second operand first, potentially
    saving an instruction
    3. Always do the operation in floating point and convert the integer operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

I suspect option 4 is the least useful choice. I am not sure which of the
others is the best option.

Given that, use the same op code for the floating-point and fixed-point
versions of the same operations. That saves eight op codes: the four
arithmetic operations, max, min, abs, and compare. Subtracting the two
new load instructions, that is a net savings of six op codes.
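A minimal sketch (in Python, purely illustrative; nothing here is My 66000) of the shared-opcode dispatch rule above. Tags are one bit per register: True means floating point, False means integer. For the mixed-tag case, option 2 is shown: source 1's tag selects the operation and source 2 is converted first.

```python
# Sketch of tag-driven opcode sharing. One ADD opcode serves both
# integer and FP; the operand tag bits select the operation.

def execute_add(val1, tag1, val2, tag2):
    """Shared ADD: same tags -> plain add with that tag; mixed tags ->
    option 2, convert operand 2 to the type of operand 1."""
    if tag1 == tag2:
        # Same tags: integer or FP add; result keeps the common tag.
        return val1 + val2, tag1
    # Mixed tags (option 2): source 1's tag wins.
    if tag1:
        return val1 + float(val2), True    # FP add, integer converted
    return val1 + int(val2), False         # integer add, FP truncated

assert execute_add(3, False, 4, False) == (7, False)       # int + int
assert execute_add(1.5, True, 2.25, True) == (3.75, True)  # fp + fp
assert execute_add(1.5, True, 2, False) == (3.5, True)     # mixed: fp wins
```

Under option 1 the mixed-tag branch would instead raise an exception; the same-tag path is identical in every variant.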

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost). The same
    mechanism works for interrupts that take control away from a running
    process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty rare.
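The repurposing trick can be sketched as follows. Both encodings are normally no-ops (OR of a register with itself, OR with zero), which is what frees them up; whether the zero is an immediate or a register is an assumption of this sketch.

```python
# Sketch: repurposing no-op OR encodings to manage tag bits.
# tags is one boolean per register; regs holds 64-bit values.

def execute_or(tags, regs, rd, rs1, src2, src2_is_imm):
    """OR rd, rs1, src2 with the repurposing rule from the post:
    OR r,r,r (register with itself) sets r's FP tag;
    OR r,r,#0 (register with zero) clears it; value is unchanged.
    Any other OR behaves normally and yields an integer-tagged result."""
    if not src2_is_imm and rd == rs1 == src2:
        tags[rd] = True              # OR r,r -> set FP tag
        return
    if src2_is_imm and rd == rs1 and src2 == 0:
        tags[rd] = False             # OR r,#0 -> clear FP tag
        return
    val2 = src2 if src2_is_imm else regs[src2]
    regs[rd] = regs[rs1] | val2
    tags[rd] = False                 # ordinary integer result

regs = [0] * 32
tags = [False] * 32
regs[5] = 0x3FF0000000000000         # bit pattern of FP 1.0
execute_or(tags, regs, 5, 5, 5, False)   # OR r5,r5,r5: mark r5 as FP
assert tags[5] is True
execute_or(tags, regs, 5, 5, 0, True)    # OR r5,r5,#0: back to integer
assert tags[5] is False
```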

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Wed Apr 3 17:24:05 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the
    disassembler). This reminds me of the work on SafeTSA [amme+01] where
    they encode only programs that are correct (according to some notion
    of correctness).

    I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost).

    That's expensive in an OoO CPU. There you want each tag to be stored
    alongside with the other 64 bits of the register, because they should
    be renamed at the same time. So the ENTER instruction would depend on
    all the registers that it saves (or maybe on all registers). And upon
EXIT the restored registers have to be reassembled (which is not that
expensive).

    I have a similar problem for the carry and overflow bits in <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    The same
    mechanism works for interrupts that take control away from a running
    process.

    For context switches one cannot get around the problem, but they are
    much rarer than calls and returns, so requiring a pipeline drain for
    them is not so bad.

    Concerning interrupts, as long as nesting is limited, one could just
    treat the physical registers of the interrupted program as taken, and
    execute the interrupt with the remaining physical registers. No need
    to save any architectural registers or their tag, carry, or overflow
    bits.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use?

In an OoO CPU, that's pretty heavy.

    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT), so you
    probably can handle the tags in the front end (decoder and renamer).
Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.
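Anton's observation can be sketched as a toy front-end model: register tags are a pure function of the decoded opcode stream, so the decoder can track them in a 32-bit vector without waiting on execution results; only EXIT, which reloads the vector from memory, forces a wait. All instruction mnemonics here are illustrative, not actual My 66000 encodings.

```python
# Sketch of decode-time tag tracking. Tags never depend on computed
# data values, so the front end can maintain them speculatively;
# EXIT is the one point that needs data from the load/store unit.

class FrontEnd:
    def __init__(self):
        self.tags = [False] * 32     # tag state kept in the decoder

    def decode(self, opcode, rd=None, memory_tags=None):
        if opcode in ("LDSF", "LDDF"):     # FP loads set the tag
            self.tags[rd] = True
        elif opcode in ("LD32", "LD64"):   # non-FP loads clear it
            self.tags[rd] = False
        elif opcode == "ENTER":
            pass  # tags are only *stored*; nothing to wait on here
        elif opcode == "EXIT":
            # The stall point: the decoder needs the 32 reloaded tag
            # bits before it can steer later shared-opcode instructions.
            self.tags = list(memory_tags)

fe = FrontEnd()
fe.decode("LDDF", rd=3)
assert fe.tags[3] is True
fe.decode("LD64", rd=3)
assert fe.tags[3] is False
fe.decode("EXIT", memory_tags=[True] * 32)
assert fe.tags[7] is True
```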

    @InProceedings{amme+01,
    author = {Wolfram Amme and Niall Dalton and Jeffery von Ronne
    and Michael Franz},
    title = {Safe{TSA}: A Type Safe and Referentially Secure
    Mobile-Code Representation Based on Static Single
    Assignment Form},
    crossref = {sigplan01},
    pages = {137--147},
    annote = {The basic ideas in this representation are:
    variables are named as the pair (distance in the
    dominator tree, assignment within basic block);
    variables are separated by type, with operations
    referring only to variables of the right type (like
    integer and FP instructions and registers in
    assemblers); memory references use types to encode
    that a null-pointer check and/or a range check has
already occurred, allowing optimizing these
    operations; the resulting code is encoded (using
    text compression methods) in a way that supports
    only correct code. These ideas are discussed mostly
    in a general way, with some Java-specifics, but the
    representation supposedly also supports Fortran95
    and Ada95. The representation supports some CSE, but
    not for address computation operations. The paper
    also gives numbers on size (usually a little smaller
    than Java bytecode), and some other static metrics,
    especially wrt. the effect of optimizations.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to Stephen Fuld on Wed Apr 3 14:44:27 2024
    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it has several features that are “friendly” to the idea. Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    If you are adding a float/int data type flag you might as well
also add operand size for floats at least, though some ISAs
    have both int32 and int64 ALU operations for result compatibility.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction
    3. Always do the operation in floating point and convert the integer operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice. I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations. So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare. So far, a net
    savings of six opcodes.

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost). The same mechanism works for interrupts that take control away from a running
    process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty
    rare.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it. IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    Currently the opcode data type can tell the uArch how to route
    the operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.
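EricP's routing point can be made concrete with a toy model: today the opcode alone names the functional unit, so reservation-station steering is a decode-time decision; with register tags, a shared opcode cannot be routed until both source tags are known. The function names and FU labels are illustrative only.

```python
# Sketch contrasting opcode-based routing (decided at decode) with
# tag-dependent routing (blocked until both operand tags resolve).

def route_static(opcode):
    """Status quo: the opcode statically names the FU."""
    return {"ADD": "ALU", "FADD": "FPU", "FMUL": "FPU"}[opcode]

def route_tagged(opcode, tag1, tag2):
    """Mini-tags: a shared ADD routes on operand tags, which may be
    unknown (None) until earlier instructions deliver them."""
    if tag1 is None or tag2 is None:
        return None                      # cannot schedule yet
    return "FPU" if (tag1 and tag2) else "ALU"

assert route_static("FADD") == "FPU"
assert route_tagged("ADD", True, True) == "FPU"
assert route_tagged("ADD", False, False) == "ALU"
assert route_tagged("ADD", True, None) is None   # stalled on unknown tag
```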

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Wed Apr 3 20:02:25 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
minor opcode for arithmetic instructions. Compare with https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    What is _really_ eating up opcode space are many- (usually 16-) bit
    constants in the instructions.
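Thomas's bit budget works out as follows (numbers from his post for a typical 32-bit RISC, not any particular ISA):

```python
# Bit-budget arithmetic for a 32-bit, three-register RISC instruction.
WORD = 32
MAJOR_OPCODE = 6
REG_FIELD = 5                       # 32 registers -> 5 bits each
THREE_REGS = 3 * REG_FIELD          # 15 bits of register specifiers

minor_opcode = WORD - MAJOR_OPCODE - THREE_REGS
assert minor_opcode == 11           # 2**11 = 2048 minor opcodes per major

# By contrast, a 16-bit immediate plus two registers leaves nothing
# beyond the major opcode itself -- that is what eats the space:
remaining = WORD - 16 - 2 * REG_FIELD
assert remaining == 6
```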

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 3 21:30:02 2024
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes.  One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use.  Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register
    specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
    66000
has several features that are “friendly” to the idea.  Second, I know
Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If
    set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register.  Non-floating-point loads would clear the
    tag bit.  As I show below, I don’t think you need any special "store
    tag" instructions.

What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

3.    Always do the operation in floating point and convert the integer
operand prior to the operation.  (Or, if you prefer, change floating
    point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions.  And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts.  Given the tag bit, these could share the same
    op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if FP
and Integers share a register file the lack of prototype is much less
stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost).  The same
    mechanism works for interrupts that take control away from a running
    process.

    Yes, but we do just fine without the tag and without the stuff mentioned
    above. Neither ENTER nor EXIT care about the 64-bit pattern in the register.

    I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some other
instructions to do this, without requiring another op-code.   For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it.  These should be pretty
    rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it? 

    No.

    To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use? 

The problem is you have made DECODE dependent on dynamic pipeline information.
    I suggest you don't want to do that. Consider a change from int to FP instruction
    as a predicated instruction, so the pipeline cannot DECODE the instruction at hand until the predicate resolves. Yech.

    If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
    think the overall gains are pretty small, and probably not worth it
unless the op-code space is really tight (which, for My 66000 it isn’t).
    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    It is actually an interesting idea if you want to limit your architecture
    to 1-wide.



    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
    00: Object Reference
    Next 2 bits:
    00: Pointer (with type-tag)
    01: ?
    1z: Bounded Array
    01: Fixnum (route to ALU)
    10: Flonum (route to FPU)
    11: Other types
    00: Smaller value types
    Say: int/uint, short/ushort, ...
    ...

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    More importantly, you added a cycle AFTER register READ/Forward before
    you can start executing (more when OoO is in use).

    And finally, the compiler KNOWS what the type is at compile time.
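BGB's proposed tag layout above can be written out as a classifier. The field widths and type names follow his outline; the function name and value representation are made up for illustration.

```python
# Sketch of BGB's register tag layout: the top 2 bits of a 64-bit
# register select the major type, with 2 more bits of sub-type for
# object references and "other" values.

def classify(reg):
    """Classify a 64-bit register value by its top tag bits."""
    top2 = (reg >> 62) & 0b11
    if top2 == 0b00:                     # object reference
        next2 = (reg >> 60) & 0b11
        if next2 == 0b00:
            return "pointer"             # pointer with type-tag
        if next2 == 0b01:
            return "reserved"            # the '?' entry in the outline
        return "bounded-array"           # 1z
    if top2 == 0b01:
        return "fixnum"                  # route to ALU
    if top2 == 0b10:
        return "flonum"                  # route to FPU
    next2 = (reg >> 60) & 0b11           # 11: other types
    if next2 == 0b00:
        return "small-value"             # int/uint, short/ushort, ...
    return "other"

assert classify(0b01 << 62) == "fixnum"
assert classify(0b10 << 62) == "flonum"
assert classify(0) == "pointer"
```

This also makes Mitch's objection below concrete: every value in the example surrenders at least the top 2 bits of the register to the tag.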

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 3 21:53:26 2024
    BGB-Alt wrote:


    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
    00: Object Reference
    Next 2 bits:
    00: Pointer (with type-tag)
    01: ?
    1z: Bounded Array
    01: Fixnum (route to ALU)
    10: Flonum (route to FPU)
    11: Other types
    00: Smaller value types
    Say: int/uint, short/ushort, ...
    ...

    So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
    This solves nobody's problems; not even LISP.

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

Not good. But what if you don't know the tag until the register is delivered
from a latent FU: do you stall DECODE, or do you launch and make the
instruction queue element deal with all outcomes?

    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied up
    with the comparably expensive register-forwarding logic), but asking for
    3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues.

    Real-friggen-ely

Or, possibly, the FUs decide
    whether to accept the operation:
    ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

What if IMUL is performed in FMAC, IDIV in FDIV, ... Int<->FP routing is
based on calculation capability. (Even the CDC 6600 performed int × in the
FP × unit; that is not in Thornton's book, but came via a conversation with
a 6600 logic designer at Asilomar some time ago. All they had to do to get
FP × to perform int × was disable one gate.)

    But, a proper dynamic language allows mixing fixnum and flonum with the result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    Many other cases could get hairy, but to have any real benefit, the CPU
    would need to be able to deal with them. In cases where the compiler
    deals with everything, the type-tags become mostly moot (or potentially detrimental).

    You are arguing that the added complexity would somehow pay for itself.
    I can't see it paying for itself.

    But, then, there is another issue:
    C code expects C type semantics to be respected, say:
    Signed int overflow wraps at 32 bits (sign extending);
    maybe
    Unsigned int overflow wraps at 32 bits (zero extending);
    maybe
    Variables may not hold values out-of-range for that type;
    LLVM does this GCC does not.
    The 'long long' and 'unsigned long long' types are exactly 64-bit;
    At least 64-bit not exactly.
    ...
    ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.

    Ya think ?

    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

I give up--not worth trying to teach cosmologists why the color of the
    lipstick going on the pig is not the problem.....

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Apr 3 23:20:46 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB-Alt wrote:



    But, a proper dynamic language allows mixing fixnum and flonum with the
    result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the
    Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    The Burroughs B3500 would simply ignore the zone digit when adding
    a string to an integer, based on the address controller for the
    operand.

    ADD 1225 010000(UN) 020000(UA) 030000(UN)

    Would add the 12 unsigned numeric nibbles at address 10000
    to the 25 numeric digits of the 8-bit EBCDIC/ASCII data at address 20000
    and store the result as 25 numeric nibbles at address 30000.

    ADD 0507 010000(UN) 020000(UN) 030000(UA)

    Would add the 5 unsigned numeric nibbles at 10000 to
    the 7 unsigned numeric nibbles at 20000 and store them
    as 8-bit EBCDIC bytes at 30000 (inserting the zone digit @F@
    before each numeric nibble). A processor mode toggle selected
    whether the inserted zone digit should be @F@ (EBCDIC) or @3@ (ASCII).

    Likewise for SUB, INC, DEC, MPY, DIV and data movement instructions.

    The data movement instructions would left- or right-align the destination
    field (MVN (move numeric) would right justify and MVA (move alphanumeric) would left justify) when the destination and source field lengths differ.

    Floating point was BCD with an exponent sign digit, two exponent digits,
    a mantissa sign digit and a variable length mantissa of up
    to 100 digits in length. The integer instructions could be used
    on either the mantissa or exponent individually, as they were
    just fields in memory.
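The zone-ignoring behavior Scott describes can be modeled in a few lines. The helpers and memory layout are illustrative only; the zone digit (F for EBCDIC, 3 for ASCII) follows his description of the processor mode toggle.

```python
# Sketch of B3500-style mixed-format decimal addition: unsigned-numeric
# (UN) fields hold one digit per nibble; zoned (UA) fields hold one
# digit per byte, and the zone digit is simply ignored on input.

def digits_from_un(nibbles):
    """UN field: already one decimal digit per nibble."""
    return nibbles

def digits_from_ua(data):
    """UA field: keep the low nibble of each byte; the zone is ignored."""
    return [b & 0x0F for b in data]

def zoned(digits, ascii_mode=False):
    """Store digits as zoned bytes, inserting zone F (EBCDIC) or
    3 (ASCII) before each numeric nibble, per the mode toggle."""
    zone = 0x30 if ascii_mode else 0xF0
    return [zone | d for d in digits]

# ADD: UN 123 + zoned EBCDIC '456' -> 579
a = digits_from_un([1, 2, 3])
b = digits_from_ua([0xF4, 0xF5, 0xF6])
total = int("".join(map(str, a))) + int("".join(map(str, b)))
assert total == 579
# Stored back as zoned ASCII bytes ('5','7','9'):
assert zoned([5, 7, 9], ascii_mode=True) == [0x35, 0x37, 0x39]
```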

  • From Terje Mathisen@21:1/5 to All on Thu Apr 4 10:32:48 2024
    MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 66000 as an example “substrate” for two reasons.  First, it
    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.
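    The quoted tag scheme can be sketched as a toy register model. Everything
    here is invented for illustration (class and method names, the dispatch in
    Python); a real core would simply steer the operation to an FADD or ADD
    unit based on the tag bits.

    ```python
    # Toy model of the per-register FP tag: loads set or clear the tag,
    # and a single ADD opcode dispatches on the source tags.
    import struct

    class TaggedRegs:
        def __init__(self):
            self.val = [0] * 32          # raw 64-bit register contents
            self.fp = [False] * 32       # the 32 tag bits

        def load_int(self, rd, bits):
            self.val[rd] = bits
            self.fp[rd] = False          # non-FP loads clear the tag

        def load_fp(self, rd, bits):
            self.val[rd] = bits
            self.fp[rd] = True           # FP loads set the tag

        def add(self, rd, rs1, rs2):
            # both tags set -> FP add on the bit patterns; else integer add
            if self.fp[rs1] and self.fp[rs2]:
                a = struct.unpack("<d", self.val[rs1].to_bytes(8, "little"))[0]
                b = struct.unpack("<d", self.val[rs2].to_bytes(8, "little"))[0]
                self.val[rd] = int.from_bytes(struct.pack("<d", a + b), "little")
                self.fp[rd] = True
            else:
                self.val[rd] = (self.val[rs1] + self.val[rs2]) & (2**64 - 1)
                self.fp[rd] = False
    ```

    Note how the model also exposes Mitch's objection below: with only raw bit
    patterns and tags, reinterpreting an FP pattern as an integer needs some
    extra mechanism.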

    What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa?

    This is why, if you want to copy Mill, you have to do it properly:

    Mill does NOT care about the type of data loaded into a particular belt
    slot, only the size and if it is a scalar or a vector filling up the
    full belt slot. In either case you will also have marker bits for
    special types like None and NaR.

    So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
    last being the same as the scalar anyway).

    Only load ops and explicit widening/narrowing ops sets the size tag
    bits, from that point any op where it makes sense will do the right
    thing for either a scalar or a short vector, so you can add 16+16 8-bit
    vars with the same ADD encoding as you would use for a single 64-bit ADD.
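    The point about one ADD encoding serving both scalar and short-vector
    operands can be sketched like this. This is a model, not Mill code: the
    tag is an invented `(element_bits, count)` pair, standing in for the size
    metadata that only loads and widening/narrowing ops may set.

    ```python
    def tagged_add(a, b, tag):
        # tag carries the element size/count established at load time;
        # the ADD encoding itself says nothing about operand width
        bits, count = tag
        mask = (1 << bits) - 1
        out = 0
        for i in range(count):
            x = (a >> (i * bits)) & mask
            y = (b >> (i * bits)) & mask
            out |= ((x + y) & mask) << (i * bits)   # lanes wrap independently
        return out
    ```

    The same call does a single 64-bit add under tag `(64, 1)` and a packed
    byte-wise add under tag `(8, 8)`.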

    We do NOT make any attempt to interpret the actual bit patterns stored
    within each belt slot; that is up to the instructions. This means that
    there is no difference between loading a float or an int32_t, and it also
    means that it is perfectly legal (and supported) to use bit operations
    on an FP variable. This can be very useful, not just to fake exact
    arithmetic by splitting a double into two 26-bit mantissa parts.
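    The splitting trick mentioned here is classically done with plain FP
    arithmetic (Veltkamp splitting); a sketch, with the function name invented:

    ```python
    def veltkamp_split(a):
        # Split a double into hi + lo, where each part carries at most
        # ~26 significant bits, so hi*hi, hi*lo and lo*lo are all exact
        # in double precision (the basis of Dekker's exact products).
        c = 134217729.0 * a        # multiplier is 2**27 + 1
        hi = c - (c - a)
        lo = a - hi
        return hi, lo
    ```

    The decomposition is exact: `hi + lo` reconstructs `a` bit-for-bit.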

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Thu Apr 4 16:47:44 2024
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill project?

  • From Terje Mathisen@21:1/5 to Michael S on Thu Apr 4 21:13:21 2024
    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill project?

    I am much less active than I used to be, but I still get the weekly conf
    call invites and respond to any interesting subject on our mailing list.

    So, yes, I do consider myself to still be involved.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Thu Apr 4 22:25:30 2024
    On Thu, 4 Apr 2024 21:13:21 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?
    I am much less active than I used to be, but I still get the weekly
    conf call invites and respond to any interesting subject on our
    mailing list.

    So, yes, I do consider myself to still be involved.

    Terje


    Thank you

  • From MitchAlsup1@21:1/5 to BGB-Alt on Fri Apr 5 01:48:33 2024
    BGB-Alt wrote:

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check
    Enforce" mode, which would still need a bit of work to implement, and is
    not likely to be terribly useful.

    A while back (and maybe in the future) My 66000 had what I called the
    Foreign Access Mode. When the HoB of the pointer was set, the first
    entry in the translation table was a 4-doubleword structure: a Root
    pointer, the Lowest addressable Byte, the Highest addressable Byte,
    and a DW of access rights, permissions, ... While sort-of like a capability,
    I don't think it was close enough to actually be a capability or used as
    one.

    So, it fell out of favor, and it was not clear how it fit into the HyperVisor/SuperVisor model, either.

    The most complicated and expensive parts are that it will require
    implicit register and memory tagging (to flag capabilities). Though, the
    cheaper option is simply to not enable it, in which case things behave
    as before, with the new functionality essentially being a NOP. Much of
    the work still needed on this would be
    getting the 128-bit ABI working, and adding some new tweaks to the ABI
    to play well with the capability addressing (effectively it requires
    partly reworking how global variables are accessed).


    The type-tagging scheme used in my case is very similar to that used in
    my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
    off of an early version of the BGBScript VM, and effectively using a lax hybrid typesystem masquerading as C). Though, it has long since moved to
    a more proper C style typesystem, with dynamic types more as an optional extension.

    In general, any time one needs to change the type you waste an instruction compared to typeless registers.

  • From John Savard@21:1/5 to All on Thu Apr 4 21:13:13 2024
    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of
    computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Fri Apr 5 21:34:16 2024
    John Savard wrote:

    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    Most programs I write use bytes (mostly unsigned), a few halfwords (mostly
    signed), a useful count of integers (both signed and unsigned--mainly as
    already defined arguments/returns), and a vast majority of doublewords (invariably unsigned).

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard

  • From John Savard@21:1/5 to All on Sat Apr 6 21:30:47 2024
    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.
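    The packing arithmetic above checks out, with the four leftover bits
    serving as the per-block format header in the 36- and 28-bit cases:

    ```python
    BLOCK = 256                 # one fetch block, in bits

    assert 8 * 32 == BLOCK      # eight 32-bit instructions, no header
    assert 7 * 36 + 4 == BLOCK  # seven 36-bit slots + 4-bit header
    assert 9 * 28 + 4 == BLOCK  # nine 28-bit slots + 4-bit header
    ```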

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    John Savard

  • From Thomas Koenig@21:1/5 to John Savard on Sun Apr 7 21:01:15 2024
    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

  • From MitchAlsup1@21:1/5 to John Savard on Sun Apr 7 20:41:45 2024
    John Savard wrote:

    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    I agree with the arithmetic going into this statement. What I don't
    have sufficient data concerning is "whether these extra formats pay
    for themselves". For example, how many of the 36-bit encodings are
    irredundant with the 32-bit ones, and so on with the 28-bit ones.

    Take::

    ADD R7,R7,#1

    I suspect there is a 28-bit form, a 32-bit form, and a 36-bit form
    for this semantic step, that you pay for multiple times in decoding
    and possibly pipelining. {{There may also be other encodings for
    this; such as:: INC R7}}

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    Agreed.............

    John Savard

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun Apr 7 21:22:50 2024
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??
    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes. Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.
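    The encoding budget in the paragraph above, written out (a hypothetical
    field layout, matching the numbers in the post):

    ```python
    insn_bits = 21
    reg_field = 5                        # 5 bits addresses 32 registers
    three_operands = 3 * reg_field       # Rd, Rs1, Rs2 = 15 bits

    opcode_bits = insn_bits - three_operands
    assert opcode_bits == 6              # only 6 bits left for the opcode
    assert 2 ** opcode_bits == 64        # 64 three-register opcodes total

    # float/double x signed/unsigned quarters that space:
    assert 64 // 4 == 16                 # 16 distinct operations per variant
    ```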

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy, try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Apr 8 06:21:43 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes.

    There could have been a case for adding this (maybe just for
    a few frequent ones: "add r1,r2,r3", "add r1,r2,-r3", "add
    r1,r2,#num" and "add r1,r2,#-num"), but I did not pursue that
    further.

    I looked at load and store instructions with short offsets
    (these would then have been scaled), and short branches. But
    the 21-bit opcode space filled up really, really rapidly.

    Also, it is easy to synthesize a 3-register operation from
    a 2-register operation and a memory move. If the decoder is
    set up for 42 bits anyway, instruction fusion is also a possibility.
    This got a bit weird.

    Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.

    For somebody who does Fortran, I find the frequency of floating
    point instructions surprisingly low, even in Fortran code.

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy, try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density, one would have to have things
    like memory operands (with the associated danger that compilers
    will not put intermediate results in registers, but since they have
    been optimized for x86 for decades, they are probably better now)
    and load with update, which would then have to be cracked
    into two micro-ops. Not sure about the benefit.

  • From Anton Ertl@21:1/5 to Thomas Koenig on Mon Apr 8 07:16:08 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    But beating RISC-V is easy, try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density

    Note that in recent times Mitch Alsup is writing not about code
    density (static code size or dynamically executed bytes), but about
    instruction counts. It's unclear why instruction count would be a
    primary metric, except that he thinks that he can score points for My
    66000 with it. As VAX demonstrates, you can produce an instruction
    set with low instruction counts that is bad at the metrics that really
    count: cycles for executing the program (for a given CPU chip area in
    a given manufacturing process), and, for very small systems, static
    code size.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Savard@21:1/5 to All on Mon Apr 8 07:05:35 2024
    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    Yes, that's a problem. Presumably, I would have to do without
    immediates.

    An option would be to reserve some 16-bit codes to indicate a block
    consisting of one 28-bit instruction and seven 32-bit instructions,
    but that means a third instruction set.

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    By using 36-bit instructions instead of 28-bit instructions.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit instruction by the addresses of the first seven 32-bit positions in
    the block.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.
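    Savard's convention can be sketched as a decode helper. The names are
    invented, and the mode is assumed to have been read from the block's
    first four bits:

    ```python
    def target_to_slot(byte_addr, mode_bits):
        # Round the branch target down to its 256-bit (32-byte) block,
        # then map the offset within the block to an instruction slot.
        block = byte_addr & ~31
        offset = byte_addr - block
        if mode_bits == 36:
            # seven slots, addressed via the first seven 32-bit positions
            slot = offset // 4
        else:
            # 28-bit mode: first eight slots at the 32-bit positions,
            # the ninth addressed via the block's trailing 16 bits
            slot = 8 if offset >= 30 else offset // 4
        return block, slot
    ```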

    John Savard

  • From Thomas Koenig@21:1/5 to John Savard on Mon Apr 8 17:25:38 2024
    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?

  • From MitchAlsup1@21:1/5 to John Savard on Mon Apr 8 19:56:27 2024
    John Savard wrote:

    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit instruction by the addresses of the first seven 32-bit positions in
    the block.

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256 bits, looks at the first 4 bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.

    John Savard

  • From Thomas Koenig@21:1/5 to All on Tue Apr 9 18:24:55 2024
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.

  • From MitchAlsup1@21:1/5 to BGB on Tue Apr 9 21:05:44 2024
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with 250+
    local variables to make effective use of this, *, which probably isn't
    going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of
    GPRs AND you have good access to constants.

    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice that of the number of CPU registers. If more local variables than this, then spill/fill rate goes
    up significantly, and if less, then the registers aren't utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is instead
    that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number
    of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Where, function categories:
    Tiny Leaf:
    Everything fits in scratch registers, no stack frame, no calls.
    Leaf:
    No function calls (either explicit or implicit);
    Will have a stack frame.
    Non-Leaf:
    May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.

    There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee-save registers, so (like the "tiny leaf" category) it requires that the
    number of local variables be less than the number of available registers.

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    The apparent number of registers goes up when one does not waste a register
    to hold a use-once constant.

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 10 00:28:02 2024
    BGB-Alt wrote:

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
    density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind of
    a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for variables in basic blocks which end in a function call (in my compiler design, both function calls and branches terminate the current basic block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.

    Yes, x86 and any architecture with LD-OPs (IBM 360, S.E.L., Interdata, ...) acts as if it has 4-6 more registers than it really has. x86
    with 16 GPRs acts like a RISC with 20-24 GPRs, as does the 360. This does not really
    take the place of universal constants, but goes a long way.


    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases when
    not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single instruction.

    Did end up with an intermediate "memcpy slide", which can handle medium
    size memcpy and memset style operations by branching into a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire system sees only the before or only the after state and nothing in between. This
    means one can start (queue up) a SATA disk access without obtaining a lock
    to the device--simply because one can fill in all the data of a command in
    a single instruction which smells ATOMIC to all interested 3rd parties.

    As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.

    On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    The data back in the R2000-3000 days indicated that 32 GPRs has a 15%+ advantage over 16 GPRs; while 64 had only a 3% advantage.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with large numbers of local variables (some exceeding 100 local variables).

    Partly, though, this is because code that is highly inlined and unrolled
    and uses lots of variables tends to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
    There is no frame pointer, as BGBCC doesn't use one;

    Can't do PASCAL and other ALGOL-derived languages with block structure.

    All stack-frames are fixed size, VLA's and alloca use the heap;

    longjmp() is at a serious disadvantage here.
    destructors are sometimes hard to position on the stack.

    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
    TLS, accessed via TBR.

    Try/throw/catch:
    Mostly N/A for leaf functions.

    Any function that can "throw" is in effect no longer a leaf function. Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?

    Need for GBR save/restore effectively excludes a function from being tiny-leaf. This may happen, say, if a function accesses global variables
    and may be called as a function pointer.

    ------------------------------------------------------

    One "TODO" here would be to merge constants with the same "actual" value
    into the same register. At present, they will be duplicated if the types
    are sufficiently different (such as integer 0 vs NULL).
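    The merging being described amounts to keying the constant pool by raw bit pattern rather than by (type, value), so integer 0 and NULL share one register. A minimal sketch, with hypothetical names and limits (this is not BGBCC's actual code):

```c
#include <stdint.h>

/* Constant-pool entries keyed by raw 64-bit pattern only; the type of
 * the constant is deliberately ignored, so 0 and NULL merge. */
#define POOL_MAX 64

typedef struct {
    uint64_t bits;   /* raw value, type ignored */
    int      reg;    /* register statically assigned to it */
} ConstEntry;

static ConstEntry pool[POOL_MAX];
static int pool_count = 0;

/* Return the register holding this constant, allocating one if new.
 * 'next_reg' stands in for the register allocator. */
int const_reg(uint64_t bits, int *next_reg)
{
    for (int i = 0; i < pool_count; i++)
        if (pool[i].bits == bits)
            return pool[i].reg;          /* merged: same bits, same reg */
    pool[pool_count].bits = bits;
    pool[pool_count].reg  = (*next_reg)++;
    return pool[pool_count++].reg;
}
```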

    In practice, the upper 48 bits of an extern variable's address are completely shared, whereas the lower 16 bits are unique.

    For functions with dynamic assignment, immediate values are more likely
    to be used. If the code-generator were clever, potentially it could
    exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate. Currently,
    BGBCC is not that clever.

    And then there are languages like PL/1 and FORTRAN where the compiler
    has to figure out how big an intermediate array is, allocate it, perform
    the math, and then deallocate it.

    Or, say:
    y=x+31; //31 only being used here, and fits easily in an Imm9.
    Ideally, the compiler could realize that 31 does not need a register here.
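    The check being described could look something like the following; the Imm9 range and helper names are assumptions for illustration, not BGBCC internals:

```c
#include <stdint.h>
#include <stdbool.h>

/* Does a constant fit a signed 9-bit immediate field (the Imm9 of the
 * y = x + 31 example)? */
bool fits_imm9(int64_t v)
{
    return v >= -256 && v <= 255;
}

/* Hypothetical helper: a constant needs a register only if some using
 * instruction cannot encode it directly. 'uses_allow_imm[i]' says
 * whether use i has an immediate form available. */
bool const_needs_register(int64_t v, const bool *uses_allow_imm, int n_uses)
{
    for (int i = 0; i < n_uses; i++)
        if (!uses_allow_imm[i] || !fits_imm9(v))
            return true;
    return false;
}
```

If every use passes, the register allocator can simply skip the constant.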


    Well, and another weakness is with temporaries that exist as function arguments:
    If statically assigned, the "target variable directly to argument register" optimization can't be used (the value ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the compiler breaks...).

    Though, I guess possible could be that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid
    statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Brian's compiler finds the largest argument list and the largest return
    value list and merges them into a single area on the stack used only
    for passing arguments and results across the call interface. And the
    <static> SP points at this area.

    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...


    More orthogonality helps.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Apr 10 17:29:22 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

  • From MitchAlsup1@21:1/5 to BGB on Wed Apr 10 17:12:47 2024
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??

    And, if one wanted a 16-bit branch:
    MOV.W (PC, 4), R0 //load a 16-bit branch displacement
    BRA/F R0
    .L0:
    NOP // delay slot
    .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single inst-
    ruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........

    For small copies, can encode them inline, but past a certain size this becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high overhead
    for small to medium copies.


    So, there is a size range where doing it inline would be too bulky, but
    a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an instruction.

    Ended up doing these with "slides", which end up eating roughly several
    kB of code space, but was more compact than using larger inline copies.


    Say (IIRC):
    128 bytes or less: Inline Ld/St sequence
    129 bytes to 512B: Slide
    Over 512B: Call "memcpy()" or similar.

    Versus::
    1-infinity: use MM instruction.

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.
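    The dispatch just described (tail handled first, then entry into the slide at a 32-byte-multiple entry point) can be sketched in C. The slide itself is modeled here as an ordinary loop; on the real machine it is a computed branch into unrolled code, and the names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes the way the slide does: the sub-32-byte tail is copied
 * externally first, then we "enter the slide" at the entry point for
 * the rounded-down size, which copies 32-byte blocks from the high
 * address downward. */
void slide_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t body = n & ~(size_t)31;            /* multiple-of-32 part */
    size_t tail = n - body;

    memcpy(dst + body, src + body, tail);     /* tail handled externally */

    /* equivalent of branching into the slide at entry point 'body' */
    for (size_t off = body; off >= 32; off -= 32)
        memcpy(dst + off - 32, src + off - 32, 32);
}
```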

    Does this remain sequentially consistent ??

    Though, this is only used for fixed-size copies (or "memcpy()" when
    value is constant).


    Say:

    __memcpy64_512_ua:
    MOV.Q (R5, 480), R20
    MOV.Q (R5, 488), R21
    MOV.Q (R5, 496), R22
    MOV.Q (R5, 504), R23
    MOV.Q R20, (R4, 480)
    MOV.Q R21, (R4, 488)
    MOV.Q R22, (R4, 496)
    MOV.Q R23, (R4, 504)

    __memcpy64_480_ua:
    MOV.Q (R5, 448), R20
    MOV.Q (R5, 456), R21
    MOV.Q (R5, 464), R22
    MOV.Q (R5, 472), R23
    MOV.Q R20, (R4, 448)
    MOV.Q R21, (R4, 456)
    MOV.Q R22, (R4, 464)
    MOV.Q R23, (R4, 472)

    ....

    __memcpy64_32_ua:
    MOV.Q (R5), R20
    MOV.Q (R5, 8), R21
    MOV.Q (R5, 16), R22
    MOV.Q (R5, 24), R23
    MOV.Q R20, (R4)
    MOV.Q R21, (R4, 8)
    MOV.Q R22, (R4, 16)
    MOV.Q R23, (R4, 24)
    RTS

    Duff's device by any other name.

  • From MitchAlsup1@21:1/5 to BGB-Alt on Wed Apr 10 21:19:20 2024
    BGB-Alt wrote:

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).

    My suggestion is that:: "Now that you have screwed around for a while,
    Why not take that experience and do a new ISA without any of those
    mistakes in it" ??

    Did have a mess when I later extended the ISA to 32 GPRs, as (like with
    BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    But you did not fix them en masse, and you complain about them
    at least once a week. There comes a time when it takes less time
    and less courage to do that big switch and clean up all that mess.


    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the merit
    of being able to branch anywhere within a Doom or Quake sized binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Maybe consider now as the appropriate time to start.

    Granted, had I started with RISC-V instead of SuperH, it is probable
    BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than RV32I
    or similar. Turns out not really, as the penalty of the 16 bit ops was needing almost twice as many on average.

    My 66000 only requires 70% of the instruction count of RISC-V.
    Yours could too ................

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    inst-
    ruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    You would not have to spend hours a week defending the indefensible !!

    I had generally avoided anything that will have required microcode or
    shoving state-machines into the pipeline or similar.

    Things as simple as IDIV and FDIV require sequencers.
    But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!

    Things like Load/Store-Multiple or

    If you like polluted ICaches..............

    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copies.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an
    instruction.


    This is an area where "slides" work well, the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Consider that the predictor getting into the slide the first time
    always mispredicts !!

    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

    What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
    yet a HW sequencer only has to avoid asserting a single byte write enable
    once.

    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
    iteration or so to try to limit looping overhead.

    On low end machines, you want to operate at cache port width,
    On high end machines, you want to operate at cache line widths per port.
    This is essentially impossible using slides.....here, the same code is
    not optimal across a line of implementations.

    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.

    What do you do when the SATA drive wants to write a whole page ??

    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
    shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added a
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for each
    LZ decoder (as I see it, it is a similar issue to not wanting code to endlessly re-roll stuff for functions like memcpy or malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
        1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.

    By what, 37-gates ??

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    What if a SATA drive is reading while you are writing !!
    That is, DMA is no different than multi-threaded applications--except
    DMA cannot perform locks.

    Main wonk is that it does start copying from the high address first. Presumably interrupts or similar won't be messing with application memory
    mid-memcpy.

    The only things wanting high-low access patterns are dumping stuff to the stack. The fact you CAN get away with it most of the time is no excuse.

    The looping memcpy's generally work from low to high addresses though.

    As does all string processing.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Apr 10 23:30:02 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD instruction and 1 or 2 words in DCache, while consuming a GPR. So, overall, it takes
    fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have no
    direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Apr 11 14:13:24 2024
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the
    instruction. Easy to find, easy to access, no register pollution,
    no DCache pollution.

    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

  • From Terje Mathisen@21:1/5 to Scott Lurndal on Thu Apr 11 12:22:47 2024
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    Except it pretty rarely does so (increases icache pressure):

    mov temp_reg, offset const_table
    mov reg,qword ptr [temp_reg+const_offset]

    looks to me like at least 5 bytes for the first instruction and probably
    6 for the second, for a total of 11 (could be as low as 8 for a very
    small offset), all on top of the 8 bytes of dcache needed to hold the
    64-bit value loaded.

    In My 66000 this should be a single 32-bit instruction followed by the
    8-byte const, so 12 bytes total and no dcache interference.

    It is only when you do a lot of 64-bit data loads, all gathered in a
    single 256-byte buffer holding up to 32 such values, and you can afford
    to allocate a fixed register pointing to the middle of that range, that
    you actually gain some total space: Each load can now just do a

    mov reg,qword ptr [fixed_base_reg+byte_offset]

    which, due to the need for a 64-bit prefix, will probably need 4
    instruction bytes on top of the 8 bytes from dcache. At this point we
    are touching exactly the same number of bytes (12) as My 66000, but from
    two different caches, so much more likely to suffer dcache misses.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Thu Apr 11 14:30:27 2024
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.



    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The
    entire system
    sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in a
    SMP processor...

  • From MitchAlsup1@21:1/5 to BGB on Thu Apr 11 18:46:54 2024
    BGB wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

    MOV Imm16, Rn
    SHORI Imm16, Rn
    SHORI Imm16, Rn
    SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
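    The semantics of the sequence can be modeled directly, assuming the initial MOV zero-extends its Imm16 and SHORI does a shift-left-16-then-OR, per the description above:

```c
#include <stdint.h>

/* Each SHORI shifts the register left 16 and ORs in a new 16-bit chunk,
 * so MOV + 3x SHORI builds a full 64-bit constant, high chunk first. */
uint64_t mov_imm16(uint16_t imm)         { return imm; }
uint64_t shori(uint64_t r, uint16_t imm) { return (r << 16) | imm; }

uint64_t build64(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    uint64_t r = mov_imm16(a);   /* MOV   Imm16, Rn */
    r = shori(r, b);             /* SHORI Imm16, Rn */
    r = shori(r, c);             /* SHORI Imm16, Rn */
    r = shori(r, d);             /* SHORI Imm16, Rn */
    return r;
}
```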

    As compared to::

    CALK Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}

    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
    1-cycle, is preferable....

    A consuming instruction where you don't even use a register is better
    still !!

  • From MitchAlsup1@21:1/5 to BGB-Alt on Thu Apr 11 23:06:05 2024
    BGB-Alt wrote:

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:


    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

        CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding.

    While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
    just simplest. Then, once you start widening the microarchitecture, it
    is better to fetch wider than decode-issue so that you suffer least
    from boundary conditions. Once you start fetching wide OR have wide
    decode-issue, you have ALL the infrastructure to do variable length
    instructions. Thus, the complaint that VLE is hard has already been
    eradicated.

    Also, technically, it could be retrofitted onto RISC-V without any
    significant change, unlike some other options (as noted, I don't argue
    for adding Jumbo prefixes to RV, on the basis that there is no real
    viable way to add them to RV, *).

    The issue is that once you do VLE, RISC-V's ISA is no longer helping you
    get the job done, especially when you have to execute 40% more instructions.

    Sadly, the closest option to viable for RV would be to add the SHORI instruction and optionally pattern match it in the fetch/decode.

    Or, say:
    LUI Xn, Imm20
    ADD Xn, Xn, Imm12
    SHORI Xn, Imm16
    SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
    only if the Imm12 is positive), and 2x SHORI into a combined "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.
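    The fusion being proposed can be modeled as a toy pattern matcher over the decoded stream: LUI+ADD fuses into one constant-load macro-op, and SHORI+SHORI into one "Xn=(Xn&lt;&lt;32)|Imm32" macro-op, so the 4-instruction build issues in 2 cycles. This is an illustrative sketch, not any real decoder:

```c
#include <stddef.h>

/* Hypothetical decoded-instruction stream; only the fields the fuser
 * looks at are modeled. */
enum Op { LUI, ADDI, SHORI };

typedef struct { enum Op op; int rd; } Inst;

/* Count issue cycles after fusion, assuming each fused pair and each
 * lone instruction takes one cycle. */
int fused_cycles(const Inst *code, size_t n)
{
    int cycles = 0;
    for (size_t i = 0; i < n; ) {
        int pair = (i + 1 < n) && code[i].rd == code[i + 1].rd &&
                   ((code[i].op == LUI   && code[i + 1].op == ADDI) ||
                    (code[i].op == SHORI && code[i + 1].op == SHORI));
        i += pair ? 2 : 1;     /* fused pair consumes two instructions */
        cycles++;
    }
    return cycles;
}
```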

    Universal constants gets this down to 0 cycles......

    *: To add a jumbo prefix, one needs an encoding that:
    Uses up a really big chunk of encoding space;
    Is otherwise illegal and unused.
    RISC-V doesn't have anything here.

    Which is WHY you should not jump ship from SH to RV, but jump to an
    ISA without these problems.

    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal
    encoding space mostly because if I put instructions in there (with the existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16 ops
    or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Just another reason not to stay with what you have developed.

    In comparison, I reserve 6 major OpCodes so that a control transfer into
    data is highly likely to get Undefined OpCode exceptions rather than
    try to execute what is in that data. Then, as it is, I still have 21 slots
    in the major OpCode group free (27 if you count the permanently reserved).

    Much of this comes from side effects of Universal Constants.


    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
    1-cycle, is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far I only have 33-bit immediate values. Luckily, Imm33s seems
    to address around 99% of uses (for normal ALU ops and similar).

    What do you do when accessing data that the linker knows is more than 4GB
    away from IP ?? or known to be outside of 0-4GB ?? externs, GOT, PLT, ...

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By themselves though, the difference doesn't seem enough to justify the cost.

    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.

    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
    12 bytes (and allowing a 16-byte encoding would have too steep of a cost increase to be worthwhile).

    And yet I did.

    So, alas...

    Yes, alas..........

  • From MitchAlsup1@21:1/5 to BGB-Alt on Thu Apr 11 23:22:16 2024
    BGB-Alt wrote:

    On 4/11/2024 9:30 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:


    One thing that is still needed is a good, fast, and semi-accurate way to
    pull off the Z=1.0/Z' calculation, as needed for perspective-correct rasterization (affine requires subdivision, which adds cost to the
    front-end, and interpolating Z directly adds significant distortion for geometry near the near plane).

    I saw a 10-cycle latency 1-cycle throughput divider at Samsung::
    10 stages of 3-bit at a time SRT divider with some exponent stuff
    on the side. 1.0/z is a lot simpler than that (float only). A lot
    of these great big complicated calculations can be beaten into
    submission with a clever attack of brute force HW.....FMUL and FMAC
    being the most often cited cases.
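
    For a software analogue of the "semi-accurate 1/z" point: a bit-level
    initial guess refined by Newton-Raphson steps is one classic
    division-free attack. The magic constant and function name below are
    illustrative choices, not anything from the hardware being discussed:

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Division-free reciprocal sketch: a crude bit-trick initial guess
       (roughly 5% relative error), then two Newton-Raphson refinements
       y = y*(2 - z*y), each of which squares the relative error. */
    static float approx_recip(float z)
    {
        uint32_t bits;
        memcpy(&bits, &z, sizeof bits);    /* reinterpret float as bits */
        bits = 0x7EF311C3u - bits;         /* illustrative magic constant */
        float y;
        memcpy(&y, &bits, sizeof y);
        y = y * (2.0f - z * y);            /* Newton step 1 */
        y = y * (2.0f - z * y);            /* Newton step 2 */
        return y;
    }
    ```

    Each Newton step costs one FMUL plus one FMAC-shaped operation, which is
    why those two units get cited as the brute-force workhorses.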

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Apr 11 23:12:25 2024
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.

    Effectively, that is what HW does, even on the lower end machines,
    the AGEN unit of the Cache access pipeline is repeatedly cycled,
    and data is read and/or written. One can execute instructions not
    needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
    are in progress.

    Moving this sequencer farther out would still require it to consume
    all L1 BW in any event (snooping) for memory consistency reasons.
    {Note: cache accesses are performed line-wide not register width wide}


    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The
    entire system
    sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in a
    SMP processor...

    The entire chunk of data traverses the interconnect as a single
    transaction. All interested 3rd parties (not originator nor
    recipient) see either the memory state before the transfer or
    after the transfer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Fri Apr 12 02:19:04 2024
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.
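
    The reach limit can be sketched in C: LUI supplies the upper 20 bits of
    a 32-bit word and ADDI a sign-extended 12-bit low part, so the pair can
    only ever produce a sign-extended 32-bit value. The model function below
    is an illustration of RV64 semantics, not compiler code:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Model of RV64 "LUI rd,hi20 ; ADDI rd,rd,lo12": the result is a
       sign-extended 32-bit quantity, never an arbitrary 64-bit constant. */
    static int64_t lui_addi(int32_t hi20, int32_t lo12)
    {
        int64_t v = (int64_t)(int32_t)((uint32_t)hi20 << 12); /* LUI, sign-extends bit 31 */
        return v + lo12;                                      /* ADDI */
    }
    ```

    Reaching a genuine 64-bit constant this way takes additional shifts and
    adds (or a load), which is the contrast being drawn with a single
    instruction carrying the full constant.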

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.


    Maybe. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Apr 12 01:40:27 2024
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    std #4614300636657501161,[sp,88] // a[0]
    std #4645348406721991307,[sp,104] // a[2]
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7
    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11
    ; %bb.3:
    fmul r3,r1,r1
    fdiv r3,#1,r3
    mov r4,#0x3F90B4FB18B485C7 // p[5]
    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]
    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header: Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1
    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail
    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header: Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Fri Apr 12 13:40:01 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !! =20
    =20
    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends. =20
    =20
    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.
    =20

    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen LD-OP
    architecture that had SUBR instruction. May be, TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Fri Apr 12 18:08:33 2024
    On Fri, 12 Apr 2024 13:40:01 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !! =20
    =20
    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends. =20
    =20
    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.
    =20

    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen LD-OP
    architecture that had SUBR instruction. May be, TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.


    ARM LDADD is not a LD-OP instruction. It is RMW.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Apr 12 23:46:33 2024
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants, and particular ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar, are rare).


    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    ADD -128, SP
    std #4614300636657501161,[sp,88] // a[0]
    MOV 0x400949FB3ED443E9, R3
    MOV.Q R3, (SP, 88)
    std #4645348406721991307,[sp,104] // a[2]
    MOV 0x407797C38897528B, R3
    MOV.Q R3, (SP, 104)
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    .... pattern is obvious enough.
    Each constant needs 12 bytes, so 16 bytes/store.

    But 2 instructions instead of 1 and 16 bytes instead of 12.

    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    FABS R5, R6
    FLDH 0x3780, R3 //A
    FCMPGT R3, R6 //A
    BT .LBB141_6 //A

    Or (FP-IMM extension):

    FABS R5, R6
    FCMPGE 0x0DE, R6 //B (FP-IMM)
    BF .LBB141_6 //B

    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7

    FCMPGE 0x110, R6
    BF .LBB141_7

    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11

    MOV 0x403A8B020C49BA5E, R3
    FCMPGT R3, R6
    BT .LBB141_11

    Where, FP-IMM wont work with that value.

    Value came from source code.

    ; %bb.3:
    fmul r3,r1,r1
    FMUL R5, R5, R7
    fdiv r3,#1,r3
    Skip, operation gives identity?...

    It is a reciprocate R3 = #1.0/R3

    mov r4,#0x3F90B4FB18B485C7 // p[5]
    Similar.

    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]

    Turns into 4 constants, 7 FPU instructions (if no FMAC extension, 4 with FMAC). Though, at present, FMAC is slower than separate FMUL+FADD.

    So, between 8 and 11 instructions.

    Instead of 4.....

    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]

    These can map 1:1.

    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header:
    Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1

    Could be mapped to a scalar loop, pretty close to 1:1.

    I have 7 instructions in the loop, you would have 9.

    Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
    guessing 2 copies for a 4-element vector?...


    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail

    Same patterns as before.
    Would need ~ 10 ops.

    Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
    but this would only give "Binary32 truncated to 29 bits" precision for
    the immediate values.

    Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
    give ~ 53 bits of precision. But, if one needs full Binary64, this will
    not gain much in this case.


    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header:
    Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    Don't really have time at the moment to comment on the rest of this...


    In other news, found a bug in the function dependency-walking code.

    Fixing this bug got things a little closer to break-even with RV64G GCC
    output regarding ".text" size (though, was still not sufficient to
    entirely close the gap).


    This was mostly based on noting that the compiler output had included
    some things that were not reachable from within the program being
    compiled (namely, noticing that the Doom build had included a copy of
    the MS-CRAM video decoder and similar, which was not reachable from
    anywhere within Doom).

    Some more analysis may be needed.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Apr 13 03:17:43 2024
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants, and particular ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar, are rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Sun Apr 14 22:58:22 2024
    Stephen Fuld wrote:
    <snip>

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they
    are saving or restoring in the same data structure it uses for the
    registers (yes, it adds 32 bits to that structure – minimal cost).
    The same mechanism works for interrupts that take control away from a
    running process.

    I had missed this until now:: The stack remains 64-bit aligned at all
    times, so if you add 32-bits to the stack you actually add 64-bits to
    the stack.

    Given this, you can effectively use a 2-bit tag {integral, floating,
    pointing, describing}. The difference between pointing and describing
    is that pointing is C-like, while describing is dope-vector-like.
    {{Although others may find something else to put in the 4-th slot.}}
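
    The 2-bit-per-register scheme above packs tags for all 32 registers into
    one 64-bit word of the save area. A minimal sketch of that packing; the
    names and the particular encoding are illustrative, not a My 66000
    definition:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* The four tag kinds suggested in the thread; encoding is assumed. */
    enum tag { TAG_INTEGRAL = 0, TAG_FLOATING = 1,
               TAG_POINTING = 2, TAG_DESCRIBING = 3 };

    /* Set register reg's 2-bit tag field within the 64-bit save word. */
    static uint64_t set_tag(uint64_t tags, int reg, enum tag t)
    {
        tags &= ~(UINT64_C(3) << (2 * reg));      /* clear old field */
        return tags | ((uint64_t)t << (2 * reg)); /* install new tag */
    }

    /* Read register reg's 2-bit tag field back out. */
    static enum tag get_tag(uint64_t tags, int reg)
    {
        return (enum tag)((tags >> (2 * reg)) & 3);
    }
    ```

    ENTER/EXIT would then save and restore this one word alongside the
    registers, keeping the stack 64-bit aligned as noted.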

    Any comments are welcome.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Apr 14 23:25:52 2024
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
    {
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
    {c, sum[i]} = a[i] + b[i] + c;
    }
    return;
    }

    Assembly code::

    .global mpn_add_n
    mpn_add_n:
    MOV R5,#0 // c
    MOV R6,#0 // i

    VEC R7,{}
    LDD R8,[R2,Ri<<3]
    LDD R9,[R3,Ri<<3]
    CARRY R5,{{IO}}
    ADD R10,R8,R9
    STD R10,[R1,Ri<<3]
    LOOP LT,R6,#1,R4
    RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a
    performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does
    CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.
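
    The carry chain that CARRY/ADD capture in a single instruction pair can
    be written out in portable C. This is a reference sketch using the
    GCC/Clang builtin __builtin_add_overflow, not the libgmp kernel itself:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Reference version of the mpn_add_n loop above: each limb addition
       takes a carry in and produces a carry out. */
    static void mpn_add_n_ref(uint64_t *sum, const uint64_t *a,
                              const uint64_t *b, int n)
    {
        uint64_t c = 0;                               /* carry, as in the source */
        for (int i = 0; i < n; i++) {
            uint64_t t;
            uint64_t c1 = __builtin_add_overflow(a[i], b[i], &t);
            uint64_t c2 = __builtin_add_overflow(t, c, &sum[i]);
            c = c1 | c2;                              /* at most one can be set */
        }
    }
    ```

    The two builtin calls per limb show what CARRY {{IO}} folds away: the
    ISA threads the carry through a single ADD instead of materializing it
    in a general register.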

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Mon Apr 15 10:02:46 2024
    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t sum[], uint64_t a[], uint64_t b[], int n )
    {
        uint64_t c = 0;
        for( int i = 0; i < n; i++ )
        {
             {c, sum[i]} = a[i] + b[i] + c;
        }
        return;
    }

    Assembly code::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]
        LDD   R9,[R3,Ri<<3]
        CARRY R5,{{IO}}
        ADD   R10,R8,R9
        STD   R10,[R1,Ri<<3]
        LOOP  LT,R6,#1,R4
        RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.

    ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
    xor rax,rax ;; Clear carry
    next:
    mov rax,[rsi+rcx*8]
    adc rax,[rdx+rcx*8]
    mov [rdi+rcx*8],rax
    inc rcx
    jnz next

    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.
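    The RCX = -n idiom above can be expressed in portable C; this is a sketch
    (the comparison arithmetic stands in for the ADC carry flag), not actual
    library code:

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* The "RCX = -n" trick: point at the END of each array and index from
       -n up toward 0, so the loop branch is just increment-and-test, as in
       the INC RCX / JNZ pair above. */
    static uint64_t add_n_negidx(uint64_t *sum, const uint64_t *a,
                                 const uint64_t *b, ptrdiff_t n)
    {
        sum += n; a += n; b += n;              /* RDI/RSI/RDX -> array ends */
        uint64_t c = 0;
        for (ptrdiff_t i = -n; i != 0; i++) {  /* INC RCX / JNZ next */
            uint64_t t = a[i] + c;
            uint64_t c1 = t < c;               /* carry out of a[i] + c */
            sum[i] = t + b[i];
            c = c1 + (sum[i] < t);             /* always 0 or 1 */
        }
        return c;
    }
    ```

    The design point is that the induction variable doubles as the loop
    condition, freeing a register and a compare.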

    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle

    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle

    inc ecx
    jnz next ; Third cycle

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Terje Mathisen on Mon Apr 15 11:16:15 2024
    Terje Mathisen wrote:
    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t sum[], uint64_t a[], uint64_t b[], int n )
    {
        uint64_t c = 0;
        for( int i = 0; i < n; i++ )
        {
             {c, sum[i]} = a[i] + b[i] + c;
        }
        return;
    }

    Assembly code::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]
        LDD   R9,[R3,Ri<<3]
        CARRY R5,{{IO}}
        ADD   R10,R8,R9
        STD   R10,[R1,Ri<<3]
        LOOP  LT,R6,#1,R4
        RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a
    performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does
    CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.

      ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
      xor rax,rax ;; Clear carry
    next:
      mov rax,[rsi+rcx*8]
      adc rax,[rdx+rcx*8]
      mov [rdi+rcx*8],rax
      inc rcx
       jnz next

    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.

    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

     next:
      adc eax,ebx
      mov ebx,[edx+ecx*4]    ; First cycle

      mov [edi+ecx*4],eax
      mov eax,[esi+ecx*4]    ; Second cycle

      inc ecx
       jnz next        ; Third cycle


    In the same bad old days, the standard way to speed it up would have
    used unrolling, but until we got more registers, it would have stopped
    itself very quickly. With AVX2 we could use 4 64-bit slots in a 32-byte register, but then we would have needed to handle the carry propagation manually, and that would take longer than a series of ADC/ADX instructions.

    next4:
    mov eax,[esi]
    adc eax,[esi+edx]
    mov [esi+edi],eax
    mov eax,[esi+4]
    adc eax,[esi+edx+4]
    mov [esi+edi+4],eax
    mov eax,[esi+8]
    adc eax,[esi+edx+8]
    mov [esi+edi+8],eax
    mov eax,[esi+12]
    adc eax,[esi+edx+12]
    mov [esi+edi+12],eax
    lea esi,[esi+16]
    dec ecx
    jnz next4
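    To make the carry-propagation objection concrete, here is a scalar C
    sketch of what an AVX2 version would have to do: one lane-wise add (the
    part a single VPADDQ gives you for four lanes), followed by a serial
    ripple of the inter-lane carries, which is the part the vector unit does
    not help with. Names are illustrative.

    ```c
    #include <stdint.h>

    /* 4-lane add modelled SIMD-style: vector add first, then a serial
       carry ripple across lanes. Returns the carry out of lane 3. */
    static uint64_t simd_style_add4(uint64_t s[4],
                                    const uint64_t a[4], const uint64_t b[4])
    {
        uint64_t lane_c[4];
        for (int i = 0; i < 4; i++) {      /* the "vector" (lane-wise) add */
            s[i] = a[i] + b[i];
            lane_c[i] = s[i] < a[i];       /* per-lane carry out */
        }
        uint64_t c = 0;
        for (int i = 0; i < 4; i++) {      /* the serial ripple */
            s[i] += c;
            c = lane_c[i] | (s[i] < c);    /* cannot both be 1 */
        }
        return c;
    }
    ```

    The second loop is exactly the dependence chain ADC/ADX resolves in
    hardware, which is why the series of scalar ADCs wins.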

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Mon Apr 15 19:03:34 2024
    Michael S wrote:

    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUPIC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.


    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen LD-OP architecture that had SUBR instruction. May be, TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.

    That a SUBR instruction exists does not disavow my statement that
    the inbound memory reference was never in the Rs1 position.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Apr 15 20:55:53 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle

    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle

    inc ecx
    jnz next ; Third cycle

    Terje

    As opposed to::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
        LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // Add pair to add octal
        STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
        LOOP  LT,R6,#1,R4         // increment 2-to-8 times
        RET

    --------------------------------------------------------

        LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
        LDD   R9,[R3,Ri<<3]       // AGEN cycle 2, data cycle 4
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // cycle 4
        STD   R10,[R1,Ri<<3]      // AGEN cycle 3, write cycle 5
        LOOP  LT,R6,#1,R4         // cycle 3

    OR

        LDD       LDd
             LDD       LDd
                       ADD
                  ST        STd
                  LOOP
                       LDD       LDd
                            LDD       LDd
                                      ADD
                                 ST        STd
                                 LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !! without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Apr 16 08:44:26 2024
    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

      next:
       adc eax,ebx
       mov ebx,[edx+ecx*4]    ; First cycle

       mov [edi+ecx*4],eax
       mov eax,[esi+ecx*4]    ; Second cycle

       inc ecx
       jnz next        ; Third cycle

    Terje

    As opposed to::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
        LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // Add pair to add octal
        STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
        LOOP  LT,R6,#1,R4         // increment 2-to-8 times
        RET

    --------------------------------------------------------

        LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
        LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // cycle 4
        STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
        LOOP  LT,R6,#1,R4         // cycle 3

    OR

        LDD       LDd
             LDD       LDd
                       ADD
                  ST        STd
                  LOOP
                       LDD       LDd
                            LDD       LDd
                                      ADD
                                 ST        STd
                                 LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
    machine !!
    without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
    machine !!

    It all comes down to the carry propagation, right?

    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block
    to the next, right?

    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Apr 16 18:14:39 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

      next:
       adc eax,ebx
       mov ebx,[edx+ecx*4]    ; First cycle

       mov [edi+ecx*4],eax
       mov eax,[esi+ecx*4]    ; Second cycle

       inc ecx
       jnz next        ; Third cycle

    Terje

    As opposed to::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
        LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // Add pair to add octal
        STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
        LOOP  LT,R6,#1,R4         // increment 2-to-8 times
        RET

    --------------------------------------------------------

        LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
        LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
        CARRY R5,{{IO}}
        ADD   R10,R8,R9           // cycle 4
        STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
        LOOP  LT,R6,#1,R4         // cycle 3

    OR

        LDD       LDd
             LDD       LDd
                       ADD
                  ST        STd
                  LOOP
                       LDD       LDd
                            LDD       LDd
                                      ADD
                                 ST        STd
                                 LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
    machine !!
    without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
    machine !!

    It all comes down to the carry propagation, right?

    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block
    to the next, right?

    Most ST pipelines have an align stage to align the data to be stored to
    where it needs to go; one can extend the carry into this stage if needed,
    and capture both a+b and a+b+1, using the carry-in to select one or the other.
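    The trick described here is essentially carry-select addition. A C model
    might look like the following (hardware would do this with two adders and
    a mux; the function name is invented):

    ```c
    #include <stdint.h>

    /* Carry-select: compute both possible sums up front, then let the
       late-arriving carry-in pick one, as in the ST align stage above. */
    static uint64_t carry_select_add(uint64_t a, uint64_t b,
                                     uint64_t cin, uint64_t *cout)
    {
        uint64_t s0 = a + b;          /* result if carry-in = 0 */
        uint64_t s1 = a + b + 1;      /* result if carry-in = 1 */
        uint64_t c0 = s0 < a;         /* carry-out under each assumption */
        uint64_t c1 = s1 <= a;
        *cout = cin ? c1 : c0;
        return cin ? s1 : s0;         /* the carry-in only drives a select */
    }
    ```

    Because the adds never wait on the carry-in, the carry path shrinks to
    a mux delay, which is what makes pushing it into a later stage cheap.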

    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to EricP on Tue Apr 16 15:06:48 2024
    On 4/3/2024 11:44 AM, EricP wrote:
    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First,
    it has several features that are “friendly” to the idea.  Second, I
    know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISA's
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded
    value to double precision.

    big snip

    Currently the opcode data type can tell the uArch how to route
    the  operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Seems right.



    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.

    Could be. Again, IANAHG.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Tue Apr 16 15:08:58 2024
    On 4/3/2024 1:02 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions. Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    I think that is probably true for 32 bit instructions, but what about 16
    bit?

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Tue Apr 16 15:02:13 2024
    On 4/3/2024 10:24 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the disassembler).

    Good point. It probably isn't too bad for the arithmetic operations,
    etc, but once you extend it as I suggested in the last paragraph it gets
    ugly. :-(


    big snip

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use?

    In an OoO CPU, that's pretty heavy.

    OK, but in the vast majority of cases (i.e. unless there is something
    like a conditional branch that uses floating point or integer depending
    upon whether the branch is taken.) the flag bit that a register will
    have can be known well in advance. As I said, IANAHG, but that might
    make it easier.



    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT),

    But even here, you almost certainly know what the tag bit for any given register is long before you execute the EXIT instruction. And remember,
    on MY 66000 EXIT is performed lazily, so you have time and the mechanism
    is in place to wait if needed.


    so you
    probably can handle the tags in the front end (decoder and renamer).
    Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.

    Yes.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Apr 16 16:44:27 2024
    On 4/3/2024 2:30 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
                   66000
    Sorry. Typo.



    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    What do you do when you want a FP bit pattern interpreted as an integer,
    or vice versa.

    As I said below, if you need that, you can use an otherwise "useless"
    instruction, such as ORing a register with itself, to modify the tag bits.





    When executing arithmetic instructions, if the tag bits of both
    sources of an instruction are the same, do the appropriate operation
    (floating or integer), and set the tag bit of the result register
    appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

    Good point.




    3.    Always do the operation in floating point and convert the
    integer operand prior to the operation.  (Or, if you prefer, change
    floating point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is
    the best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense
    for FP operands, e.g. the transcendental instructions.  And there are
    some operations that probably only make sense for non-FP operands,
    e.g. POP, FF1, probably shifts.  Given the tag bit, these could share
    the same op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    Agreed.


    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if FP
    and Integers share a register file, the lack of a prototype is much less
    stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter
    and Exit instructions to save/restore the tag bits of the registers
    they are saving or restoring in the same data structure it uses for
    the registers (yes, it adds 32 bits to that structure – minimal
    cost).  The same mechanism works for interrupts that take control
    away from a running process.

    Yes, but we do just fine without the tag and without the stuff mentioned above. Neither ENTER nor EXIT care about the 64-bit pattern in the
    register.

    I think you need it for callee saved registers to ensure the tag is set
    correctly for the calling program upon return to it.



    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some
    other instructions to do this, without requiring another op-code.
    For example, Oring a register with itself could be used to set the
    tag bit and Oring a register with zero could clear it.  These should
    be pretty rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it?

    No.

               To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading
    their respective tag bits before knowing which FU to use?

    The problem is you have made decode dependent on dynamic pipeline
    information. I suggest you don't want to do that. Consider a change from
    int to FP instruction as a predicated instruction: the pipeline cannot
    DECODE the instruction at hand until the predicate resolves. Yech.

    Good point.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Apr 17 01:11:12 2024
    Stephen Fuld wrote:

    On 4/3/2024 11:44 AM, EricP wrote:


    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISA's
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded
    value to double precision.

    Insufficient verbal precision::

    My 66000 only cares about the size of a value being loaded from memory
    (or ST into memory).

    While (float) LDs load the 32-bit value from memory, they remain (float)
    while residing in the register; and the High Order 32-bits are ignored.
    The (float) register can be consumed by a (float) FP calculation and it
    remains (float) after processing.

    Small immediates, when consumed by FP instructions, are converted from
    integer to <sized> FP during DECODE. So::

    FADD R7,R7,#1

    adds 1.0D0 to the (double) value in R7 (and takes one 32-bit instruction), while:

    FADDs R7,R7,#1

    Adds 1.0E0 to the (float) value in R7.
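    A C model of that decode-time conversion (the helper name is invented for
    illustration): the integer immediate 1 becomes the IEEE-754 double 1.0,
    bit pattern 0x3FF0000000000000.

    ```c
    #include <stdint.h>
    #include <string.h>

    /* Model of the decoder turning a small integer immediate into an FP
       operand of the instruction's width, as for FADD R7,R7,#1. */
    static uint64_t imm_as_double_bits(int64_t imm)
    {
        double d = (double)imm;          /* int -> FP at DECODE time */
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);  /* view the IEEE-754 encoding */
        return bits;
    }
    ```

    The point is that no FP register or extra instruction is consumed: the
    constant is manufactured in the decode stage.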

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to tkoenig@netcologne.de on Wed Apr 17 15:06:05 2024
    On Mon, 8 Apr 2024 17:25:38 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicates whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?

    Well, floating-point and integer instructions of one size each can be arbitrarily mixed. And when different sizes need to mix, going to
    36-bit instructions is low overhead.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Wed Apr 17 15:07:18 2024
    On Mon, 8 Apr 2024 19:56:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256-bits, and looks at the first 4-bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    Yes, I don't disagree. I'm just pointing out that it's possible to
    make the mini tags idea work that way, since it lets you easily turn
    mini tags off when you need to.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)