• Nitty-gritty aspects of register allocation

    From Elijah Stone@21:1/5 to All on Thu Sep 10 16:41:26 2020
    There is a lot of research and a lot of resources on the high-level
    aspects of register allocation--colouring interference graphs, lifetimes,
    CFGs, etc.

    But there are a lot of low-level architecture-specific things that these
    models don't account for on their own. Most notably, not all registers
    can be used for the same things. Just on amd64:

    - Multiplication and division use the rdx:rax register pair.

    - Bit shifts always use cl.

    - When calling functions, their arguments have to go in specific registers.

    In all of these cases, you can spill whatever happens to already be in
    those registers, but it would be nicer if you could arrange for the
    appropriate values to already be in the right places. But how? A naive solution to the first two problems just makes rax/rcx/rdx the
    lowest-priority registers except when doing a multiply/divide/shift; and
    spills only if necessary (rare). But that's not a general solution
    (imagine if every register has a few pieces of unique functionality), and function calls reserve too many registers for that strategy to be
    practical in that case.

    Bonus microconsiderations (these seem much easier to model, but still not trivial):

    - Some registers need a special prefix to be used (REX prefix). These
    registers are generally different from the special-purpose registers
    (for e.g. multiplication). Is it better to put a non-multiplied value
    in a REX-prefixed register, or keep it in an unprefixed register and
    spill it later when you need to multiply?

    - The second-lowest 8 bits of some registers can be addressed
    separately. When does it make sense to use them?

    (All of these are x86-specific, and most architectures admittedly have
    fewer esotericisms.)

    --
    time flies like an arrow;
    fruit flies like a banana

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alexei A. Frounze@21:1/5 to Elijah Stone on Thu Sep 10 18:01:30 2020
    On Thursday, September 10, 2020 at 5:14:12 PM UTC-7, Elijah Stone wrote:
    There is a lot of research and a lot of resources on the high-level
    aspects of register allocation--colouring interference graphs, lifetimes, CFGs, etc.

    But there are a lot of low-level architecture-specific things that these models don't account for on their own. Most notably, not all registers
    can be used for the same things. Just on amd64:

    - Multiplication and division use the rdx:rax register pair.

    There are 2- and 3-operand variants of the imul instruction,
    which are not restricted to using *DX:*AX.
    You can use imul instead of mul if you don't need to get the
    product that is twice as wide as any of the multiplicands.

    - Bit shifts always use cl.

    Not always. When you need to shift by a known constant
    amount, you don't need to place the amount in CL.

    - When calling functions, their arguments have to go in specific registers.

    True for a few of the first arguments. The rest end up on the stack.

    In all of these cases, you can spill whatever happens to already be in
    those registers, but it would be nicer if you could arrange for the appropriate values to already be in the right places.

    Certainly. I've given some thought to this problem in the
    context of the i80386 and earlier, where there are more
    restrictions, and here's what I think...

    - If you can transform a multiplication or a division into
    a shift by a constant (or shift + add) or the LEA
    instruction, do so.

    - Use 2- or 3-operand imul instead of mul whenever possible,
    so you don't need to involve the fixed *DX:*AX.

    - Widening multiplication should be rare. It should be OK
    to spill *DX:*AX.

    - Divisions are costly and should be rare. It should be OK
    to spill *DX:*AX.

    - If you need a shift by an unknown amount, spill *CX if
    needed. Or see if you can reorder what you're doing, so
    you can avoid spilling *CX.

    The rest of the regular instructions (ADD, ADDC, SUB, SBB,
    INC, DEC, CMP, TEST, NEG, OR, AND, XOR, NOT, MOV,
    SETcc, Jcc, JMP, CALL, RET) are not tied to specific registers
    and you can choose any registers for them.

    But how? A naive
    solution to the first two problems just makes rax/rcx/rdx the
    lowest-priority registers except when doing a multiply/divide/shift; and spills only if necessary (rare). But that's not a general solution
    (imagine if every register has a few pieces of unique functionality), and function calls reserve too many registers for that strategy to be
    practical in that case.

    I don't have a simple, optimal and general solution.
    But unless it's somehow critical, I would not try to
    solve this problem optimally.

    Bonus microconsiderations (these seem much easier to model, but still not trivial):

    - Some registers need a special prefix to be used (REX prefix). These
    registers are generally different from the special-purpose registers
    (for e.g. multiplication). Is it better to put a non-multiplied value
    in a REX-prefixed register, or keep it in an unprefixed register and
    spill it later when you need to multiply?

    There may be a very small decoding penalty for REX prefixes
    in instructions that use R8-R15. There will be a normal
    code-size penalty with regards to using longer instructions.
    I would not bother about this too much, just prefer "R0"-"R7"
    to R8-R15 when possible.

    - The second-lowest 8 bits of some registers can be addressed
    separately. When does it make sense to use them?

    AFAIR, on modern CPUs there are penalties in using subregisters.
    See e.g. Agner Fog's optimization manuals for details.

    Alex

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Elijah Stone on Fri Sep 11 10:35:51 2020
    Elijah Stone <elronnd@elronnd.net> writes:
    There is a lot of research and a lot of resources on the high-level
    aspects of register allocation--colouring interference graphs, lifetimes, >CFGs, etc.

    But there are a lot of low-level architecture-specific things that these >models don't account for on their own. Most notably, not all registers
    can be used for the same things. Just on amd64:

    - Multiplication and division use the rdx:rax register pair.

    - Bit shifts always use cl.

    - When calling functions, their arguments have to go in specific registers.

    In all of these cases, you can spill whatever happens to already be in
    those registers, but it would be nicer if you could arrange for the >appropriate values to already be in the right places. But how?

    A classical approach (at least for graph colouring) is to have a
    register coalescing phase before the actual register allocation.
    Register coalescing tries to eliminate move instructions by combining
    the two live ranges connected by the move into one live range (i.e.,
    the same register is used for both). If you have that, you can insert
    move instructions before doing the coalescing and register allocation,
    and coalescing will eliminate them if possible, and will keep them if
    not.

    So, for the problem with fixed registers you create live ranges for
    the fixed registers, and insert a move from another live range A to
    the fixed live range X just before the value in A is needed in the
    fixed register, and a move from the fixed live range Y to another live
    range B just after a value is put into a fixed register (by an
    instruction or by the calling convention). Coalescing will eliminate
    as many of these moves as it can; if it can eliminate both of the
    moves in the examples, A is in the register X and B is in the register
    Y, with no moves necessary.

    But that's not a general solution
    (imagine if every register has a few pieces of unique functionality)

    The technique above is a workaround for relatively rare cases; it
    probably does not work so well for an instruction set like the 8080,
    where every register has a special purpose. That's why we have seen architectures with general-purpose registers ("register machines")
    whenever it was expected that most of the code is generated by
    compilers rather than assembly programmers. That includes IA-32 and
    AMD64, where the instructions with implicit registers (like XLAT or
    LODS) were superseded by faster implementations of general
    instructions (e.g., on the 486 LODS takes 5 cycles, while the general replacement takes 2 cycles and has no register requirements).

    - Some registers need a special prefix to be used (REX prefix). These
    registers are generally different from the special-purpose registers
    (for e.g. multiplication). Is it better to put a non-multiplied value
    in a REX-prefixed register, or keep it in an unprefixed register and
    spill it later when you need to multiply?

    Coalescing should answer that.

    One other aspect of REX is that it is necessary to get 64-bit width
    for many instructions. So if you have a 64-bit data type, you can
    just as well put it into R8-R15, because you often need a REX anyway.
    I would do this with preferencing (i.e., in the register allocation
    phase, when you have to choose a register and one in the preferred
    class is available, choose it).

    - The second-lowest 8 bits of some registers can be addressed
    separately. When does it make sense to use them?

    Probably never. The 16 other 8-bit registers should be enough; and
    even if you want to use them, I recommend checking if using AH BH CH
    DH is not a performance pitfall.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bo Persson@21:1/5 to Alexei A. Frounze on Fri Sep 11 11:44:08 2020
    On 2020-09-11 at 03:01, Alexei A. Frounze wrote:
    On Thursday, September 10, 2020 at 5:14:12 PM UTC-7, Elijah Stone wrote:

    - The second-lowest 8 bits of some registers can be addressed
    separately. When does it make sense to use them?

    AFAIR, on modern CPUs there are penalties in using subregisters.
    See e.g. Agner Fog's optimization manuals for details.


    A long time ago, when translating 8085 code to 8086 - and mapping a pair
    of 8-bit registers onto a 16-bit register - it was an advantage to have separate access to the two halves.

    Nowadays you just don't do that, so when more registers were added this
    feature was not propagated to those. That's why these instructions are
    only available for some registers.


    Bo Persson

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)