Forum: >>> Magnum BBS <<<

Nitty-gritty aspects of register allocation

From Elijah Stone@21:1/5 to All on Thu Sep 10 16:41:26 2020

There is a lot of research and a lot of resources on the high-level
aspects of register allocation--colouring interference graphs, lifetimes,
CFGs, etc.

But there are a lot of low-level architecture-specific things that these
models don't account for on their own. Most notably, not all registers
can be used for the same things. Just on amd64:

- Multiplication and division use the rdx:rax register pair.

- Bit shifts always use cl.

- When calling functions, their arguments have to go in specific registers.

In all of these cases, you can spill whatever happens to already be in
those registers, but it would be nicer if you could arrange for the
appropriate values to already be in the right places. But how? A naive solution to the first two problems just makes rax/rcx/rdx the
lowest-priority registers except when doing a multiply/divide/shift; and
spills only if necessary (rare). But that's not a general solution
(imagine if every register has a few pieces of unique functionality), and function calls reserve too many registers for that strategy to be
practical in that case.

Bonus microconsiderations (these seem much easier to model, but still not trivial):

- Some registers need a special prefix to be used (REX prefix). These
registers are generally different from the special-purpose registers
(for e.g. multiplication). Is it better to put a non-multiplied value
in a REX-prefixed register, or keep it in an unprefixed register and
spill it later when you need to multiply?

- The second-lowest 8 bits of some registers can be addressed
separately. When does it make sense to use them?

(All of these are x86-specific, and most architectures admittedly have
fewer esotericisms.)

--
time flies like an arrow;
fruit flies like a banana

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Alexei A. Frounze@21:1/5 to Elijah Stone on Thu Sep 10 18:01:30 2020

On Thursday, September 10, 2020 at 5:14:12 PM UTC-7, Elijah Stone wrote:

There is a lot of research and a lot of resources on the high-level
aspects of register allocation--colouring interference graphs, lifetimes, CFGs, etc.

But there are a lot of low-level architecture-specific things that these models don't account for on their own. Most notably, not all registers
can be used for the same things. Just on amd64:

- Multiplication and division use the rdx:rax register pair.

There are 2- and 3-operand variants of the imul instruction,
which are not restricted to using *DX:*AX.
You can use imul instead of mul if you don't need to get the
product that is twice as wide as any of the multiplicands.

- Bit shifts always use cl.

Not always. When you need to shift by a known constant
amount, you don't need to place the amount in CL.

- When calling functions, their arguments have to go in specific registers.

True for a few of the first arguments. The rest end up on the stack.

In all of these cases, you can spill whatever happens to already be in
those registers, but it would be nicer if you could arrange for the appropriate values to already be in the right places.

Certainly. I've given some thought to this problem in the
context of the i80386 and earlier, where there are more
restrictions, and here's what I think...

- If you can transform a multiplication or a division into
a shift by a constant (or shift + add) or the LEA
instruction, do so.

- Use 2- or 3-operand imul instead of mul whenever possible,
so you don't need to involve the fixed *DX:*AX.

- Widening multiplication should be rare. It should be OK
to spill *DX:*AX.

- Divisions are costly and should be rare. It should be OK
to spill *DX:*AX.

- If you need a shift by an unknown amount, spill *CX if
needed. Or see if you can reorder what you're doing, so
you can avoid spilling *CX.

The rest of the regular instructions (ADD, ADDC, SUB, SBB,
INC, DEC, CMP, TEST, NEG, OR, AND, XOR, NOT, MOV,
SETcc, Jcc, JMP, CALL, RET) are not tied to specific registers
and you can choose any registers for them.

But how? A naive
solution to the first two problems just makes rax/rcx/rdx the
lowest-priority registers except when doing a multiply/divide/shift; and spills only if necessary (rare). But that's not a general solution
(imagine if every register has a few pieces of unique functionality), and function calls reserve too many registers for that strategy to be
practical in that case.

I don't have a simple, optimal and general solution.
But unless it's somehow critical, I would not try to
solve this problem optimally.

Bonus microconsiderations (these seem much easier to model, but still not trivial):

- Some registers need a special prefix to be used (REX prefix). These
registers are generally different from the special-purpose registers
(for e.g. multiplication). Is it better to put a non-multiplied value
in a REX-prefixed register, or keep it in an unprefixed register and
spill it later when you need to multiply?

There may be a very small decoding penalty for REX prefixes
in instructions that use R8-R15. There will be a normal
code-size penalty with regards to using longer instructions.
I would not bother about this too much, just prefer "R0"-"R7"
to R8-R15 when possible.

- The second-lowest 8 bits of some registers can be addressed
separately. When does it make sense to use them?

AFAIR, on modern CPUs there are penalties in using subregisters.
See e.g. Agner Fog's optimization manuals for details.

Alex

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Elijah Stone on Fri Sep 11 10:35:51 2020

Elijah Stone <elronnd@elronnd.net> writes:

There is a lot of research and a lot of resources on the high-level
aspects of register allocation--colouring interference graphs, lifetimes, >CFGs, etc.

But there are a lot of low-level architecture-specific things that these >models don't account for on their own. Most notably, not all registers
can be used for the same things. Just on amd64:

- Multiplication and division use the rdx:rax register pair.

- Bit shifts always use cl.

- When calling functions, their arguments have to go in specific registers.

In all of these cases, you can spill whatever happens to already be in
those registers, but it would be nicer if you could arrange for the >appropriate values to already be in the right places. But how?

A classical approach (at least for graph colouring) is to have a
register coalescing phase before the actual register allocation.
Register coalescing tries to eliminate move instructions by combining
the two live ranges connected by the move into one live range (i.e.,
the same register is used for both). If you have that, you can insert
move instructions before doing the coalescing and register allocation,
and coalescing will eliminate them if possible, and will keep them if
not.

So, for the problem with fixed registers you create live ranges for
the fixed registers, and insert a move from another live range A to
the fixed live range X just before the value in A is needed in the
fixed register, and a move from the fixed live range Y to another live
range B just after a value is put into a fixed register (by an
instruction or by the calling convention). Coalescing will eliminate
as many of these moves as it can; if it can eliminate both of the
moves in the examples, A is in the register X and B is in the register
Y, with no moves necessary.

But that's not a general solution
(imagine if every register has a few pieces of unique functionality)

The technique above is a workaround for relatively rare cases; it
probably does not work so well for an instruction set like the 8080,
where every register has a special purpose. That's why we have seen architectures with general-purpose registers ("register machines")
whenever it was expected that most of the code is generated by
compilers rather than assembly programmers. That includes IA-32 and
AMD64, where the instructions with implicit registers (like XLAT or
LODS) were superseded by faster implementations of general
instructions (e.g., on the 486 LODS takes 5 cycles, while the general replacement takes 2 cycles and has no register requirements).

- Some registers need a special prefix to be used (REX prefix). These
registers are generally different from the special-purpose registers
(for e.g. multiplication). Is it better to put a non-multiplied value
in a REX-prefixed register, or keep it in an unprefixed register and
spill it later when you need to multiply?

Coalescing should answer that.

One other aspect of REX is that it is necessary to get 64-bit width
for many instructions. So if you have a 64-bit data type, you can
just as well put it into R8-R15, because you often need a REX anyway.
I would do this with preferencing (i.e., in the register allocation
phase, when you have to choose a register and one in the preferred
class is available, choose it).

- The second-lowest 8 bits of some registers can be addressed
separately. When does it make sense to use them?

Probably never. The 16 other 8-bit registers should be enough; and
even if you want to use them, I recommend checking if using AH BH CH
DH is not a performance pitfall.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bo Persson@21:1/5 to Alexei A. Frounze on Fri Sep 11 11:44:08 2020

On 2020-09-11 at 03:01, Alexei A. Frounze wrote:

On Thursday, September 10, 2020 at 5:14:12 PM UTC-7, Elijah Stone wrote:

- The second-lowest 8 bits of some registers can be addressed
separately. When does it make sense to use them?

AFAIR, on modern CPUs there are penalties in using subregisters.
See e.g. Agner Fog's optimization manuals for details.

A long time ago, when translating 8085 code to 8086 - and mapping a pair
of 8-bit registers onto a 16-bit register - it was an advantage to have separate access to the two halves.

Nowadays you just don't do that, so when more registers were added this
feature was not propagated to those. That's why these instructions are
only available for some registers.

Bo Persson

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Briels
  Tue Apr 23 20:54:03 2024
  from Uk via SSH
- Cronus
  Tue Apr 23 19:46:51 2024
  from Provo, Ut via SSH
- Keyop
  Tue Apr 23 19:40:37 2024
  from Huddersfield, West Yorkshire via SSH
- Guest
  Wed Apr 24 01:40:10 2024
  from A via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	296
Nodes:	16 (2 / 14)
Uptime:	26:41:44
Calls:	6,646
Calls today:	1
Files:	12,193
Messages:	5,327,955

Nitty-gritty aspects of register allocation

Who's Online

Recent Visitors

System Info