• Re: Modern compilers for ye olde architectures

    From David Brown@21:1/5 to Luke A. Guest on Tue Oct 5 19:59:50 2021
    On 05/10/2021 14:22, Luke A. Guest wrote:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.
    >
    > I know GCC has backends for some older architectures, but these do
    > weird gymnastics such as implementing a virtual CPU in RTL and then
    > lowering further.


    Certainly modern techniques /can/ be applied to a Z80 compiler. And the
    Z80 is a lot more powerful than many 8-bit CISC microcontrollers.

    As far as I know, the only 8-bit target for gcc (in the mainline at
    least) is the AVR. This is a RISC processor, with 32 8-bit registers,
    and a much more orthogonal architecture. gcc does do some "weird
    gymnastics", however - it allocates registers in pairs so that it can
    pretend to be 16-bit, and then code gets simplified in peephole passes.
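
    As a rough illustration of that pairing (a sketch, not actual avr-gcc
    output; the exact instructions depend on the gcc version and options),
    a 16-bit addition in C comes out as an 8-bit add plus an
    add-with-carry on the paired registers:

        #include <stdint.h>

        uint16_t add16(uint16_t a, uint16_t b)
        {
            return a + b;   /* avr-gcc typically emits something like:
                                 add r24, r22   ; low bytes
                                 adc r25, r23   ; high bytes, with carry
                               with a in r25:r24 and b in r23:r22 */
        }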

    gcc also supports some 16-bit CISC devices.

    For the Z80, however, there are a number of compiler options. There are
    some commercial toolchains, such as HiSoft (IIRC), and there is the
    "Rabbit" microcontroller which is a derivative of the Z80. It is
    supported by a compiler for a somewhat extended version of C. Such
    compilers are probably quite old, however, and might not be easily
    available.

    Your best bet is probably SDCC. This is a multi-target open source
    compiler that is in regular development, and aimed specifically for
    small CISC microcontroller cores. The Z80 is one of its targets.

  • From Luke A. Guest@21:1/5 to All on Tue Oct 5 13:22:47 2021
    Hi,

    I have a Z80 project in mind and would like to build a compiler for a
    Z80. I was wondering if modern backend techniques can be applied
    successfully for these old CPUs, i.e. SSA.

    I know GCC has backends for some older architectures, but these do
    weird gymnastics such as implementing a virtual CPU in RTL and then
    lowering further.

    Thanks,
    Luke.

  • From Hans-Peter Diettrich@21:1/5 to Luke A. Guest on Tue Oct 5 22:12:46 2021
    On 10/5/21 2:22 PM, Luke A. Guest wrote:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.
    >
    > I know GCC has backends for some older architectures, but these do
    > weird gymnastics such as implementing a virtual CPU in RTL and then
    > lowering further.

    I wonder what's the purpose of such compilers. Let a Z80 emulate some
    supported CPU and be happy :-)

    DoDi
    [Maybe it's retrocomputing, maybe it's some device with an embedded Z80.
    If that's all the computing you need, why pay for more? -John]

  • From Derek Jones@21:1/5 to All on Wed Oct 6 01:26:55 2021
    Luke,

    > Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.

    Modern, as in invented in 1988.

    Modern, as in post-1985 techniques require lots of memory, so they
    are of no use if you plan to host your compiler on a Z80 itself.
    Otherwise, your next problem is mapping techniques that are designed
    to work well with orthogonal architectures (which the Z80 certainly
    is not).
    [I believe the plan is to cross-compile with Z80 as the target. -John]

  • From Anton Ertl@21:1/5 to Luke A. Guest on Wed Oct 6 07:56:59 2021
    "Luke A. Guest" <laguest@archeia.com> writes:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.

    SSA can certainly be used in a compiler for the Z80, but SSA is not a
    back-end technique; it's a way to represent data flow in the
    intermediate representation.
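
    As a minimal illustration of the idea (a sketch, not the IR of any
    particular compiler): in SSA form every variable is assigned exactly
    once, so each redefinition gets a fresh name, which makes the data
    flow explicit:

        /* source code */       /* SSA form */
        x = a + b;              x1 = a + b;
        x = x + 1;              x2 = x1 + 1;
        y = x * 2;              y1 = x2 * 2;

    Where control flow joins, a phi function selects the version that
    reaches the join, e.g. x3 = phi(x1, x2).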

    As for the back-end, it seems to me that the major problem with the
    Z80 is that it does not have general-purpose registers; instead, many instructions deal with specific registers. Many early architectures
    were like that, and assembly programmers could puzzle out good
    register assignments, but compilers were not particularly good at it.
    So eventually computer architects introduced machines with
    general-purpose registers like the PDP-11, the VAX, and the RISCs; and
    compiler writers developed techniques like graph colouring to make
    good use of these architectures.
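
    To make the graph-colouring idea concrete, here is a deliberately
    simplified C sketch (the interference graph is made up, and real
    Chaitin-style allocators add simplification, coalescing, and smarter
    spill choices): two live ranges that are alive at the same time get
    an edge, and each node then needs a register different from all of
    its already-coloured neighbours.

        #include <stdio.h>

        /* Hypothetical interference graph for 4 live ranges and 3
           registers; interf[i][j] = 1 means i and j are live at the
           same time and must not share a register. */
        enum { NVARS = 4, NREGS = 3 };
        static const int interf[NVARS][NVARS] = {
            {0, 1, 1, 0},
            {1, 0, 1, 1},
            {1, 1, 0, 0},
            {0, 1, 0, 0},
        };

        int main(void)
        {
            int reg[NVARS];
            for (int v = 0; v < NVARS; v++) {
                int used[NREGS] = {0};
                for (int u = 0; u < v; u++)  /* mark colours taken by */
                    if (interf[v][u] && reg[u] >= 0) /* neighbours    */
                        used[reg[u]] = 1;
                reg[v] = -1;                 /* -1 means spill        */
                for (int r = 0; r < NREGS; r++)
                    if (!used[r]) { reg[v] = r; break; }
                if (reg[v] < 0)
                    printf("v%d: spill to memory\n", v);
                else
                    printf("v%d -> r%d\n", v, reg[v]);
            }
            return 0;
        }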

    Maybe with the increased memory and processing power available now,
    one could do better, but given that special-purpose registers are
    mostly a thing of the past, there has not been much research into
    that, that I am aware of. Maybe the puzzle-solving approach to
    register allocation can help, but I am not familiar enough with that
    to say for sure.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

  • From Luke A. Guest@21:1/5 to David Brown on Wed Oct 6 09:00:04 2021
    On 05/10/2021 18:59, David Brown wrote:
    > For the Z80, however, there are a number of compiler options. ...

    I'm not talking about using an existing compiler.

    > Your best bet is probably SDCC. This is a multi-target open source
    > compiler that is in regular development, and aimed specifically for
    > small CISC microcontroller cores. The Z80 is one of its targets.

    I know of SDCC. I'm not talking about using C.

    On 06/10/2021 01:26, Derek Jones wrote:
    > Modern, as in post-1985 techniques require lots of memory, so they
    > are of no use if you plan to host your compiler on a Z80 itself.
    > Otherwise, your next problem is mapping techniques that are designed
    > to work well with orthogonal architectures (which the Z80 certainly
    > is not).
    > [I believe the plan is to cross-compile with Z80 as the target. -John]

    I'm not talking about running a compiler on a Z80, just targeting one.
    [Perhaps you could give us more hints about what you're doing so we
    can offer more useful answers. -John]

  • From Theo@21:1/5 to Luke A. Guest on Wed Oct 6 10:36:29 2021
    Luke A. Guest <laguest@archeia.com> wrote:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.
    >
    > I know GCC has backends for some older architectures, but these do
    > weird gymnastics such as implementing a virtual CPU in RTL and then
    > lowering further.

    There is, it seems, an LLVM backend for Z80: https://github.com/jacobly0/llvm-project
    (see the 'z80' branch)
    It appears TI calculators are the main use case.

    I don't know the current status/functionality, but it would be fun to see
    what the various LLVM passes do to the generated code.

    It seems like there's been some work done on Rust for Z80 (and 6502): https://github.com/jacobly0/llvm-project/issues/15

    Theo

  • From Luke A. Guest@21:1/5 to Theo on Wed Oct 6 16:20:53 2021
    On 06/10/2021 10:36, Theo wrote:
    > Luke A. Guest <laguest@archeia.com> wrote:
    >> I have a Z80 project in mind and would like to build a compiler for
    >> a Z80. I was wondering if modern backend techniques can be applied
    >> successfully for these old CPUs, i.e. SSA.
    >>
    >> I know GCC has backends for some older architectures, but these do
    >> weird gymnastics such as implementing a virtual CPU in RTL and then
    >> lowering further.
    >
    > There is, it seems, an LLVM backend for Z80:
    > https://github.com/jacobly0/llvm-project
    > (see the 'z80' branch)
    > It appears TI calculators are the main use case.

    That is long dead and cannot be merged into a newer branch; they made
    a complete mess of the source tree there. That is pre-LLVM-3.

    > I don't know the current status/functionality, but it would be fun
    > to see what the various LLVM passes do to the generated code.
    >
    > It seems like there's been some work done on Rust for Z80 (and 6502):
    > https://github.com/jacobly0/llvm-project/issues/15

    OK, I just looked, and it seems something has been done there
    recently; last I looked it was dead and ancient.

  • From Philipp Klaus Krause@21:1/5 to All on Wed Oct 6 18:20:34 2021
    On 06.10.21 at 09:56, Anton Ertl wrote:
    > So eventually computer architects introduced machines with
    > general-purpose registers like the PDP-11, the VAX, and the RISCs;
    > and compiler writers developed techniques like graph colouring to
    > make good use of these architectures.
    >
    > Maybe with the increased memory and processing power available now,
    > one could do better, but given that special-purpose registers are
    > mostly a thing of the past, there has not been much research into
    > that, that I am aware of.

    See "Optimal Register Allocation in Polynomial Time". A graph-coloring
    approach that can handle irregularities well (as long as there are not
    too many registers). SDCC uses such a register allocator for some
    backends, including z80.

    Philipp

  • From Anton Ertl@21:1/5 to Philipp Klaus Krause on Fri Oct 15 07:37:30 2021
    Philipp Klaus Krause <pkk@spth.de> writes:
    See "Optimal Register Allocation in Polynomial Time". A graph-coloring >approach that can handle irregularities well (as long as there are not
    too many registers). SDCC uses such a register allocator for some
    backends, including z80.

    Interesting paper. I had trouble following the theoretical part, but
    looking at the practical part, and taking the part of the theory that
    I understood, you try out all possible assignments of all registers
    to the maximum clique (live ranges that are alive at the same time),
    and because the number of registers is a constant (for a given
    architecture), the number of assignments is polynomial.

    You discuss splitting live ranges at control-flow-graph boundaries,
    but it seems to me that in some cases you want to split live ranges
    between instructions within a control-flow node; this does not change
    the complexity-theoretic result, but it does of course change the
    implementation.

    If only the maximum clique size plays a role, it's as if the
    assignments to the cliques are independent; but then you have to
    reconcile the assignments for different cliques by introducing reg-reg
    move instructions, and take these costs into account, and optimize
    them, too; so I don't see that the cliques can be treated as
    independent. And you probably don't, because there are additional
    factors in the complexity. In any case, I am missing something, and
    maybe you have an idea what it is.

    I am wondering about one thing in the empirical results in your paper:
    Why is the code size not monotonically falling with increased numbers
    of assignments? Are these independent runs with different
    (pseudo-random) assignments?

    I had some questions which were mostly answered by the paper, but
    maybe you can offer additional insights:

    * Am I right that earlier register allocators were bad for irregular
    register sets, and that's why general-purpose registers won once
    compilers became dominant? Why did general-purpose registers become
    dominant?

    The advantage of general-purpose registers is given in that paper as
    reducing the complexity of the algorithm by a factor of:

    (2*(tree-width(G)+1)*#registers)!

    E.g., for tree-width 2 and 9 registers, it's 54! ≈ 2.30843697*10^71.

    Which poses the question: In your empirical work you stopped at 10^8
    assignments (in some cases, fewer). How did you get provably optimal
    assignments on the Z80 with its 9 registers?

    * What are the key points that let your work deal with irregular
    register sets, while earlier approaches were pretty bad at that?

    It seems to me that you use the CPU power available now to try out
    many different assignments, while earlier work has balked at that.

    * Do you have any idea why no good approach for dealing with irregular
    register sets was found in, say, the 1970s and 1980s, when irregular
    register sets were more common (e.g. on the Z80 and the 8086)?

    Your approach is an (ideally exhaustive) search that uses more CPU
    power (and memory?) than was available then. At the time, one would
    have resorted to heuristics, but apparently no general effective
    heuristics have been found.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

  • From Philipp Klaus Krause@21:1/5 to All on Mon Oct 18 08:35:34 2021
    On 15.10.21 at 09:37, Anton Ertl wrote:
    > Philipp Klaus Krause <pkk@spth.de> writes:
    >
    > I am wondering about one thing in the empirical results in your paper:
    > Why is the code size not monotonically falling with increased numbers
    > of assignments? Are these independent runs with different
    > (pseudo-random) assignments?

    The basic idea is that for some subsets of nodes in the control-flow
    graph, we try all possible assignments to registers. The
    tree-decomposition tells us which subsets those are, and how we can
    assemble the locally optimal solutions into globally optimal ones.

    Since the cost function gives us the real costs from code generation,
    we are optimal w.r.t. the optimization goal, e.g. code size. However,
    there are three reasons why code size might not fall monotonically
    with an increased number of tried assignments:

    1) If the peephole optimizer is active: The register allocator can
    only consider code size after code generation; it doesn't know what
    the peephole optimizer might do with the result.

    2) The register allocator considers variables (or parts of variables)
    as assigned to certain registers, or spilt. It doesn't know where
    spilt variables will end up on the stack. It can thus consider the
    benefits of coalescing for variables assigned to registers, but not
    for spilt variables.

    3) We know that the number of assignments we need to consider to get
    a provably optimal result is polynomial if the input is a structured
    program. But the polynomial bound might be too big for practical
    compilation, so we limit the number of assignments considered (via
    the --max-allocs-per-node parameter to SDCC). In that case the
    allocator cannot give a provably optimal result. In some cases, it
    needs to decide which assignments to consider further (and does so
    based on incomplete information, using a heuristic). Here, considering
    more assignments can lead to a certain assignment considered
    previously no longer being considered. E.g. when going from
    --max-allocs-per-node 10 to --max-allocs-per-node 100, we will now
    consider 100 assignments instead of 10. But not all of the previously
    considered 10 might make it into the 100 considered now. And in the
    end it might turn out that one of the 10 might have given us a better
    result globally. The decision which assignments to consider further
    is not random, but based on a heuristic that mostly takes local
    information into account.

    Philipp

  • From Philipp Klaus Krause@21:1/5 to All on Mon Oct 18 09:17:34 2021
    On 15.10.21 at 09:37, Anton Ertl wrote:


    > Which poses the question: In your empirical work you stopped at 10^8
    > assignments (in some cases, fewer). How did you get provably optimal
    > assignments on the Z80 with its 9 registers?


    On the one hand, we have the theoretical bound on the number of
    assignments, which is useful for proving that we can be optimal in
    polynomial time.

    On the other hand, getting a provably optimal result when compiling
    an individual function is easier to achieve, as the theoretical bound
    is a worst case.

    * In real programs, most bags in the tree-decomposition will be
    smaller than the tree-width allows.
    * In real programs, there will be parts of the control-flow graph, where
    the number of variables alive is less than (tw(G) + 1)*r.
    * In the implementation, we can throw away some assignments early,
    without sacrificing optimality, when we know that code generation for
    them would be impossible (e.g. in the z80 port, unlike the stm8 port,
    code generation cannot handle variables that have some of their bytes
    allocated to registers, but other bytes spilt).

    In ports of SDCC that use this register allocator (which now is most
    of them), when extra comments in the generated assembly code are
    enabled via --fverbose-asm, SDCC will place a comment at the beginning
    of each function stating whether the register allocation was provably
    optimal.
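
    For example (an illustrative command line; foo.c is a placeholder,
    and a useful search budget depends on the function being compiled):

        sdcc -mz80 --fverbose-asm --max-allocs-per-node 100000 foo.c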

    Philipp

  • From Philipp Klaus Krause@21:1/5 to All on Mon Oct 18 08:56:58 2021
    On 15.10.21 at 09:37, Anton Ertl wrote:

    > I had some questions which were mostly answered by the paper, but
    > maybe you can offer additional insights:
    >
    > * Am I right that earlier register allocators were bad for irregular
    > register sets, and that's why general-purpose registers won once
    > compilers became dominant? Why did general-purpose registers become
    > dominant?

    Indeed, earlier register allocators could not handle irregular
    architectures well. In particular, Chaitin-style graph-coloring
    register allocators are simple, and efficient if we have enough
    general-purpose registers. Chaitin-style register allocators and
    RISC-style architectures are a good match.

    > * What are the key points that let your work deal with irregular
    > register sets, while earlier approaches were pretty bad at that?
    >
    > It seems to me that you use the CPU power available now to try out
    > many different assignments, while earlier work has balked at that.
    >
    > * Do you have any idea why no good approach for dealing with irregular
    > register sets was found in, say, the 1970s and 1980s, when irregular
    > register sets were more common (e.g. on the Z80 and the 8086)?
    >
    > Your approach is an (ideally exhaustive) search that uses more CPU
    > power (and memory?) than was available then. At the time, one would
    > have resorted to heuristics, but apparently no general effective
    > heuristics have been found.

    Indeed, my approach uses more CPU power and memory than was available
    then. There is another, newer approach by Roberto Castañeda Lozano
    that claims to handle irregularities well, but it also uses more CPU
    time and memory than was available in the past.

    Besides the CPU power and memory, both my approach and his build on
    theoretical advances that simply weren't there back then. I use
    tree-decompositions. While the basic idea is there in Wagner's work
    on S-functions in the 1970s, it did not get applied to register
    allocation until Thorup's and Bodlaender's works in 1998. Those two
    works considered a quite abstract version of register allocation
    (graph coloring with all variables the same size). Thorup also
    offered, for the first time, a practical way to get
    tree-decompositions of control-flow graphs (there are flaws in what
    he did, but it was still good enough for me to build on, then).

    Philipp

  • From gah4@21:1/5 to Philipp Klaus Krause on Thu Oct 21 21:53:48 2021
    On Thursday, October 21, 2021 at 5:08:46 PM UTC-7, Philipp Klaus Krause wrote:

    (snip)

    > On the one hand, we have the theoretical bound on the number of
    > assignments, which is useful for proving that we can be optimal in
    > polynomial time.
    >
    > On the other hand, getting a provably optimal result when compiling
    > an individual function is easier to achieve, as the theoretical
    > bound is a worst case.

    This is reminding me that early Fortran compilers had a FREQUENCY
    statement, (optionally) telling the compiler the relative probability
    of branching for IF statements, and the estimated iterations for DO
    loops. It was removed before the first standard in 1966.

    The above methods might be fine on a small scale, but for more global
    optimization you need the relative probabilities. I suspect that
    everyone assumes equal probabilities for everything.

    I suspect that there is no interest in bringing FREQUENCY back to
    Fortran, or any other language, though.
    [Legend says that in at least one compiler, FREQUENCY was implemented
    backward and nobody noticed. -John]

  • From Kaz Kylheku@21:1/5 to gah4@u.washington.edu on Fri Oct 22 17:28:19 2021
    On 2021-10-22, gah4 <gah4@u.washington.edu> wrote:
    > I suspect that there is no interest in bringing FREQUENCY back to
    > Fortran, or any other language, though.

    GNU C has built-in primitives for expressing the likelihood of branches
    being taken: __builtin_expect and __builtin_expect_with_probability.

    Plus other builtins related to optimization such as that you can request
    cache prefetch with __builtin_prefetch.
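
    For instance (a small sketch; the function, the pointer test, and the
    prefetch distance are made up for illustration, but the builtins are
    the documented GCC ones):

        #include <stddef.h>

        extern void handle_error(void);

        long sum(const long *p, size_t n)
        {
            /* Tell GCC the error path is unlikely, so the common
               path becomes the straight-line fall-through code. */
            if (__builtin_expect(p == NULL, 0)) {
                handle_error();
                return -1;
            }

            long s = 0;
            for (size_t i = 0; i < n; i++) {
                if (i + 16 < n)
                    __builtin_prefetch(&p[i + 16]);  /* fetch ahead */
                s += p[i];
            }
            return s;
        }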

    Goto labels can have attributes which indicate the likelihood of their
    use:

    again:
    /* we jump back here all the freakin' time */
    __attribute__((hot));

    ...

    err:
    /* oh crap! this happens, though rarely */
    __attribute__((cold));


    These things are smartly out of the way in a protected namespace,
    so you can easily use macros to write code that takes advantage of
    these things yet remains portable.
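
    A common pattern along those lines (a sketch; the macro names are a
    widespread convention, not anything standard):

        #if defined(__GNUC__)
        #define LIKELY(x)   __builtin_expect(!!(x), 1)
        #define UNLIKELY(x) __builtin_expect(!!(x), 0)
        #else
        /* Other compilers just see the plain condition. */
        #define LIKELY(x)   (x)
        #define UNLIKELY(x) (x)
        #endif

        /* usage: if (UNLIKELY(fd < 0)) goto err; */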

    > [Legend says that in at least one compiler, FREQUENCY was implemented
    > backward and nobody noticed. -John]

    The problem is that programmers sometimes optimize things just as a
    fun chrome-plating exercise, and not as a change to the code
    accompanied by a unit test which somehow proves that the change had
    the required effect.

    Such a test is hard to do and annoying; you can't test absolute
    speeds of anything, because machines change. The test case may have
    to locally duplicate and preserve the old version of the entire
    module of code, and compare its performance to the new version. Then,
    if things change in that code, that test is going to be annoying to
    maintain: its reference now has to be a fictitious old version of the
    code that never existed, one without the optimization, so that you
    can keep proving that the optimization provides a testable benefit.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From dave_thompson_2@comcast.net@21:1/5 to Anton Ertl on Sun Nov 14 15:04:37 2021
    On Wed, 06 Oct 2021 07:56:59 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:
    ...
    > As for the back-end, it seems to me that the major problem with the
    > Z80 is that it does not have general-purpose registers; instead, many
    > instructions deal with specific registers. Many early architectures
    > were like that, and assembly programmers could puzzle out good
    > register assignments, but compilers were not particularly good at it.
    > So eventually computer architects introduced machines with
    > general-purpose registers like the PDP-11, the VAX, and the RISCs; ...

    Eventually? The PDP-11 came six years before the Z-80, and six years
    before that both the PDP-6 and S/360 had 16 GPRs (and none 'wasted'
    as PC, SP, or FP).

    S/360 and the PDP-11 did have separate floating-point registers, and
    at least on the latter they were optional. (I believe there were 360
    models listed without FP, but I heard that actual instances were
    about as rare as a PDP-6 without the 'option' for 0-17 to be
    registers instead of core.)
    [Floating point was optional on the low-end 360/22, /25, /30, and /40.
    Considering what they were used for and how slow the FP was, e.g., on
    the /30 floating add was over 50us and multiply up to 400us, I expect
    a lot of them skipped the floating point. Larger models all had it.
    -John]
