• Re: Modern compilers for ye olde architectures

    From David Brown@21:1/5 to Luke A. Guest on Tue Oct 5 19:59:50 2021
    On 05/10/2021 14:22, Luke A. Guest wrote:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.
    >
    > I know GCC has backends for some older architectures, but these do
    > weird gymnastics such as implementing a virtual CPU in RTL and then
    > lowering further.


    Certainly modern techniques /can/ be applied to a Z80 compiler. And the
    Z80 is a lot more powerful than many 8-bit CISC microcontrollers.

    As far as I know, the only 8-bit target for gcc (in the mainline at
    least) is the AVR. This is a RISC processor, with 32 8-bit registers,
    and a much more orthogonal architecture. gcc does do some "weird
    gymnastics", however - it allocates registers in pairs so that it can
    pretend to be 16-bit, and then code gets simplified in peephole passes.
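
    As a rough illustration of that pairing (a sketch, not actual avr-gcc
    output; the exact instructions depend on the gcc version and options),
    a 16-bit addition in C comes out as an 8-bit add plus an
    add-with-carry on the paired registers:

        #include <stdint.h>

        uint16_t add16(uint16_t a, uint16_t b)
        {
            return a + b;   /* avr-gcc typically emits something like:
                                 add r24, r22   ; low bytes
                                 adc r25, r23   ; high bytes, with carry
                               with a in r25:r24 and b in r23:r22 */
        }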

    gcc also supports some 16-bit CISC devices.

    For the Z80, however, there are a number of compiler options. There are
    some commercial toolchains, such as HiSoft (IIRC), and there is the
    "Rabbit" microcontroller which is a derivative of the Z80. It is
    supported by a compiler for a somewhat extended version of C. Such
    compilers are probably quite old, however, and might not be easily
    available.

    Your best bet is probably SDCC. This is a multi-target open source
    compiler that is in regular development, and aimed specifically for
    small CISC microcontroller cores. The Z80 is one of its targets.

  • From Luke A. Guest@21:1/5 to All on Tue Oct 5 13:22:47 2021
    Hi,

    I have a Z80 project in mind and would like to build a compiler for a
    Z80. I was wondering if modern backend techniques can be applied
    successfully for these old CPUs, i.e. SSA.

    I know GCC has backends for some older architectures, but these do
    weird gymnastics such as implementing a virtual CPU in RTL and then
    lowering further.

    Thanks,
    Luke.

  • From Hans-Peter Diettrich@21:1/5 to Luke A. Guest on Tue Oct 5 22:12:46 2021
    On 10/5/21 2:22 PM, Luke A. Guest wrote:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.
    >
    > I know GCC has backends for some older architectures, but these do
    > weird gymnastics such as implementing a virtual CPU in RTL and then
    > lowering further.

    I wonder what's the purpose of such compilers. Let a Z80 emulate some
    supported CPU and be happy :-)

    DoDi
    [Maybe it's retrocomputing, maybe it's some device with an embedded Z80.
    If that's all the computing you need, why pay for more? -John]

  • From Derek Jones@21:1/5 to All on Wed Oct 6 01:26:55 2021
    Luke,

    > Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.

    Modern, as in invented in 1988.

    Modern, as in post-1985 techniques require lots of memory, so they
    are of no use if you plan to host your compiler on a Z80 itself.
    Otherwise, your next problem is mapping techniques that are designed
    to work well with orthogonal architectures (which the Z80 certainly
    is not).
    [I believe the plan is to cross-compile with Z80 as the target. -John]

  • From Anton Ertl@21:1/5 to Luke A. Guest on Wed Oct 6 07:56:59 2021
    "Luke A. Guest" <laguest@archeia.com> writes:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.

    SSA can certainly be used in a compiler for the Z80, but SSA is not a
    back-end technique; it's a way to represent data flow in the
    intermediate representation.
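
    As a minimal illustration of the idea (a sketch, not the IR of any
    particular compiler): in SSA form every variable is assigned exactly
    once, so each redefinition gets a fresh name, which makes the data
    flow explicit:

        /* source code */       /* SSA form */
        x = a + b;              x1 = a + b;
        x = x + 1;              x2 = x1 + 1;
        y = x * 2;              y1 = x2 * 2;

    Where control flow joins, a phi function selects the version that
    reaches the join, e.g. x3 = phi(x1, x2).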

    As for the back-end, it seems to me that the major problem with the
    Z80 is that it does not have general-purpose registers; instead, many instructions deal with specific registers. Many early architectures
    were like that, and assembly programmers could puzzle out good
    register assignments, but compilers were not particularly good at it.
    So eventually computer architects introduced machines with
    general-purpose registers like the PDP-11, the VAX, and the RISCs; and
    compiler writers developed techniques like graph colouring to make
    good use of these architectures.
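
    To make the graph-colouring idea concrete, here is a deliberately
    simplified C sketch (the interference graph is made up, and real
    Chaitin-style allocators add simplification, coalescing, and smarter
    spill choices): two live ranges that are alive at the same time get
    an edge, and each node then needs a register different from all of
    its already-coloured neighbours.

        #include <stdio.h>

        /* Hypothetical interference graph for 4 live ranges and 3
           registers; interf[i][j] = 1 means i and j are live at the
           same time and must not share a register. */
        enum { NVARS = 4, NREGS = 3 };
        static const int interf[NVARS][NVARS] = {
            {0, 1, 1, 0},
            {1, 0, 1, 1},
            {1, 1, 0, 0},
            {0, 1, 0, 0},
        };

        int main(void)
        {
            int reg[NVARS];
            for (int v = 0; v < NVARS; v++) {
                int used[NREGS] = {0};
                for (int u = 0; u < v; u++)  /* mark colours taken by */
                    if (interf[v][u] && reg[u] >= 0) /* neighbours    */
                        used[reg[u]] = 1;
                reg[v] = -1;                 /* -1 means spill        */
                for (int r = 0; r < NREGS; r++)
                    if (!used[r]) { reg[v] = r; break; }
                if (reg[v] < 0)
                    printf("v%d: spill to memory\n", v);
                else
                    printf("v%d -> r%d\n", v, reg[v]);
            }
            return 0;
        }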

    Maybe with the increased memory and processing power available now,
    one could do better, but given that special-purpose registers are
    mostly a thing of the past, there has not been much research into
    that, that I am aware of. Maybe the puzzle-solving approach to
    register allocation can help, but I am not familiar enough with that
    to say for sure.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

  • From Luke A. Guest@21:1/5 to David Brown on Wed Oct 6 09:00:04 2021
    On 05/10/2021 18:59, David Brown wrote:
    > For the Z80, however, there are a number of compiler options. ...

    I'm not talking about using an existing compiler.

    > Your best bet is probably SDCC. This is a multi-target open source
    > compiler that is in regular development, and aimed specifically for
    > small CISC microcontroller cores. The Z80 is one of its targets.

    I know of SDCC. I'm not talking about using C.

    On 06/10/2021 01:26, Derek Jones wrote:
    > Modern, as in post-1985 techniques require lots of memory, so they
    > are of no use if you plan to host your compiler on a Z80 itself.
    > Otherwise, your next problem is mapping techniques that are designed
    > to work well with orthogonal architectures (which the Z80 certainly
    > is not).
    > [I believe the plan is to cross-compile with Z80 as the target. -John]

    I'm not talking about running a compiler on a Z80, just targeting one.
    [Perhaps you could give us more hints about what you're doing so we
    can offer more useful answers. -John]

  • From Theo@21:1/5 to Luke A. Guest on Wed Oct 6 10:36:29 2021
    Luke A. Guest <laguest@archeia.com> wrote:
    > I have a Z80 project in mind and would like to build a compiler for
    > a Z80. I was wondering if modern backend techniques can be applied
    > successfully for these old CPUs, i.e. SSA.
    >
    > I know GCC has backends for some older architectures, but these do
    > weird gymnastics such as implementing a virtual CPU in RTL and then
    > lowering further.

    There is, it seems, an LLVM backend for Z80: https://github.com/jacobly0/llvm-project
    (see the 'z80' branch)
    It appears TI calculators are the main use case.

    I don't know the current status/functionality, but it would be fun to see
    what the various LLVM passes do to the generated code.

    It seems like there's been some work done on Rust for Z80 (and 6502): https://github.com/jacobly0/llvm-project/issues/15

    Theo

  • From Luke A. Guest@21:1/5 to Theo on Wed Oct 6 16:20:53 2021
    On 06/10/2021 10:36, Theo wrote:
    > Luke A. Guest <laguest@archeia.com> wrote:
    >> I have a Z80 project in mind and would like to build a compiler for
    >> a Z80. I was wondering if modern backend techniques can be applied
    >> successfully for these old CPUs, i.e. SSA.
    >>
    >> I know GCC has backends for some older architectures, but these do
    >> weird gymnastics such as implementing a virtual CPU in RTL and then
    >> lowering further.
    >
    > There is, it seems, an LLVM backend for Z80:
    > https://github.com/jacobly0/llvm-project
    > (see the 'z80' branch)
    > It appears TI calculators are the main use case.

    That is long dead and cannot be merged into a newer branch; they made
    a complete mess of the source tree there. That is pre-LLVM-3.

    > I don't know the current status/functionality, but it would be fun
    > to see what the various LLVM passes do to the generated code.
    >
    > It seems like there's been some work done on Rust for Z80 (and 6502):
    > https://github.com/jacobly0/llvm-project/issues/15

    OK, I just looked, and it seems something has been done there
    recently; last I looked it was dead and ancient.

  • From Philipp Klaus Krause@21:1/5 to All on Wed Oct 6 18:20:34 2021
    On 06.10.21 at 09:56, Anton Ertl wrote:
    > So eventually computer architects introduced machines with
    > general-purpose registers like the PDP-11, the VAX, and the RISCs;
    > and compiler writers developed techniques like graph colouring to
    > make good use of these architectures.
    >
    > Maybe with the increased memory and processing power available now,
    > one could do better, but given that special-purpose registers are
    > mostly a thing of the past, there has not been much research into
    > that, that I am aware of.

    See "Optimal Register Allocation in Polynomial Time". A graph-coloring
    approach that can handle irregularities well (as long as there are not
    too many registers). SDCC uses such a register allocator for some
    backends, including z80.

    Philipp

  • From Anton Ertl@21:1/5 to Philipp Klaus Krause on Fri Oct 15 07:37:30 2021
    Philipp Klaus Krause <pkk@spth.de> writes:
    See "Optimal Register Allocation in Polynomial Time". A graph-coloring >approach that can handle irregularities well (as long as there are not
    too many registers). SDCC uses such a register allocator for some
    backends, including z80.

    Interesting paper. I had trouble following the theoretical part, but
    looking at the practical part, and taking the part of the theory that
    I understood, you try out all possible assignments of all registers
    to the maximum clique (live ranges that are alive at the same time),
    and because the number of registers is a constant (for a given
    architecture), the number of assignments is polynomial.

    You discuss splitting live ranges at control-flow-graph boundaries,
    but it seems to me that in some cases you want to split live ranges
    between instructions within a control-flow node; this does not change
    the complexity-theoretic result, but it does of course change the
    implementation.

    If only the maximum clique size plays a role, it's as if the
    assignments to the cliques are independent; but then you have to
    reconcile the assignments for different cliques by introducing reg-reg
    move instructions, and take these costs into account, and optimize
    them, too; so I don't see that the cliques can be treated as
    independent. And you probably don't, because there are additional
    factors in the complexity. In any case, I am missing something, and
    maybe you have an idea what it is.

    I am wondering about one thing in the empirical results in your paper:
    Why is the code size not monotonically falling with increased numbers
    of assignments? Are these independent runs with different
    (pseudo-random) assignments?

    I had some questions which were mostly answered by the paper, but
    maybe you can offer additional insights:

    * Am I right that earlier register allocators were bad for irregular
    register sets, and that's why general-purpose registers won once
    compilers became dominant? Why did general-purpose registers become
    dominant?

    The advantage of general-purpose registers is given in that paper as
    reducing the complexity of the algorithm by a factor of:

    (2*(tree-width(G)+1)*#registers)!

    E.g., for tree-width 2 and 9 registers, it's 54! ≈ 2.30843697*10^71.

    Which poses the question: In your empirical work you stopped at 10^8
    assignments (in some cases, fewer). How did you get provably optimal
    assignments on the Z80 with its 9 registers?

    * What are the key points that let your work deal with irregular
    register sets, while earlier approaches were pretty bad at that?

    It seems to me that you use the CPU power available now to try out
    many different assignments, while earlier work has balked at that.

    * Do you have any idea why no good approach for dealing with irregular
    register sets was found in, say, the 1970s and 1980s, when irregular
    register sets were more common (e.g. on the Z80 and the 8086)?

    Your approach is an (ideally exhaustive) search that uses more CPU
    power (and memory?) than was available then. At the time, one would
    have resorted to heuristics, but apparently no general effective
    heuristics have been found.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

  • From Philipp Klaus Krause@21:1/5 to All on Mon Oct 18 08:35:34 2021
    On 15.10.21 at 09:37, Anton Ertl wrote:
    > Philipp Klaus Krause <pkk@spth.de> writes:
    >
    > I am wondering about one thing in the empirical results in your paper:
    > Why is the code size not monotonically falling with increased numbers
    > of assignments? Are these independent runs with different
    > (pseudo-random) assignments?

    The basic idea is that for some subsets of nodes in the control-flow
    graph, we try all possible assignments to registers. The
    tree-decomposition tells us which subsets those are, and how we can
    assemble the locally optimal solutions into globally optimal ones.

    Since the cost function gives us the real costs from code generation,
    we are optimal w.r.t. the optimization goal, e.g. code size. However,
    there are three reasons why code size might not fall monotonically
    with an increased number of tried assignments:

    1) If the peephole optimizer is active: The register allocator can
    only consider code size after code generation; it doesn't know what
    the peephole optimizer might do with the result.

    2) The register allocator considers variables (or parts of variables)
    as assigned to certain registers, or spilt. It doesn't know where
    spilt variables will end up on the stack. It can thus consider the
    benefits of coalescing for variables assigned to registers, but not
    for spilt variables.

    3) We know that the number of assignments we need to consider to get
    a provably optimal result is polynomial if the input is a structured
    program. But the polynomial bound might be too big for practical
    compilation, so we limit the number of assignments considered (via
    the --max-allocs-per-node parameter to SDCC). In that case the
    allocator cannot give a provably optimal result. In some cases, it
    needs to decide which assignments to consider further (and does so
    based on incomplete information, using a heuristic). Here, considering
    more assignments can lead to a certain assignment considered
    previously no longer being considered. E.g. when going from
    --max-allocs-per-node 10 to --max-allocs-per-node 100, we will now
    consider 100 assignments instead of 10. But not all of the previously
    considered 10 might make it into the 100 considered now. And in the
    end it might turn out that one of the 10 might have given us a better
    result globally. The decision which assignments to consider further
    is not random, but based on a heuristic that mostly takes local
    information into account.

    Philipp

  • From Philipp Klaus Krause@21:1/5 to All on Mon Oct 18 09:17:34 2021
    On 15.10.21 at 09:37, Anton Ertl wrote:


    > Which poses the question: In your empirical work you stopped at 10^8
    > assignments (in some cases, fewer). How did you get provably optimal
    > assignments on the Z80 with its 9 registers?


    On the one hand, we have the theoretical bound on the number of
    assignments, which is useful for proving that we can be optimal in
    polynomial time.

    On the other hand, getting a provably optimal result when compiling
    an individual function is easier to achieve, as the theoretical bound
    is a worst case.

    * In real programs, most bags in the tree-decomposition will be
    smaller than the tree-width allows.
    * In real programs, there will be parts of the control-flow graph, where
    the number of variables alive is less than (tw(G) + 1)*r.
    * In the implementation, we can throw away some assignments early,
    without sacrificing optimality, when we know that code generation for
    them would be impossible (e.g. in the z80 port, unlike the stm8 port,
    code generation cannot handle variables that have some of their bytes
    allocated to registers, but other bytes spilt).

    In ports of SDCC that use this register allocator (which now is most
    of them), when extra comments in the generated assembly code are
    enabled via --fverbose-asm, SDCC will place a comment at the beginning
    of each function stating whether the register allocation was provably
    optimal.
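
    For example (an illustrative command line; foo.c is a placeholder,
    and a useful search budget depends on the function being compiled):

        sdcc -mz80 --fverbose-asm --max-allocs-per-node 100000 foo.c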

    Philipp

  • From Philipp Klaus Krause@21:1/5 to All on Mon Oct 18 08:56:58 2021
    On 15.10.21 at 09:37, Anton Ertl wrote:

    > I had some questions which were mostly answered by the paper, but
    > maybe you can offer additional insights:
    >
    > * Am I right that earlier register allocators were bad for irregular
    > register sets, and that's why general-purpose registers won once
    > compilers became dominant? Why did general-purpose registers become
    > dominant?

    Indeed, earlier register allocators could not handle irregular
    architectures well. In particular, Chaitin-style graph-coloring
    register allocators are simple, and efficient if we have enough
    general-purpose registers. Chaitin-style register allocators and
    RISC-style architectures are a good match.

    > * What are the key points that let your work deal with irregular
    > register sets, while earlier approaches were pretty bad at that?
    >
    > It seems to me that you use the CPU power available now to try out
    > many different assignments, while earlier work has balked at that.
    >
    > * Do you have any idea why no good approach for dealing with irregular
    > register sets was found in, say, the 1970s and 1980s, when irregular
    > register sets were more common (e.g. on the Z80 and the 8086)?
    >
    > Your approach is an (ideally exhaustive) search that uses more CPU
    > power (and memory?) than was available then. At the time, one would
    > have resorted to heuristics, but apparently no general effective
    > heuristics have been found.

    Indeed, my approach uses more CPU power and memory than was available
    then. There is another, newer approach by Roberto Castañeda Lozano
    that claims to handle irregularities well, but it also uses more CPU
    time and memory than was available in the past.

    Besides the CPU power and memory, both my approach and his build on
    theoretical advances that simply weren't there back then. I use
    tree-decompositions. While the basic idea is there in Wagner's work
    on S-functions in the 1970s, it did not get applied to register
    allocation until Thorup's and Bodlaender's works in 1998. Those two
    works considered a quite abstract version of register allocation
    (graph coloring with all variables the same size). Thorup also
    offered, for the first time, a practical way to get
    tree-decompositions of control-flow graphs (there are flaws in what
    he did, but it was still good enough for me to build on, then).

    Philipp

  • From gah4@21:1/5 to Philipp Klaus Krause on Thu Oct 21 21:53:48 2021
    On Thursday, October 21, 2021 at 5:08:46 PM UTC-7, Philipp Klaus Krause wrote:

    (snip)

    > On the one hand, we have the theoretical bound on the number of
    > assignments, which is useful for proving that we can be optimal in
    > polynomial time.
    >
    > On the other hand, getting a provably optimal result when compiling
    > an individual function is easier to achieve, as the theoretical
    > bound is a worst case.

    This is reminding me that early Fortran compilers had a FREQUENCY
    statement, (optionally) telling the compiler the relative probability
    of branching for IF statements, and the estimated iterations for DO
    loops. It was removed before the first standard in 1966.

    The above methods might be fine on a small scale, but for more global
    optimization you need the relative probabilities. I suspect that
    everyone assumes equal probabilities for everything.

    I suspect that there is no interest in bringing FREQUENCY back to
    Fortran, or any other language, though.
    [Legend says that in at least one compiler, FREQUENCY was implemented
    backward and nobody noticed. -John]

  • From Kaz Kylheku@21:1/5 to gah4@u.washington.edu on Fri Oct 22 17:28:19 2021
    On 2021-10-22, gah4 <gah4@u.washington.edu> wrote:
    > I suspect that there is no interest in bringing FREQUENCY back to
    > Fortran, or any other language, though.

    GNU C has built-in primitives for expressing the likelihood of branches
    being taken: __builtin_expect and __builtin_expect_with_probability.

    Plus other builtins related to optimization such as that you can request
    cache prefetch with __builtin_prefetch.
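
    For instance (a small sketch; the function, the pointer test, and the
    prefetch distance are made up for illustration, but the builtins are
    the documented GCC ones):

        #include <stddef.h>

        extern void handle_error(void);

        long sum(const long *p, size_t n)
        {
            /* Tell GCC the error path is unlikely, so the common
               path becomes the straight-line fall-through code. */
            if (__builtin_expect(p == NULL, 0)) {
                handle_error();
                return -1;
            }

            long s = 0;
            for (size_t i = 0; i < n; i++) {
                if (i + 16 < n)
                    __builtin_prefetch(&p[i + 16]);  /* fetch ahead */
                s += p[i];
            }
            return s;
        }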

    Goto labels can have attributes which indicate the likelihood of their
    use:

    again:
    /* we jump back here all the freakin' time */
    __attribute__((hot));

    ...

    err:
    /* oh crap! this happens, though rarely */
    __attribute__((cold));


    These things are smartly out of the way in a protected namespace,
    so you can easily use macros to write code that takes advantage of
    these things yet remains portable.
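
    A common pattern along those lines (a sketch; the macro names are a
    widespread convention, not anything standard):

        #if defined(__GNUC__)
        #define LIKELY(x)   __builtin_expect(!!(x), 1)
        #define UNLIKELY(x) __builtin_expect(!!(x), 0)
        #else
        /* Other compilers just see the plain condition. */
        #define LIKELY(x)   (x)
        #define UNLIKELY(x) (x)
        #endif

        /* usage: if (UNLIKELY(fd < 0)) goto err; */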

    > [Legend says that in at least one compiler, FREQUENCY was implemented
    > backward and nobody noticed. -John]

    The problem is that programmers sometimes optimize things just as a
    fun chrome-plating exercise, and not as a change to the code
    accompanied by a unit test which somehow proves that the change had
    the required effect.

    Such a test is hard to do and annoying; you can't test absolute
    speeds of anything, because machines change. The test case may have
    to locally duplicate and preserve the old version of the entire
    module of code, and compare its performance to the new version. Then,
    if things change in that code, that test is going to be annoying to
    maintain: its reference now has to be a fictitious old version of the
    code that never existed, one without the optimization, so that you
    can keep proving that the optimization provides a testable benefit.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From dave_thompson_2@comcast.net@21:1/5 to Anton Ertl on Sun Nov 14 15:04:37 2021
    On Wed, 06 Oct 2021 07:56:59 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:
    ...
    > As for the back-end, it seems to me that the major problem with the
    > Z80 is that it does not have general-purpose registers; instead, many
    > instructions deal with specific registers. Many early architectures
    > were like that, and assembly programmers could puzzle out good
    > register assignments, but compilers were not particularly good at it.
    > So eventually computer architects introduced machines with
    > general-purpose registers like the PDP-11, the VAX, and the RISCs; ...

    Eventually? The PDP-11 came six years before the Z-80, and six years
    before that both the PDP-6 and S/360 had 16 GPRs (and none 'wasted'
    as PC, SP, or FP).

    S/360 and the PDP-11 did have separate floating-point registers, and
    at least on the latter they were optional. (I believe there were 360
    models listed without FP, but I heard that actual instances were
    about as rare as a PDP-6 without the 'option' for 0-17 to be
    registers instead of core.)
    [Floating point was optional on the low-end 360/22, /25, /30, and /40.
    Considering what they were used for and how slow the FP was, e.g., on
    the /30 floating add was over 50us and multiply up to 400us, I expect
    a lot of them skipped the floating point. Larger models all had it.
    -John]
