• New performance features in gforth-fast

    From Anton Ertl@21:1/5 to All on Sat Oct 14 11:08:59 2023
    gforth-fast has acquired two performance features this summer:

    1) Many of the ip updates are now optimized away (all architectures).

    2) On AMD64 gforth-fast can now use stack caching with up to 3
    registers (previously 1).

    For the word

    : cubed dup dup * * ;

    this results in the following differences in the resulting code:

                       Before                without ip updates    and with 3 regs
    $7F75EC8FB240 dup
                       add $0x8,%rbx
                       mov %r8,0x0(%r13)     mov %r8,0x0(%r13)     mov %r8,%r15
                       sub $0x8,%r13         sub $0x8,%r13
    $7F75EC8FB248 dup
                       add $0x8,%rbx
                       mov %r8,0x0(%r13)     mov %r8,0x0(%r13)
                       sub $0x8,%r13         sub $0x8,%r13         mov %r15,%r9
    $7F75EC8FB250 *
                       add $0x8,%rbx
                       imul 0x8(%r13),%r8    imul 0x8(%r13),%r8    imul %r9,%r15
                       add $0x8,%r13         add $0x8,%r13
    $7F75EC8FB258 *
                       add $0x8,%rbx
                       imul 0x8(%r13),%r8    imul 0x8(%r13),%r8    imul %r15,%r8
                       add $0x8,%r13         add $0x8,%r13
    $7F75EC8FB260 ;s
                       mov (%r14),%rbx       mov (%r14),%rbx       mov (%r14),%rbx
                       add $0x8,%r14         add $0x8,%r14         add $0x8,%r14
                       mov (%rbx),%rax       mov (%rbx),%rax       mov (%rbx),%rax
                       jmp *%rax             jmp *%rax             jmp *%rax

    (Actually, the real Before variant used a different register
    allocation, but the same number of instructions; the version shown
    here is produced by the new engine with the optimizations disabled.)

    Here's a comparison with some other Forth systems on AMD64:

    gforth-fast        iforth          SwiftForth x64     VFX Forth 64
    mov %r8,%r15       pop rbx         -8 [RBP] RBP LEA   MOV RDX, RBX
    mov %r15,%r9       mov rdi, rbx    RBX 0 [RBP] MOV    IMUL RBX, RDX
    imul %r9,%r15      imul rdi, rbx   -8 [RBP] RBP LEA   IMUL RBX, RDX
    imul %r15,%r8      imul rbx, rdi   RBX 0 [RBP] MOV    RET/NEXT
    mov (%r14),%rbx    push rbx        0 [RBP] RAX MOV
    add $0x8,%r14      ;               RBX MUL
    mov (%rbx),%rax                    RAX RBX MOV
    jmp *%rax                          8 [RBP] RBP LEA
                                       0 [RBP] RAX MOV
                                       RBX MUL
                                       RAX RBX MOV
                                       8 [RBP] RBP LEA
                                       RET

    1) Optimize ip updates:

    At its heart, gforth (including gforth-fast) is still a threaded-code
    system and falls back to threaded code when needed; in particular,
    its control flow works through the threaded-code mechanism. E.g., the
    ;S in the example above loads the threaded-code address of the next
    (primitive) word in the caller and performs a direct-threaded
    dispatch. Immediate arguments (e.g., for literals) are also accessed
    through the threaded-code instruction pointer. Therefore Gforth
    maintains a threaded-code instruction pointer (ip).
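    The direct-threaded dispatch described above can be sketched in C
    (gforth's implementation language). This is a minimal illustration
    using ordinary function pointers rather than gforth's actual engine
    (which uses GNU C labels-as-values); all names are made up:

    ```c
    #include <stdio.h>

    typedef void *Cell;
    typedef void (*Prim)(void);

    static Cell *ip;            /* threaded-code instruction pointer */
    static long  stack[16];
    static long *sp = stack;    /* data-stack pointer, points past TOS */

    /* NEXT: fetch the next threaded-code cell and dispatch to it */
    static void next(void) { Prim p = (Prim)*ip++; p(); }

    static void lit (void) { *sp++ = (long)*ip++; next(); }  /* immediate arg via ip */
    static void dup_(void) { long x = sp[-1]; *sp++ = x; next(); }
    static void mul_(void) { long b = *--sp; sp[-1] *= b; next(); }
    static void bye (void) { }                               /* stop dispatching */

    int main(void) {
        /* threaded code for: 5 dup dup * *
           (function-pointer-to-Cell casts are not strict ISO C, but work
           on the mainstream platforms this sketch targets) */
        Cell code[] = { (Cell)lit, (Cell)5L, (Cell)dup_, (Cell)dup_,
                        (Cell)mul_, (Cell)mul_, (Cell)bye };
        ip = code;
        next();
        printf("%ld\n", sp[-1]);   /* prints 125 */
        return 0;
    }
    ```

    Note how LIT reads its operand through ip: that is why a primitive
    with immediate arguments forces the engine to keep ip current.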

    But it does not need to maintain the ip everywhere. In the CUBED
    example, no primitive uses the ip of the threaded code cell of the
    primitive, so no ip updates are necessary except for restoring the
    caller's ip at the end.

    And this is what this optimization does, in a nutshell.
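    In other words: within straight-line code, the per-primitive ip
    increments are dead unless some primitive actually reads ip. A
    hedged C sketch of the CUBED body under that observation
    (illustrative names, not gforth's actual code generator):

    ```c
    #include <stdio.h>

    typedef void *Cell;

    static Cell *ip;            /* threaded-code instruction pointer */
    static Cell **rp;           /* return-stack pointer */
    static long  st[8], *sp;

    /* Unoptimized body of CUBED: every primitive keeps ip exact. */
    static void cubed_exact(void) {
        ip++; sp[0] = sp[-1]; sp++;    /* dup */
        ip++; sp[0] = sp[-1]; sp++;    /* dup */
        ip++; sp[-2] *= sp[-1]; sp--;  /* *   */
        ip++; sp[-2] *= sp[-1]; sp--;  /* *   */
        ip = *rp++;                    /* ;s: restore caller's ip */
    }

    /* Optimized: no primitive in the body reads ip, and ;s overwrites
       it anyway, so all four increments are dead and can be dropped. */
    static void cubed_opt(void) {
        sp[0] = sp[-1]; sp++;
        sp[0] = sp[-1]; sp++;
        sp[-2] *= sp[-1]; sp--;
        sp[-2] *= sp[-1]; sp--;
        ip = *rp++;
    }

    int main(void) {
        Cell  dummy_code[8];           /* stands in for the threaded code */
        Cell *rstack[4];

        sp = st; *sp++ = 5;
        ip = dummy_code;
        rp = &rstack[4]; *--rp = NULL; /* dummy caller ip */
        cubed_exact();
        printf("%ld\n", sp[-1]);       /* prints 125 */

        sp = st; *sp++ = 5;
        rp = &rstack[4]; *--rp = NULL;
        cubed_opt();
        printf("%ld\n", sp[-1]);       /* prints 125 */
        return 0;
    }
    ```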

    This optimization is controlled with --opt-ip-updates=n, where n=0
    means no ip-update optimization, and higher n mean more optimization;
    currently the highest level is n=4 IIRC, and the highest level is the
    default.

    2) 3 registers for stack caching:

    Up until this summer I believed that we could not convince gcc to
    use caller-saved registers as additional stack-cache registers, and
    the dearth of callee-saved registers on AMD64 meant that we were
    limited to using 1 register as a stack cache (we have been using 3
    registers on ARM A64 and RISC-V for quite some time). This summer I
    got an idea on how to do it, and, with the help of Bernd Paysan, did
    it; if you want to read more about it, posting
    <23-10-001@comp.compilers> in comp.compilers (on the web:
    <https://compilers.iecc.com/comparch/article/23-10-001> or
    <http://al.howardknight.net/?ID=169728532800>) discusses the topic
    in more depth.
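    The effect of the number of stack-cache registers can be sketched in
    C; this is an illustration of the general technique, not gforth's
    actual code, and the variable names are made up:

    ```c
    #include <stdio.h>

    static long  mem[16];       /* in-memory part of the data stack */
    static long *sp = mem;

    /* State with one cache register: tos holds the top of stack, so
       DUP must spill to memory and * must reload from it. */
    static long tos;
    static void dup1(void) { *sp++ = tos; }
    static void mul1(void) { tos *= *--sp; }

    int main(void) {
        /* 5 dup dup * * with a 1-register cache: 2 stores + 2 loads */
        tos = 5;
        dup1(); dup1(); mul1(); mul1();
        printf("%ld\n", tos);       /* prints 125 */

        /* With 3 cache registers the same sequence needs no memory
           traffic; a, b, c play the roles of %r8, %r15, %r9 in the
           listing above. */
        {
            long a = 5;             /* tos                 */
            long b = a;             /* dup:  mov %r8,%r15  */
            long c = b;             /* dup:  mov %r15,%r9  */
            b *= c;                 /* *:    imul %r9,%r15 */
            a *= b;                 /* *:    imul %r15,%r8 */
            printf("%ld\n", a);     /* prints 125 */
        }
        return 0;
    }
    ```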

    Here are results for small benchmarks on a Xeon W-1370P (5.2GHz Rocket
    Lake):

    sieve bubble matrix fib fft
    0.089 0.131 0.048 0.084 0.031 gforth
    0.058 0.066 0.033 0.043 0.014 gforth-fast --ss-states=2 --opt-ip-updates=0
    0.052 0.053 0.018 0.036 0.014 gforth-fast --ss-states=2
    0.038 0.042 0.014 0.032 0.014 gforth-fast

    The new optimizations provide good speedups on Rocket Lake. Sometimes
    the ip-update optimization alone helps a lot, sometimes the
    combination of both optimizations helps a lot more (for now
    --opt-ip-updates=0 does not work with --ss-states=4, so I cannot
    completely isolate the effects of the optimizations).

    gforth (the debugging engine) does not benefit from either
    optimization, because stack caching is disabled for better stack
    underflow reporting, and ip updates are disabled in order to get
    proper backtraces in case of exceptions.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sun Oct 15 10:18:51 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    >gforth-fast has acquired two performance features this summer:
    >
    >1) Many of the ip updates are now optimized away (all architectures).
    >
    >2) On AMD64 gforth-fast can now use stack caching with up to 3
    >registers (previously 1).
    ...
    >Here are results for small benchmarks on a Xeon W-1370P (5.2GHz Rocket
    >Lake):

    I fixed the bug that prevented "gforth-fast --opt-ip-updates=0" from
    working, resulting in:

    sieve bubble matrix fib fft
    0.089 0.131 0.048 0.084 0.031 gforth
    0.058 0.066 0.033 0.043 0.014 gforth-fast --ss-states=2 --opt-ip-updates=0
    0.057 0.062 0.032 0.042 0.014 gforth-fast --opt-ip-updates=0
    0.052 0.053 0.018 0.036 0.014 gforth-fast --ss-states=2
    0.038 0.042 0.014 0.032 0.014 gforth-fast

    So using 3 registers without ip-update optimization had a small
    effect, the ip-update optimization alone had a larger effect
    (especially on matrix and fib), but for sieve, the combination of both
    had a much larger effect than one might have suspected looking at the individual effects.

    Actually, apart from fft, which does not benefit from these
    optimizations at all, every benchmark performed better with both
    optimizations on than one would expect, whether by multiplying the
    speedup factors seen for each individual optimization over the
    both-off baseline, or by subtracting both individual time savings
    from the both-off result.
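    Taking the sieve column as a worked example of that claim (the
    numbers come from the table above):

    ```c
    #include <stdio.h>

    int main(void) {
        double base = 0.058;  /* --ss-states=2 --opt-ip-updates=0 (both off) */
        double regs = 0.057;  /* --opt-ip-updates=0 (3 registers only)       */
        double ipu  = 0.052;  /* --ss-states=2 (ip-update opt only)          */
        double both = 0.038;  /* gforth-fast (both on)                       */

        /* prediction by multiplying the individual speedup factors */
        double mult = base * (regs / base) * (ipu / base);
        /* prediction by subtracting the individual time savings */
        double add  = base - (base - regs) - (base - ipu);

        printf("multiplicative prediction: %.3f\n", mult);  /* 0.051 */
        printf("additive prediction:       %.3f\n", add);   /* 0.051 */
        printf("measured with both on:     %.3f\n", both);  /* 0.038 */
        return 0;
    }
    ```

    Both predictions land around 0.051s, while the measured result with
    both optimizations on is 0.038s, i.e., clearly super-additive.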

    - anton