• How stack caching and ip-update optimization influence the primitives

    From Anton Ertl@21:1/5 to All on Mon Feb 5 12:40:00 2024
    In June 2023 gforth-fast on AMD64 imlemented only stack caching with
    one register, and every primitive or superinstruction also had an
    instruction pointer (ip) update. Since then, we managed to have up to
    three stack items in a register and eliminated most ip-updates. As an
    example, let's look at DECIMAL. It's source code is (clarified):

    : decimal ( -- )
    #10 base ! ;

    Here's what SEE-CODE DECIMAL shows for two versions:

    June 2023 November 2023
    lit 1->1 lit 1->2
    #10 #10
    mov [r10],r8 mov r14,$08[r13]
    add rbx,$10
    mov r8,-$10[rbx]
    sub r10,$08
    useraddr ! 1->1 useraddr 2->3
    #112 #112
    ! add r13,$20
    mov rax,[rbx] mov r15,$08[rsp]
    mov rdi,$08[rsp] add r15,-$08[r13]
    add r10,$08 ! 3->1
    add rbx,$18 mov [r15],r14
    mov [rdi][rax],r8
    mov r8,[r10]
    ;s 1->1 ;s 1->1
    mov rbx,$00[r13] mov rax,[r10]
    add r13,$08 add r10,$08
    add rbx,$08 mov r13,rax
    mov rcx,-$08[rbx] mov rax,$00[r13]
    jmp ecx jmp eax

    We still used the same primitives. In November Gforth's code
    generator chose not to use the "useraddr !" static superinstruction,
    because static superinstructions as currently implemented only work
    for 1->1 stack state transitions. So this static superinstruction is
    less valuable than it used to be. In the meantime, with these
    enhanced and new optimizations, we eliminated 13 superinstructions,
    leaving 36.

    In recent days we discovered that these optimizations also reduce the
    need for primitives with immediate arguments like USERADDR above. The
    November USERADDR starts with an ip update ("add r13, ..."), because
    the immediate argument requires the ip to be up-to-date. In the
    current version, USERADDR with immediate argument 112 is replaced with
    "UP@ 112 +", and here you see the two pieces of code:

    November 2023 February 2024
    lit 1->2 lit 1->2
    #10 #10
    mov r14,$08[r13] mov r15,$08[rbx]
    useraddr 2->3 up@ 2->3
    #112 mov r9,$18[rsp]
    add r13,$20 lit+ 3->3
    mov r15,$08[rsp] #112
    add r15,-$08[r13] add r9,$20[rbx]
    ! 3->1 ! 3->1
    mov [r15],r14 mov [r9],r15
    ;s 1->1 ;s 1->1
    mov rax,[r10] mov rbx,[r14]
    add r10,$08 add r14,$08
    mov r13,rax mov rax,[rbx]
    mov rax,$00[r13] jmp eax
    jmp eax

    If you delete the ip update from the start of USERADDR, you get the
    same code as for "UP@ LIT+" (if you ignore the different register
    allocation and the different offset of UP from rsp in memory.

    One other difference is that useraddr accessed its immediate argument
    with offset -8 (because ip points right after the immediate at this
    point), while LIT+ accesses its immediate argument with offset $20.
    This is because for a few frequent primitives (like LIT+) we have
    several versions for different offsets from the actual ip relative to
    the instruction, and we therefore just choose the right version of
    LIT+ rather than updating the ip in many cases, including the one in
    DECIMAL.

    With this extra optimization, UP@ LIT+ comes out ahead of USERADDR in
    this case. But even without it, they would both produce three
    instructions.

    Given that UP@ 112 + is proper Forth code that can be interpreted,
    while USERADDR with immediate argument 112 is a special-purpose
    primitive that is only useful in compiled code, UP@ 112 + looks
    preferable to me.

    So in the last few days we replaced a number of primitives having to
    do with the user pointer, the locals pointer, or the current object by sequences that usually involve UP@, LP@, or O.

    So having these optimizations also results in a cleaner set of
    primitives.

    You may also wonder why the code for ;S in the November version
    contains the register-register move that could be optimized away (and
    that is not present in the February 2024 version
    shown). Unfortunately, I have no idea. The gcc versions used were
    12.2 for the November version and 10.2 for the February version.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Mon Feb 5 15:50:30 2024
    With all due respect, is this really a suitable benchmark for the optimizer ??

    My Forth does:

    # see decimal
    C DECIMAL <137>
    0000000000401ee3 <mf_DECIMAL>:
    401ee3: 48 c7 05 82 64 02 00 0a 00 00 00 movq $0xa,0x26482(%rip) # 428370 <MFBASE>
    401eee: c3 retq

    or
    0000000000401eef <mf_HEX>:
    401eef: 48 c7 05 76 64 02 00 10 00 00 00 movq $0x10,0x26476(%rip) # 428370 <MFBASE>
    401efa: c3 retq

    I had expected gforth to compile similarly compact.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to minforth on Mon Feb 5 16:37:57 2024
    minforth@gmx.net (minforth) writes:
    With all due respect, is this really a suitable benchmark for the optimizer ??

    Benchmark? My article is about a specific point, and DECIMAL is a
    good example for demonstrating this point.

    For benchmarking I run various programs, e.g., those from <http://www.complang.tuwien.ac.at/forth/appbench.zip>.

    My Forth does:

    # see decimal
    C DECIMAL <137>
    0000000000401ee3 <mf_DECIMAL>:
    401ee3: 48 c7 05 82 64 02 00 0a 00 00 00 movq $0xa,0x26482(%rip) # 428370 <MFBASE>
    401eee: c3 retq

    Congratulations.

    I had expected gforth to compile similarly compact.

    Why? Gforth is still based on threaded code, and it shows in,
    e.g., needing threaded-code instruction pointer to access any immediate arguments.

    Looking at DECIMAL in proper native-code compilers is interesting:

    iforth:
    $10138348 : DECIMAL 488BC04883ED088F4500 H.@H.m..E. $10138352 mov rbx, #10 d# 48C7C30A000000 HGC.... $10138359 push rbx 53 S
    $1013835A lea rbx, [r15 #520 +] qword
    498D9F08020000 I...... $10138361 pop [rbx] qword 8F03 ..
    $10138363 ; 488B45004883C508FFE0 H.E.H.E..` ok

    lxf:
    80498D2 8D4620 lea eax , [esi+20h]
    80498D5 B90A000000 mov ecx , # Ah
    80498DA 8908 mov [eax] , ecx
    80498DC C3 ret near


    SwiftForth x64:
    40A392 A # 88 [RSI] MOV 48C786880000000A000000
    40A39D RET C3 ok

    VFX64:
    ( 0041CEA0 48C746300A000000 ) MOV QWord [RSI+30], # 0000000A
    ( 0041CEA8 C3 ) RET/NEXT
    ( 9 bytes, 2 instructions )

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)