• Stack caching, IP updates, and static superinstructions

    From Anton Ertl@21:1/5 to All on Thu Feb 1 08:52:59 2024
    Gforth has acquired two optimizations in the last year:

    * Stack caching now uses up to three registers on AMD64 (previously up
    to one).

    * Gforth now eliminates most threaded-code instruction-pointer (ip)
    updates.

    Gforth has combined sequences of primitives into static
    superinstructions for about 23 years, with the following benefits:

    * Stack items within a static superinstruction could be held in
    registers even if they don't fit into a 1-register stack cache; that
    benefit evaporates when we have more stack items in registers.

    * There was only one ip update for a static superinstruction; that
    benefit not only evaporates with the ip-update optimization, but
    static superinstructions that access a literal actually require an
    ip update.

    * Finally, there are benefits that are not covered by these two
    optimizations, e.g., the static superinstruction for "< ?branch" can
    communicate through the CPU flags register between the < and the
    ?branch, whereas without superinstruction the flag needs to be
    converted into its canonical Forth form (0 or -1).
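
    As an illustration of the last point, a source sequence like the
    following compiles to < followed by ?branch (IF compiles a ?branch).
    This is only a sketch; the word name below? is made up for this
    example, and whether a superinstruction is selected for it depends
    on the Gforth build:

    : below? ( n1 n2 -- ) < if ." less" else ." not less" then ;
    1 2 below?   \ prints: less
    2 1 below?   \ prints: not less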

    Because of the second reason, I have now removed all superinstructions
    that access a literal, and I have tested the result: The test code is
    (each definition with a sequence that used to form one
    superinstruction):

    : foo6 5 ! ;
    : foo15 base @ ;
    : foo19 5 @ and ;
    : foo23 base ! ;
    : foo26 5 @ + ;
    : foo31 5 f@ ;
    : foo32 5 f! ;
    : foo34 5 + @ ;
    : foo37 5 and ;
    : foo38 5 arshift ;
    : foo39 dup 5 and swap ;

    Yes, the words that use 5 as address cannot be run, but the native
    code looks the same whatever literal one uses.
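
    The words that do not use 5 as an address can be run as usual, and
    listings like the ones below can be inspected with Gforth's see-code.
    A short session sketch (it repeats two of the definitions so that it
    stands alone; the see-code output differs between Gforth versions and
    builds, and native code is only shown if a disassembler is available):

    : foo37 5 and ;
    : foo39 dup 5 and swap ;
    7 foo37 .        \ 7 and 5 = 5
    7 foo39 . .      \ prints 7 5 (top of stack first)
    see-code foo37   \ threaded code interleaved with native code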

    Here's the disassembled code (without the trailing ;s, but including
    any stack state transition before the ;s):

    with superinstruction     without superinstruction

    lit ! 1->1                lit 1->2
    #5                        #5
    !                         mov r15,$08[rbx]
    add rbx,$18               ! 2->0 0->1
    mov rax,-$10[rbx]         mov [r15],r8
    add r13,$08               mov r8,$08[r13]
    mov [rax],r8              add r13,$08
    mov r8,$00[r13]

    useraddr 1->1             useraddr 1->1
    #112                      #112
    mov $00[r13],r8           mov $00[r13],r8
    sub r13,$08               sub r13,$08
    add rbx,$10               add rbx,$10
    mov r8,$08[rsp]           mov r8,$10[rsp]
    add r8,-$08[rbx]          add r8,-$08[rbx]
    @ 1->1                    @ 1->1
    mov r8,[r8]               mov r8,[r8]

    useraddr 1->2             useraddr 1->2
    #112                      #112
    add rbx,$10               add rbx,$10
    mov r15,$10[rsp]          mov r15,$10[rsp]
    add r15,-$08[rbx]         add r15,-$08[rbx]
    ! 2->0 0->1               ! 2->0 0->1
    mov [r15],r8              mov [r15],r8
    mov r8,$08[r13]           mov r8,$08[r13]
    add r13,$08               add r13,$08

    lit@ + 1->1               lit@ 1->2
    #5                        #5
    +                         mov rax,$08[rbx]
    add rbx,$18               mov r15,[rax]
    mov rax,-$10[rbx]         + 2->1
    add r8,[rax]              add r8,r15

    lit f@ 1->1               lit 1->2
    #5                        #5
    f@                        mov r15,$08[rbx]
    add rbx,$18               f@ 2->1
    movsd [r12],xmm15         movsd [r12],xmm15
    mov rax,-$10[rbx]         movsd xmm15,[r15]
    sub r12,$08               sub r12,$08
    movsd xmm15,[rax]

    lit f! 1->1               lit 1->2
    #5                        #5
    f!                        mov r15,$08[rbx]
    add rbx,$18               f! 2->1
    mov rdx,-$10[rbx]         mov rax,r12
    mov rax,r12               movsd [r15],xmm15
    lea r12,$08[r12]          lea r12,$08[r12]
    movsd [rdx],xmm15         movsd xmm15,$08[rax]
    movsd xmm15,$08[rax]

    lit+ 1->1                 lit+ 1->1
    #5                        #5
    add r8,$08[rbx]           add r8,$08[rbx]
    @ 1->1                    @ 1->1
    mov r8,[r8]               mov r8,[r8]

    lit and 1->1              lit 1->2
    #5                        #5
    and                       mov r15,$08[rbx]
    add rbx,$18               and 2->1
    and r8,-$10[rbx]          and r8,r15

    lit arshift 1->1          lit 1->2
    #5                        #5
    arshift                   mov r15,$08[rbx]
    add rbx,$18               arshift 2->1
    mov rcx,-$10[rbx]         mov ecx,r15d
    sar r8,CL                 sar r8,CL

    dup lit and swap 1->1     dup 1->2
    lit                       mov r15,r8
    #5                        lit 2->3
    and                       #5
    swap                      mov r9,$10[rbx]
    add rbx,$28               and 3->2
    mov rax,-$18[rbx]         and r15,r9
    sub r13,$08               swap 2->1
    and rax,r8                mov $00[r13],r15
    mov $08[r13],rax          sub r13,$08

    In every case, the result without superinstructions has at most as
    many instructions as the result with superinstructions; in three cases
    (lit !, lit f@, lit f!) the number of instructions is one less without
    the superinstruction (the ip update).
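
    The instruction counts can be tallied from the listing above. The
    following sketch just adds up the native instructions per sequence
    (the counts are read off the listing by hand; the helper name case+
    is made up for this example):

    \ keep two running totals on the stack: with / without superinstruction
    : case+ ( w-total wo-total with without -- w-total' wo-total' )
      rot + >r + r> ;
    0 0
    5 4 case+   \ lit !
    6 6 case+   \ useraddr @
    6 6 case+   \ useraddr !
    3 3 case+   \ lit@ +
    5 4 case+   \ lit f@
    6 5 case+   \ lit f!
    2 2 case+   \ lit+ @
    2 2 case+   \ lit and
    3 3 case+   \ lit arshift
    5 5 case+   \ dup lit and swap
    - .         \ total with minus total without: prints 3

    The difference of 3 comes entirely from the three sequences where the
    superinstruction pays for an extra ip update.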

    In three cases (useraddr @, useraddr !, lit+ @), even the version of
    Gforth that had these superinstructions already preferred the
    non-superinstruction code over the superinstruction (the selection
    is based on code length, but does not take the code for the ip
    update into account).

    In the remaining four cases, the static superinstruction would reduce
    the instruction count, either by combining a load and an ALU
    instruction into a load-and-operate instruction (lit@ +, lit and) or
    by eliminating a mov (lit arshift, dup lit and swap), but the
    additional ip update cancels out this benefit.

    - anton

    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
