In June 2023 gforth-fast on AMD64 imlemented only stack caching with
one register, and every primitive or superinstruction also had an
instruction pointer (ip) update. Since then, we managed to have up to
three stack items in a register and eliminated most ip-updates. As an
example, let's look at DECIMAL. It's source code is (clarified):
: decimal ( -- )
#10 base ! ;
Here's what SEE-CODE DECIMAL shows for two versions:
June 2023 November 2023
lit 1->1 lit 1->2
#10 #10
mov [r10],r8 mov r14,$08[r13]
add rbx,$10
mov r8,-$10[rbx]
sub r10,$08
useraddr ! 1->1 useraddr 2->3
#112 #112
! add r13,$20
mov rax,[rbx] mov r15,$08[rsp]
mov rdi,$08[rsp] add r15,-$08[r13]
add r10,$08 ! 3->1
add rbx,$18 mov [r15],r14
mov [rdi][rax],r8
mov r8,[r10]
;s 1->1 ;s 1->1
mov rbx,$00[r13] mov rax,[r10]
add r13,$08 add r10,$08
add rbx,$08 mov r13,rax
mov rcx,-$08[rbx] mov rax,$00[r13]
jmp ecx jmp eax
We still used the same primitives. In November Gforth's code
generator chose not to use the "useraddr !" static superinstruction,
because static superinstructions as currently implemented only work
for 1->1 stack state transitions. So this static superinstruction is
less valuable than it used to be. In the meantime, with these
enhanced and new optimizations, we eliminated 13 superinstructions,
leaving 36.
In recent days we discovered that these optimizations also reduce the
need for primitives with immediate arguments like USERADDR above. The
November USERADDR starts with an ip update ("add r13, ..."), because
the immediate argument requires the ip to be up-to-date. In the
current version, USERADDR with immediate argument 112 is replaced with
"UP@ 112 +", and here you see the two pieces of code:
November 2023 February 2024
lit 1->2 lit 1->2
#10 #10
mov r14,$08[r13] mov r15,$08[rbx]
useraddr 2->3 up@ 2->3
#112 mov r9,$18[rsp]
add r13,$20 lit+ 3->3
mov r15,$08[rsp] #112
add r15,-$08[r13] add r9,$20[rbx]
! 3->1 ! 3->1
mov [r15],r14 mov [r9],r15
;s 1->1 ;s 1->1
mov rax,[r10] mov rbx,[r14]
add r10,$08 add r14,$08
mov r13,rax mov rax,[rbx]
mov rax,$00[r13] jmp eax
jmp eax
If you delete the ip update from the start of USERADDR, you get the
same code as for "UP@ LIT+" (if you ignore the different register
allocation and the different offset of UP from rsp in memory.
One other difference is that useraddr accessed its immediate argument
with offset -8 (because ip points right after the immediate at this
point), while LIT+ accesses its immediate argument with offset $20.
This is because for a few frequent primitives (like LIT+) we have
several versions for different offsets from the actual ip relative to
the instruction, and we therefore just choose the right version of
LIT+ rather than updating the ip in many cases, including the one in
DECIMAL.
With this extra optimization, UP@ LIT+ comes out ahead of USERADDR in
this case. But even without it, they would both produce three
instructions.
Given that UP@ 112 + is proper Forth code that can be interpreted,
while USERADDR with immediate argument 112 is a special-purpose
primitive that is only useful in compiled code, UP@ 112 + looks
preferable to me.
So in the last few days we replaced a number of primitives having to
do with the user pointer, the locals pointer, or the current object by sequences that usually involve UP@, LP@, or O.
So having these optimizations also results in a cleaner set of
primitives.
You may also wonder why the code for ;S in the November version
contains the register-register move that could be optimized away (and
that is not present in the February 2024 version
shown). Unfortunately, I have no idea. The gcc versions used were
12.2 for the November version and 10.2 for the February version.
- anton
--
M. Anton Ertl
http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs:
http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard:
https://forth-standard.org/
EuroForth 2023:
https://euro.theforth.net/2023
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)