Forum: >>> Magnum BBS <<<

How stack caching and ip-update optimization influence the primitives

From Anton Ertl@21:1/5 to All on Mon Feb 5 12:40:00 2024

In June 2023 gforth-fast on AMD64 imlemented only stack caching with
one register, and every primitive or superinstruction also had an
instruction pointer (ip) update. Since then, we managed to have up to
three stack items in a register and eliminated most ip-updates. As an
example, let's look at DECIMAL. It's source code is (clarified):

: decimal ( -- )
#10 base ! ;

Here's what SEE-CODE DECIMAL shows for two versions:

June 2023 November 2023
lit 1->1 lit 1->2
#10 #10
mov [r10],r8 mov r14,$08[r13]
add rbx,$10
mov r8,-$10[rbx]
sub r10,$08
useraddr ! 1->1 useraddr 2->3
#112 #112
! add r13,$20
mov rax,[rbx] mov r15,$08[rsp]
mov rdi,$08[rsp] add r15,-$08[r13]
add r10,$08 ! 3->1
add rbx,$18 mov [r15],r14
mov [rdi][rax],r8
mov r8,[r10]
;s 1->1 ;s 1->1
mov rbx,$00[r13] mov rax,[r10]
add r13,$08 add r10,$08
add rbx,$08 mov r13,rax
mov rcx,-$08[rbx] mov rax,$00[r13]
jmp ecx jmp eax

We still used the same primitives. In November Gforth's code
generator chose not to use the "useraddr !" static superinstruction,
because static superinstructions as currently implemented only work
for 1->1 stack state transitions. So this static superinstruction is
less valuable than it used to be. In the meantime, with these
enhanced and new optimizations, we eliminated 13 superinstructions,
leaving 36.

In recent days we discovered that these optimizations also reduce the
need for primitives with immediate arguments like USERADDR above. The
November USERADDR starts with an ip update ("add r13, ..."), because
the immediate argument requires the ip to be up-to-date. In the
current version, USERADDR with immediate argument 112 is replaced with
"UP@ 112 +", and here you see the two pieces of code:

November 2023 February 2024
lit 1->2 lit 1->2
#10 #10
mov r14,$08[r13] mov r15,$08[rbx]
useraddr 2->3 up@ 2->3
#112 mov r9,$18[rsp]
add r13,$20 lit+ 3->3
mov r15,$08[rsp] #112
add r15,-$08[r13] add r9,$20[rbx]
! 3->1 ! 3->1
mov [r15],r14 mov [r9],r15
;s 1->1 ;s 1->1
mov rax,[r10] mov rbx,[r14]
add r10,$08 add r14,$08
mov r13,rax mov rax,[rbx]
mov rax,$00[r13] jmp eax
jmp eax

If you delete the ip update from the start of USERADDR, you get the
same code as for "UP@ LIT+" (if you ignore the different register
allocation and the different offset of UP from rsp in memory.

One other difference is that useraddr accessed its immediate argument
with offset -8 (because ip points right after the immediate at this
point), while LIT+ accesses its immediate argument with offset $20.
This is because for a few frequent primitives (like LIT+) we have
several versions for different offsets from the actual ip relative to
the instruction, and we therefore just choose the right version of
LIT+ rather than updating the ip in many cases, including the one in
DECIMAL.

With this extra optimization, UP@ LIT+ comes out ahead of USERADDR in
this case. But even without it, they would both produce three
instructions.

Given that UP@ 112 + is proper Forth code that can be interpreted,
while USERADDR with immediate argument 112 is a special-purpose
primitive that is only useful in compiled code, UP@ 112 + looks
preferable to me.

So in the last few days we replaced a number of primitives having to
do with the user pointer, the locals pointer, or the current object by sequences that usually involve UP@, LP@, or O.

So having these optimizations also results in a cleaner set of
primitives.

You may also wonder why the code for ;S in the November version
contains the register-register move that could be optimized away (and
that is not present in the February 2024 version
shown). Unfortunately, I have no idea. The gcc versions used were
12.2 for the November version and 10.2 for the February version.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From minforth@21:1/5 to All on Mon Feb 5 15:50:30 2024

With all due respect, is this really a suitable benchmark for the optimizer ??

My Forth does:

# see decimal
C DECIMAL <137>
0000000000401ee3 <mf_DECIMAL>:
401ee3: 48 c7 05 82 64 02 00 0a 00 00 00 movq $0xa,0x26482(%rip) # 428370 <MFBASE>
401eee: c3 retq

or
0000000000401eef <mf_HEX>:
401eef: 48 c7 05 76 64 02 00 10 00 00 00 movq $0x10,0x26476(%rip) # 428370 <MFBASE>
401efa: c3 retq

I had expected gforth to compile similarly compact.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to minforth on Mon Feb 5 16:37:57 2024

minforth@gmx.net (minforth) writes:

With all due respect, is this really a suitable benchmark for the optimizer ??

Benchmark? My article is about a specific point, and DECIMAL is a
good example for demonstrating this point.

For benchmarking I run various programs, e.g., those from <http://www.complang.tuwien.ac.at/forth/appbench.zip>.

My Forth does:

# see decimal
C DECIMAL <137>
0000000000401ee3 <mf_DECIMAL>:
401ee3: 48 c7 05 82 64 02 00 0a 00 00 00 movq $0xa,0x26482(%rip) # 428370 <MFBASE>
401eee: c3 retq

Congratulations.

I had expected gforth to compile similarly compact.

Why? Gforth is still based on threaded code, and it shows in,
e.g., needing threaded-code instruction pointer to access any immediate arguments.

Looking at DECIMAL in proper native-code compilers is interesting:

iforth:
$10138348 : DECIMAL 488BC04883ED088F4500 H.@H.m..E. $10138352 mov rbx, #10 d# 48C7C30A000000 HGC.... $10138359 push rbx 53 S
$1013835A lea rbx, [r15 #520 +] qword
498D9F08020000 I...... $10138361 pop [rbx] qword 8F03 ..
$10138363 ; 488B45004883C508FFE0 H.E.H.E..` ok

lxf:
80498D2 8D4620 lea eax , [esi+20h]
80498D5 B90A000000 mov ecx , # Ah
80498DA 8908 mov [eax] , ecx
80498DC C3 ret near

SwiftForth x64:
40A392 A # 88 [RSI] MOV 48C786880000000A000000
40A39D RET C3 ok

VFX64:
( 0041CEA0 48C746300A000000 ) MOV QWord [RSI+30], # 0000000A
( 0041CEA8 C3 ) RET/NEXT
( 9 bytes, 2 instructions )

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	300
Nodes:	16 (2 / 14)
Uptime:	52:34:56
Calls:	6,712
Files:	12,243
Messages:	5,355,180

How stack caching and ip-update optimization influence the primitives

Who's Online

System Info