
Stack caching, IP updates, and static superinstructions

From Anton Ertl@21:1/5 to All on Thu Feb 1 08:52:59 2024

Gforth has acquired two optimizations in the last year:

* Stack caching now uses up to three registers on AMD64 (previously up
to one).

* It eliminates most threaded-code instruction-pointer updates.

Gforth has combined sequences of primitives into static
superinstructions for about 23 years, with the following benefits:

* Stack items within a static superinstruction could be held in
registers even if they don't fit into a 1-register stack cache; that
benefit evaporates when we have more stack items in registers.

* There was only one ip update for a static superinstruction; that
benefit not only evaporates with the ip-update optimization, it
turns into a cost: static superinstructions that access a literal
require an ip update.

* Finally, there are benefits that are not covered by these two
optimizations, e.g., the static superinstruction for "< ?branch" can
communicate through the CPU flags register between the < and the
?branch, whereas without superinstruction the flag needs to be
converted into its canonical Forth form (0 or -1).
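
To make the last point concrete, here is a minimal sketch in standard
Forth. The names less-flag and less-branch are made up for this
illustration. In less-flag the result of < has to exist as a canonical
flag on the stack; in less-branch the flag is consumed right away by
the ?branch that IF compiles, which is the situation where a combined
"< ?branch" can skip the conversion. Whether a particular Gforth build
actually combines the two is not shown here.

```forth
\ Hypothetical names, for illustration only.
: less-flag   ( n1 n2 -- flag )  < ;                   \ flag must be materialized
: less-branch ( n1 n2 -- n )     < if 1 else 2 then ;  \ flag feeds ?branch directly

3 5 less-flag .     \ prints -1 (canonical true: all bits set)
5 3 less-flag .     \ prints 0  (canonical false)
3 5 less-branch .   \ prints 1
5 3 less-branch .   \ prints 2
```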

Because of the second point, I have now removed all superinstructions
that access a literal and tested the result. In the test code, each
definition contains a sequence that used to form one superinstruction:

: foo6 5 ! ;
: foo15 base @ ;
: foo19 5 @ and ;
: foo23 base ! ;
: foo26 5 @ + ;
: foo31 5 f@ ;
: foo32 5 f! ;
: foo34 5 + @ ;
: foo37 5 and ;
: foo38 5 arshift ;
: foo39 dup 5 and swap ;

Yes, the words that use 5 as an address cannot be run, but the native
code looks the same whatever literal one uses.
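
A minimal sketch for trying this with words that can also be executed,
assuming a Gforth that provides see-code (a Gforth-specific word, not
standard Forth). The names buf, bar26 and bar37 are hypothetical; bar26
has the same shape as foo26 (literal @ +), but its literal is a valid
address. What see-code prints depends on the Gforth version and build,
so no output is shown for it.

```forth
\ Hypothetical names, for illustration only.
create buf 42 ,                               \ one cell, initialized to 42
: bar26 ( n1 -- n2 )  [ buf ] literal @ + ;   \ literal @ + with a valid address
: bar37 ( n1 -- n2 )  5 and ;                 \ same sequence as foo37

1 bar26 .        \ prints 43
7 bar37 .        \ prints 5
see-code bar26   \ Gforth-specific: shows threaded code with native code
```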

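Here's the disassembled code (without the trailing ;s, but including
any stack state transition before the ;s); the left column is the
version with these superinstructions, the right column the version
without them: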

lit ! 1->1                lit 1->2
#5                        #5
!                         mov r15,$08[rbx]
add rbx,$18               ! 2->0 0->1
mov rax,-$10[rbx]         mov [r15],r8
add r13,$08               mov r8,$08[r13]
mov [rax],r8              add r13,$08
mov r8,$00[r13]

useraddr 1->1             useraddr 1->1
#112                      #112
mov $00[r13],r8           mov $00[r13],r8
sub r13,$08               sub r13,$08
add rbx,$10               add rbx,$10
mov r8,$08[rsp]           mov r8,$10[rsp]
add r8,-$08[rbx]          add r8,-$08[rbx]
@ 1->1                    @ 1->1
mov r8,[r8]               mov r8,[r8]

useraddr 1->2             useraddr 1->2
#112                      #112
add rbx,$10               add rbx,$10
mov r15,$10[rsp]          mov r15,$10[rsp]
add r15,-$08[rbx]         add r15,-$08[rbx]
! 2->0 0->1               ! 2->0 0->1
mov [r15],r8              mov [r15],r8
mov r8,$08[r13]           mov r8,$08[r13]
add r13,$08               add r13,$08

lit@ + 1->1               lit@ 1->2
#5                        #5
+                         mov rax,$08[rbx]
add rbx,$18               mov r15,[rax]
mov rax,-$10[rbx]         + 2->1
add r8,[rax]              add r8,r15

lit f@ 1->1               lit 1->2
#5                        #5
f@                        mov r15,$08[rbx]
add rbx,$18               f@ 2->1
movsd [r12],xmm15         movsd [r12],xmm15
mov rax,-$10[rbx]         movsd xmm15,[r15]
sub r12,$08               sub r12,$08
movsd xmm15,[rax]

lit f! 1->1               lit 1->2
#5                        #5
f!                        mov r15,$08[rbx]
add rbx,$18               f! 2->1
mov rdx,-$10[rbx]         mov rax,r12
mov rax,r12               movsd [r15],xmm15
lea r12,$08[r12]          lea r12,$08[r12]
movsd [rdx],xmm15         movsd xmm15,$08[rax]
movsd xmm15,$08[rax]

lit+ 1->1                 lit+ 1->1
#5                        #5
add r8,$08[rbx]           add r8,$08[rbx]
@ 1->1                    @ 1->1
mov r8,[r8]               mov r8,[r8]

lit and 1->1              lit 1->2
#5                        #5
and                       mov r15,$08[rbx]
add rbx,$18               and 2->1
and r8,-$10[rbx]          and r8,r15

lit arshift 1->1          lit 1->2
#5                        #5
arshift                   mov r15,$08[rbx]
add rbx,$18               arshift 2->1
mov rcx,-$10[rbx]         mov ecx,r15d
sar r8,CL                 sar r8,CL

dup lit and swap 1->1     dup 1->2
lit                       mov r15,r8
#5                        lit 2->3
and                       #5
swap                      mov r9,$10[rbx]
add rbx,$28               and 3->2
mov rax,-$18[rbx]         and r15,r9
sub r13,$08               swap 2->1
and rax,r8                mov $00[r13],r15
mov $08[r13],rax          sub r13,$08

In every case, the result without superinstructions has at most as
many instructions as the result with superinstructions; in three cases
(lit !, lit f@, lit f!) the number of instructions is one less without
the superinstruction (the ip update).

In three cases (useraddr @, useraddr !, lit+ @), even the Gforth
version with these superinstructions already chose the
non-superinstruction code over the superinstruction (the selection is
based on code length, but does not take the code for the ip update
into account).

In the remaining four cases, the static superinstruction would reduce
the instruction count, either by combining a load and an ALU
instruction into a load-and-operate instruction (lit@ +, lit and) or
by eliminating a mov (lit arshift, dup lit and swap), but the
additional ip update cancels out this benefit.

- anton

--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

