gforth-fast has acquired two performance features this summer:
1) Many of the ip updates are now optimized away (all architectures).
2) On AMD64 gforth-fast can now use stack caching with up to 3
registers (previously 1).
For the word
: cubed dup dup * * ;
this results in the following differences in the resulting code:
Before:                 without ip updates      and with 3 regs
$7F75EC8FB240 dup
add $0x8,%rbx
mov %r8,0x0(%r13)       mov %r8,0x0(%r13)       mov %r8,%r15
sub $0x8,%r13           sub $0x8,%r13
$7F75EC8FB248 dup
add $0x8,%rbx
mov %r8,0x0(%r13)       mov %r8,0x0(%r13)
sub $0x8,%r13           sub $0x8,%r13           mov %r15,%r9
$7F75EC8FB250 *
add $0x8,%rbx
imul 0x8(%r13),%r8      imul 0x8(%r13),%r8      imul %r9,%r15
add $0x8,%r13           add $0x8,%r13
$7F75EC8FB258 *
add $0x8,%rbx
imul 0x8(%r13),%r8      imul 0x8(%r13),%r8      imul %r15,%r8
add $0x8,%r13           add $0x8,%r13
$7F75EC8FB260 ;s
mov (%r14),%rbx         mov (%r14),%rbx         mov (%r14),%rbx
add $0x8,%r14           add $0x8,%r14           add $0x8,%r14
mov (%rbx),%rax         mov (%rbx),%rax         mov (%rbx),%rax
jmp *%rax               jmp *%rax               jmp *%rax
(Actually, the real Before variant used a different register
allocation, but the same number of instructions. The shown version is
the engine with optimization, but
Here's a comparison with some other Forth systems on AMD64:
gforth-fast        iforth             SwiftForth x64        VFX Forth 64
mov %r8,%r15       pop rbx            -8 [RBP] RBP LEA      MOV RDX, RBX
mov %r15,%r9       mov rdi, rbx       RBX 0 [RBP] MOV       IMUL RBX, RDX
imul %r9,%r15      imul rdi, rbx      -8 [RBP] RBP LEA      IMUL RBX, RDX
imul %r15,%r8      imul rbx, rdi      RBX 0 [RBP] MOV       RET/NEXT
mov (%r14),%rbx    push rbx           0 [RBP] RAX MOV
add $0x8,%r14      ;                  RBX MUL
mov (%rbx),%rax                       RAX RBX MOV
jmp *%rax                             8 [RBP] RBP LEA
                                      0 [RBP] RAX MOV
                                      RBX MUL
                                      RAX RBX MOV
                                      8 [RBP] RBP LEA
                                      RET
1) Optimize ip updates:
At its heart, gforth (including gforth-fast) is still a threaded-code
system and falls back to threaded code when needed; in particular,
its control flow works through the threaded-code mechanism; e.g., the
;S in the example above loads the threaded-code address of the next
(primitive) word in the caller, and performs a direct-threaded
dispatch. Also, immediate arguments (e.g. for literals) are accessed
through the threaded-code instruction pointer. Therefore Gforth
maintains a threaded-code instruction pointer (ip).
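To make the role of the ip concrete, here is a toy threaded-code
interpreter in Python. It is only a model under stated assumptions:
cells hold primitive names instead of machine-code addresses, and the
"call" primitive and the code layout are made up for this sketch. None
of it is Gforth's actual engine code.

```python
def run(code, entry, stack):
    """Run toy threaded code starting at cell index `entry`.

    `code` is a flat list of cells. Each cell holds a primitive name or
    an immediate argument (hypothetical layout, not Gforth's).
    """
    rstack = []  # return stack: saved ips of callers
    ip = entry   # threaded-code instruction pointer
    while True:
        prim = code[ip]
        ip += 1  # ip update: ip now points at the following cell
        if prim == "lit":
            stack.append(code[ip])  # immediate argument, read through ip
            ip += 1
        elif prim == "dup":
            stack.append(stack[-1])
        elif prim == "*":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        elif prim == "call":
            rstack.append(ip + 1)  # resume after the cell holding the target
            ip = code[ip]
        elif prim == ";s":
            if not rstack:
                return stack  # returning from the outermost word
            ip = rstack.pop()  # load the caller's ip, then dispatch from there
        else:
            raise ValueError(f"unknown primitive: {prim!r}")


CODE = [
    "dup", "dup", "*", "*", ";s",  # cells 0..4: cubed
    "lit", 3, "call", 0, ";s",     # cells 5..9: 3 cubed
]

print(run(CODE, 5, []))  # [27]
```

In this model, lit and call read a cell through ip and ;s replaces ip,
while dup and * never look at it.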
But it does not need to maintain the ip everywhere. In the CUBED
example, no primitive uses the ip of the threaded code cell of the
primitive, so no ip updates are necessary except for restoring the
caller's ip at the end.
And this is what this optimization does, in a nutshell.
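The idea can be sketched as a small pass over a straight-line sequence
of primitives. This is a simplified Python model, not Gforth's
implementation: the classification of which primitives read or set the
ip is an assumption of the sketch, and the real optimization levels may
well be smarter than "catch up right before the ip is read".

```python
CELL = 8  # bytes per threaded-code cell on AMD64

# Assumed classification for this sketch (not taken from Gforth's sources):
READS_IP = {"lit"}  # fetches an immediate argument through ip
SETS_IP = {";s"}    # overwrites ip with the caller's ip


def naive_updates(words):
    """Bytes added to ip in each primitive when every one advances ip."""
    return [(name, 0 if name in SETS_IP else cells * CELL)
            for name, cells in words]


def optimized_updates(words):
    """Bytes added to ip in each primitive when updates are deferred.

    Assumes the sequence ends in a primitive that sets ip.
    """
    out = []
    pending = 0  # bytes by which ip currently lags behind
    for name, cells in words:
        add = 0
        if name in SETS_IP:
            pending = 0  # ip is overwritten, so the lag never matters
        else:
            if name in READS_IP:
                add, pending = pending, 0  # catch up before ip is read
            pending += cells * CELL
        out.append((name, add))
    return out


def count(updates):
    """Number of primitives that still contain an ip update."""
    return sum(1 for _, add in updates if add)


# (name, size in cells) for each primitive of CUBED
CUBED = [("dup", 1), ("dup", 1), ("*", 1), ("*", 1), (";s", 1)]
print(count(naive_updates(CUBED)), count(optimized_updates(CUBED)))  # 4 0
```

For CUBED the model drops all four updates, which matches the four
"add $0x8,%rbx" instructions that disappear in the listing above.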
This optimization is controlled with --opt-ip-updates=n, where n=0
means no ip-update optimization, and higher values of n mean more
optimization; the highest level is currently n=4 (IIRC), and it is the
default.
2) 3 registers for stack caching:
Up until this summer I believed that we would not convince gcc to use
caller-saved registers as additional stack cache registers, and the
dearth of callee-saved registers on AMD64 meant that we were limited
to using 1 register as a stack cache (we have been using 3 registers
on ARM A64 and RISC-V for quite some time). This summer I got an idea
on how to do it, and, with the help of Bernd Paysan, did it; if you
want to read more about it, posting <23-10-001@comp.compilers> in
comp.compilers (on the web:
<https://compilers.iecc.com/comparch/article/23-10-001> or
<http://al.howardknight.net/?ID=169728532800>) discusses the topic in
more depth.
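Why more cache registers pay off can be seen in a rough Python model
that counts memory-stack accesses for a straight-line sequence. It is a
simplification under stated assumptions: the cache simply holds the top
items of the stack, items are loaded and stored one at a time, and the
reconciliation of cache states at control-flow boundaries is ignored.
It does not reproduce how Gforth selects stack-cache states.

```python
# Standard Forth stack effects: (items consumed, items produced)
EFFECTS = {"dup": (1, 2), "drop": (1, 0), "swap": (2, 2),
           "over": (2, 3), "*": (2, 1), "+": (2, 1)}


def memory_accesses(words, cache_regs, cached=1):
    """Count memory-stack loads and stores for a straight-line sequence.

    `cache_regs` is the number of registers usable as stack cache and
    `cached` is the number of stack items held in registers on entry.
    """
    accesses = 0
    for word in words:
        n_in, n_out = EFFECTS[word]
        if cached < n_in:
            accesses += n_in - cached  # load missing inputs from memory
            cached = n_in
        cached += n_out - n_in
        if cached > cache_regs:
            accesses += cached - cache_regs  # store the deepest items
            cached = cache_regs
    return accesses


CUBED = ["dup", "dup", "*", "*"]
print(memory_accesses(CUBED, cache_regs=1))  # 4
print(memory_accesses(CUBED, cache_regs=3))  # 0
```

The counts agree with the CUBED listing above: with one register there
are two stores and two multiplies with a memory operand, with three
registers there are none.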
Here are results for small benchmarks on a Xeon W-1370P (5.2GHz Rocket
Lake):
sieve  bubble matrix fib    fft
0.089  0.131  0.048  0.084  0.031  gforth
0.058  0.066  0.033  0.043  0.014  gforth-fast --ss-states=2 --opt-ip-updates=0
0.052  0.053  0.018  0.036  0.014  gforth-fast --ss-states=2
0.038  0.042  0.014  0.032  0.014  gforth-fast
The new optimizations provide good speedups on Rocket Lake. Sometimes
the ip-update optimization alone helps a lot, sometimes the
combination of both optimizations helps a lot more (for now
--opt-ip-updates=0 does not work with --ss-states=4, so I cannot
completely isolate the effects of the optimizations).
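For reference, the speedup factors implied by the table can be computed
with a few lines of Python. The numbers are copied from the table; the
variable and function names are just for this sketch.

```python
BENCHMARKS = ["sieve", "bubble", "matrix", "fib", "fft"]

# Results copied from the table above, one list per engine configuration
TIMES = {
    "gforth-fast --ss-states=2 --opt-ip-updates=0":
        [0.058, 0.066, 0.033, 0.043, 0.014],
    "gforth-fast --ss-states=2":
        [0.052, 0.053, 0.018, 0.036, 0.014],
    "gforth-fast":
        [0.038, 0.042, 0.014, 0.032, 0.014],
}


def speedups(old, new):
    """Per-benchmark speedup factor of configuration `new` over `old`."""
    return {b: o / n for b, o, n in zip(BENCHMARKS, TIMES[old], TIMES[new])}


BASE = "gforth-fast --ss-states=2 --opt-ip-updates=0"
ip_only = speedups(BASE, "gforth-fast --ss-states=2")
both = speedups(BASE, "gforth-fast")

for b in BENCHMARKS:
    print(f"{b:6s}  ip updates only: {ip_only[b]:.2f}x  both: {both[b]:.2f}x")
```

For example, matrix gains a factor of about 1.8 from the ip-update
optimization alone and about 2.4 with both optimizations, while fft is
unchanged.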
gforth (the debugging engine) does not benefit from either
optimization, because stack caching is disabled for better stack
underflow reporting, and the ip-update optimization is disabled in
order to get proper backtraces in case of exceptions.
- anton
--
M. Anton Ertl
http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs:
http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard:
https://forth-standard.org/
EuroForth 2023:
https://euro.theforth.net/2023