• Another look at Gforth's locals implementation

    From Anton Ertl@21:1/5 to All on Tue Apr 9 15:59:58 2024
    30 years ago I wrote a paper about Gforth's locals and also gave some performance results [ertl94l]. Since then the implementation of
    Gforth, and also of locals has changed quite a bit; in particular,
    recently we have improved the code generated for TO <local>.

    So here I look at it again. As a first example, I look at the fib
    benchmark (which has a one-off error, but that makes little difference):

    no locals with a local

    : fib ( n1 -- n2 ) : fib { n -- n2 }
    dup 2 < if n 2 < if
    drop 1 1
    else else
    dup n 1- recurse
    1- recurse n 2 - recurse
    swap 2 - recurse +
    + then ;
    then ;

    For 43 fib the performance results (AMD64 instructions, Skylake
    cycles) are:

    no locals with a local factor
    13_805_561_653 15_012_002_369 1.09
    41_475_949_910 48_490_119_217 1.17

    So those of you who dislike locals can still point to worse
    performance of code with locals on Gforth.

    Another example is string comparison, where I compared three versions:

    \ no locals
    : strcmp1 ( addr1 u1 addr2 u2 -- n )
    rot 2dup 2>r min 0 ?do ( a1 a2 )
    over c@ over c@ - dup if
    nip nip 2rdrop unloop exit then
    drop
    char+ swap char+ swap
    loop
    2drop r> r> - ;

    \ locals with TO
    : strcmp2
    { addr1 u1 addr2 u2 -- n }
    u1 u2 min 0
    ?do
    addr1 c@ addr2 c@ - dup if
    unloop exit then
    drop
    addr1 char+ TO addr1
    addr2 char+ TO addr2
    loop
    u1 u2 - ;

    \ locals without TO
    : strcmp3
    { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
    s1 c@ s2 c@ - dup if
    unloop exit then
    drop
    s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;

    I am too lazy to benchmark this, but here's the number of
    native-instruction bytes in the inner loop:

    129 strcmp1
    128 strcmp2
    149 strcmp3

    Here's the code for body of the loop:

    strcmp1 strcmp2 strcmp3
    over 1->1 @local0 1->1 >l 1->1
    mov $00[r13],r8 mov $00[r13],r8 mov rax,rbp
    sub r13,$08 sub r13,$08 add r13,$08
    mov r8,$10[r13] mov r8,$00[rbp] lea rbp,-$08[rbp]
    c@ 1->1 c@ 1->1 mov -$08[rax],r8
    movzx r8d,bytePTR[r8] movzx r8d,bytePTR[r8] mov r8,$00[r13]
    over 1->2 @local2 1->2 >l @local0 1->1
    mov r15,$08[r13] mov r15,$10[rbp] @local0
    c@ 2->2 c@ 2->2 mov rax,rbp
    movzx r15d,bytePTR[r15] movzx r15d,bytePTR[r15] lea rbp,-$08[rbp]
    - 2->1 - 2->1 mov -$08[rax],r8
    sub r8,r15 sub r8,r15 c@ 1->1
    dup 1->2 dup 1->2 movzx r8d,bytePTR[r8]
    mov r15,r8 mov r15,r8 @local1 1->2
    ?branch 2->1 ?branch 2->1 mov r15,$08[rbp] <strcmp1+$A8> <strcmp2+$C0> c@ 2->2
    add rbx,$68 add rbx,$68 movzx r15d,bytePTR[r15]
    mov rax,[rbx] mov rax,[rbx] - 2->1
    test r15,r15 test r15,r15 sub r8,r15
    jnz $7FED3DB2C317 jnz $7FED3DB2C0DC dup 1->2
    jmp eax jmp eax mov r15,r8
    nip 1->1 unloop 1->1 ?branch 2->1
    add r13,$08 add r14,$10 <strcmp3+$E0>
    nip 1->1 lit 1->2 add rbx,$78
    add r13,$08 #32 mov rax,[rbx]
    2rdrop 1->1 sub rbx,$10 test r15,r15
    add r14,$10 mov r15,-$08[rbx] jnz $7FED3DB2C208 unloop 1->1 lp+! 2->1 jmp eax
    add r14,$10 add rbp,r15 unloop 1->1
    ;s 1->1 ;s 1->1 add r14,$10
    mov rbx,[r14] mov rbx,[r14] lit 1->2
    add r14,$08 add r14,$08 #48
    mov rax,[rbx] mov rax,[rbx] sub rbx,$10
    jmp eax jmp eax mov r15,-$08[rbx] drop 1->1 drop 1->1 lp+! 2->1
    mov r8,$08[r13] mov r8,$08[r13] add rbp,r15
    add r13,$08 add r13,$08 ;s 1->1
    char+ 1->1 @local0 1->2 mov rbx,[r14]
    add r8,$01 mov r15,$00[rbp] add r14,$08
    swap 1->2 char+ 2->2 mov rax,[rbx]
    mov r15,$08[r13] add r15,$01 jmp eax
    add r13,$08 !local0 2->1 drop 1->0
    char+ 2->2 mov $00[rbp],r15 @local0 0->1
    add r15,$01 @local2 1->2 mov r8,$00[rbp]
    swap 2->1 mov r15,$10[rbp] char+ 1->1
    mov $00[r13],r15 char+ 2->2 add r8,$01
    sub r13,$08 add r15,$01 @local1 1->1
    (loop) 1->1 !local2 2->1 mov $00[r13],r8 <strcmp1+$40> mov $10[rbp],r15 sub r13,$08
    sub rbx,$68 (loop) 1->1 mov r8,$08[rbp]
    mov rax,[r14] <strcmp2+$58> char+ 1->1
    add rax,$01 sub rbx,$68 add r8,$01
    cmp $08[r14],rax mov rax,[r14] (loop)-lp+!# 1->1
    mov [r14],rax add rax,$01 <strcmp3+$68>
    jz $7FED3DB2C36C cmp $08[r14],rax #16
    mov rax,[rbx] mov [r14],rax add rbx,$40
    jmp eax jz $7FED3DB2C130 mov rax,[r14]
    mov rax,[rbx] mov rsi,-$10[rbx]
    jmp eax add rax,$01
    cmp $08[r14],rax
    jz $7FED3DB2C25F
    add rbp,-$08[rbx]
    mov [r14],rax
    mov rbx,rsi
    mov rax,[rsi]
    jmp eax

    @InProceedings{ertl94l,
    author = "M. Anton Ertl",
    title = "Automatic Scoping of Local Variables",
    booktitle = "EuroForth~'94 Conference Proceedings",
    year = "1994",
    address = "Winchester, UK",
    pages = "31--37",
    url = "http://www.complang.tuwien.ac.at/papers/ertl94l.ps.gz",
    abstract = "In the process of lifting the restrictions on using
    locals in Forth, an interesting problem poses
    itself: What does it mean if a local is defined in a
    control structure? Where is the local visible? Since
    the user can create every possible control structure
    in ANS Forth, the answer is not as simple as it may
    seem. Ideally, the local is visible at a place if
    the control flow {\em must} pass through the
    definition of the local to reach this place. This
    paper discusses locals in general, the visibility
    problem, its solution, the consequences and the
    implementation as well as related programming style
    questions."
    }

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Tue Apr 9 19:43:55 2024
    This is consistent with my observations. There is typically
    a speed difference of around 10 per cent between code variants
    with stack juggling and with locals. The difference is irrelevant
    for the vast majority of tasks.

    The gain in readability, on the other hand, is often enormous.
    But that's another old, worn-out discussion.

    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to minforth on Tue Apr 9 20:54:47 2024
    minforth wrote:

    This is consistent with my observations. There is typically
    a speed difference of around 10 per cent between code variants
    with stack juggling and with locals. The difference is irrelevant
    for the vast majority of tasks.

    The gain in readability, on the other hand, is often enormous.
    But that's another old, worn-out discussion.

    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.

    Even without putting them in registers it is faster (Ryzen 7 5800X):

    : bench
    cr timer-reset #43 fib .elapsed ." ( " . ." )"
    cr timer-reset #43 fib2 .elapsed ." ( " . ." )" ;

    FORTH> bench
    1.658 seconds elapsed. ( 701408733 )
    1.564 seconds elapsed. ( 701408733 ) ok

    Here is fib2:
    : fib2 params| n |
    n 2 < if 1
    else n 1- recurse
    n 2- recurse
    +
    endif ;

    FORTH> ' fib2 idis
    $014588C0 : fib2
    $014588CA pop rbx
    $014588CB cmp rbx, 2 b#
    $014588CF lea rsi, [rsi #-16 +] qword
    $014588D3 mov [rsi] qword, rbx
    $014588D6 mov rbx, rcx
    $014588D9 jge $014588EB offset NEAR
    $014588DF mov rbx, 1 d#
    $014588E6 jmp $01458921 offset NEAR
    $014588EB mov rbx, [rsi] qword
    $014588EE lea rbx, [rbx -1 +] qword
    $014588F2 lea rbp, [rbp -8 +] qword
    $014588F6 mov [rbp 0 +] qword, $01458903 d#
    $014588FE jmp $014588CB offset NEAR
    $01458903 mov rbx, [rsi] qword
    $01458906 lea rbx, [rbx -2 +] qword
    $0145890A lea rbp, [rbp -8 +] qword
    $0145890E mov [rbp 0 +] qword, $0145891B d#
    $01458916 jmp $014588CB offset NEAR
    $0145891B pop rbx
    $0145891C pop rdi
    $0145891D lea rbx, [rdi rbx*1] qword
    $01458921 push rbx
    $01458922 lea rsi, [rsi #16 +] qword
    $01458926 ;

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to dxf on Wed Apr 10 05:25:48 2024
    dxf wrote:

    On 10/04/2024 6:54 am, mhx wrote:
    minforth wrote:
    [..]
    VFX doesn't put much effort into optimizing locals code. Should they have? Or would that have encouraged users to write code that lacks thought and leave it to the compiler to make efficient? Which is the C approach.

    I wouldn't know what those authors think, believe, or try to enforce.

    There has been no effort spent in optimizing iForth's locals. The shown
    result is simply a byproduct of the architecture of the compiler.

    It is not my task to guide the users, they can do whatever they ask.
    I believe my actions are consistent with that.

    Locals are completely OK if you want to get stuff done. I use them for
    the boring parts.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to dxf on Wed Apr 10 06:58:17 2024
    dxf wrote:

    On 10/04/2024 3:25 pm, mhx wrote:
    dxf wrote:

    Do you have a disassembly for Fib1? I ask because NT/Forth which purports
    to do exactly that produces notably different results - despite there being little 'stack juggling' to resolve.

    Here it is:

    FORTH> : fib ( n1 -- n2 ) dup 2 < if drop 1 else dup 1- recurse swap 2 - recurse + then ; ok
    FORTH> 10 fib . 89 ok
    FORTH> see fib
    Flags: ANSI
    $01340A40 : fib
    $01340A4A pop rbx
    $01340A4B cmp rbx, 2 b#
    $01340A4F jge $01340A61 offset NEAR
    $01340A55 mov rbx, 1 d#
    $01340A5C jmp $01340A95 offset NEAR
    $01340A61 push rbx
    $01340A62 lea rbx, [rbx -1 +] qword
    $01340A66 lea rbp, [rbp -8 +] qword
    $01340A6A mov [rbp 0 +] qword, $01340A77 d#
    $01340A72 jmp $01340A4B offset NEAR
    $01340A77 pop rbx
    $01340A78 pop rdi
    $01340A79 push rbx
    $01340A7A lea rbx, [rdi -2 +] qword
    $01340A7E lea rbp, [rbp -8 +] qword
    $01340A82 mov [rbp 0 +] qword, $01340A8F d#
    $01340A8A jmp $01340A4B offset NEAR
    $01340A8F pop rbx
    $01340A90 pop rdi
    $01340A91 lea rbx, [rdi rbx*1] qword
    $01340A95 push rbx
    $01340A96 ;

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to minforth on Wed Apr 10 07:00:38 2024
    minforth@gmx.net (minforth) writes:
    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.

    Stack items and locals are just different ways of expressing the same
    data flow. Ideally a Forth system maps both expressions to the same
    optimal way of expressing the data flow in machine code. There is one
    example where this actually happens: Consider:

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    These four ways of expressing 3DUP are all commpiled to exactly the
    same code by lxf/ntf:

    804FC0A 8B4500 mov eax , [ebp]
    804FC0D 8945F4 mov [ebp-Ch] , eax
    804FC10 8B4504 mov eax , [ebp+4h]
    804FC13 8945F8 mov [ebp-8h] , eax
    804FC16 895DFC mov [ebp-4h] , ebx
    804FC19 8D6DF4 lea ebp , [ebp-Ch]
    804FC1C C3 ret near

    However, it's more difficult when control flow is involved, and most native-code compilers register-allocate only in straight-line code.
    For comparison, here's what gforth-fast currently gives you:

    3dup.1 3dup.2 3dup.3 3dup.4
    r 1->0 third 1->2 >l >l 1->1 dup 1->1
    mov -$08[r14],r8 mov r15,$10[r13] >l mov $00[r13],r8
    sub r14,$08 third 2->3 mov -$08[rbp],r8 sub r13,$08
    2dup 0->2 mov r9,$08[r13] mov rdx,$08[r13] 2over 1->3
    mov r8,$10[r13] third 3->1 mov rax,rbp mov r15,$18[r13]
    mov r15,$08[r13] mov $00[r13],r8 add r13,$10 mov r9,$10[r13]
    i 2->3 sub r13,$18 lea rbp,-$10[rbp] rot 3->1
    mov r9,[r14] mov $10[r13],r15 mov -$10[rax],rdx mov $00[r13],r15
    -rot 3->2 mov $08[r13],r9 mov r8,$00[r13] sub r13,$10
    mov $00[r13],r9 ;s 1->1 >l @local0 1->1 mov $08[r13],r9
    sub r13,$08 mov rbx,[r14] @local0 ;s 1->1
    2->1 add r14,$08 mov rax,rbp mov rbx,[r14]
    mov -$08[r13],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
    sub r13,$10 jmp eax mov -$08[rax],r8 mov rax,[rbx]
    mov $10[r13],r8 @local1 1->2 jmp eax
    mov r8,[r14] mov r15,$08[rbp]
    add r14,$08 @local2 2->1
    ;s 1->1 mov -$08[r13],r15
    mov rbx,[r14] sub r13,$10
    add r14,$08 mov $10[r13],r8
    mov rax,[rbx] mov r8,$10[rbp]
    jmp eax @local0 1->2
    mov r15,$00[rbp]
    @local1 2->3
    mov r9,$08[rbp]
    @local2 3->1
    mov -$10[r13],r9
    sub r13,$18
    mov $10[r13],r15
    mov $18[r13],r8
    mov r8,$10[rbp]
    lit 1->2
    #24
    mov r15,$50[rbx]
    lp+! 2->1
    add rbp,r15
    ;s 1->1
    mov rbx,[r14]
    add r14,$08
    mov rax,[rbx]
    jmp eax

    Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
    load two items from the memory parts of the stack, store three items
    to the memory part of the stack, and one stack-pointer update. The
    differences are:

    64-bit vs. 32-bit cells
    different way of returning to the caller (;S code vs. ret)
    sub vs. lea for the stack-pointer update
    different order of loads and stores
    different register allocation

    But this is just a happy coincidence, and the other versions make it
    clear that gforth-fast does not analyse the data flow even in
    straight-line code.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to mhx on Wed Apr 10 07:34:38 2024
    mhx wrote:
    I wouldn't know what those authors think, believe, or try to enforce.

    There has been no effort spent in optimizing iForth's locals. The shown result is simply a byproduct of the architecture of the compiler.

    It is not my task to guide the users, they can do whatever they ask.
    I believe my actions are consistent with that.

    Locals are completely OK if you want to get stuff done. I use them for
    the boring parts.

    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed,
    if any, is simply too small.

    In such a case, you first rethink the task and the algorithm and,
    if necessary, write a few lines in assembler (or C) after profiling.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to dxf on Wed Apr 10 10:38:43 2024
    dxf wrote:

    On 10/04/2024 5:34 pm, minforth wrote:
    mhx wrote:

    Locals are completely OK if you want to get stuff done. I use them for
    the boring parts.
    ...
    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed,
    if any, is simply too small.

    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness. It begs the question why use forth at all
    if one is merely going to toy with it.

    Yes, of course it's entirely up to you if you don't fully utilise the
    available possibilities of your tools. But I also suspect that your Forth system has no locals at all.

    To give you another example: Dr Noble's formula translator. Certainly not
    a toy of an incapable programmer, as you imply.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Wed Apr 10 17:06:45 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    [deleted the versions not of interest in this posting
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;
    ...
    For comparison, here's what gforth-fast currently gives you:

    3dup.2 3dup.4
    third 1->2 dup 1->1
    mov r15,$10[r13] mov $00[r13],r8
    third 2->3 sub r13,$08
    mov r9,$08[r13] 2over 1->3
    third 3->1 mov r15,$18[r13]
    mov $00[r13],r8 mov r9,$10[r13]
    sub r13,$18 rot 3->1
    mov $10[r13],r15 mov $00[r13],r15
    mov $08[r13],r9 sub r13,$10
    ;s 1->1 mov $08[r13],r9
    mov rbx,[r14] ;s 1->1
    add r14,$08 mov rbx,[r14]
    mov rax,[rbx] add r14,$08
    jmp eax mov rax,[rbx]
    jmp eax

    Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
    load two items from the memory parts of the stack, store three items
    to the memory part of the stack, and one stack-pointer update. The >differences are:

    64-bit vs. 32-bit cells
    different way of returning to the caller (;S code vs. ret)
    sub vs. lea for the stack-pointer update
    different order of loads and stores
    different register allocation

    But this is just a happy coincidence, and the other versions make it
    clear that gforth-fast does not analyse the data flow even in
    straight-line code.

    3DUP.4 also performs these two loads and three stores (again in a
    different order), but it distributes the stack pointer update across
    two instructions.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to All on Wed Apr 10 16:50:36 2024
    The local version loses track of constants. If we reorder test.5 :

    FORTH> : test.6 1 2 3 3dup.1 3dup.3 3dup.4 3dup.2 .S ; ok
    FORTH> see test.6
    Flags: ANSI
    $01341100 : test.6
    $0134110A push 1 b#
    $0134110C push 2 b#
    $0134110E push 3 b#
    $01341110 push 1 b#
    $01341112 push 2 b#
    $01341114 push 3 b#
    $01341116 push 1 b#
    $01341118 push 2 b#
    $0134111A push 3 b#
    $0134111C push 1 b#
    $0134111E push 2 b#
    $01341120 push 3 b#
    $01341122 mov rbx, [rsp #16 +] qword
    $01341127 mov rcx, rbx
    $0134112A mov rbx, [rsp 8 +] qword
    $0134112F push rcx
    $01341130 mov rcx, rbx
    $01341133 mov rbx, [rsp 8 +] qword
    $01341138 push rcx
    $01341139 push rbx
    $0134113A jmp .S+10 ( $012BEA0A ) offset NEAR
    $0134113F ;

    Maybe in vsn 7.0 :--)

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to Anton Ertl on Wed Apr 10 16:35:04 2024
    Anton Ertl wrote:

    However, it's more difficult when control flow is involved, and most native-code compilers register-allocate only in straight-line code.
    For comparison, here's what gforth-fast currently gives you:
    [..]
    Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
    load two items from the memory parts of the stack, store three items
    to the memory part of the stack, and one stack-pointer update. The differences are:
    ..
    Nice example. Here is what iForth does:

    FORTH> : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ; ok
    FORTH> : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ; ok
    FORTH> : 3dup.3 params| a b c | a b c a b c ; ok
    FORTH> : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ; ok
    FORTH> : test.1 3dup.1 .S ; ok
    FORTH> : test.2 3dup.2 .S ; ok
    FORTH> : test.3 3dup.3 .S ; ok
    FORTH> : test.4 3dup.4 .S ; ok
    FORTH> see test.1
    Flags: ANSI
    $01340F80 : test.1
    $01340F8A pop rbx
    $01340F8B pop rdi
    $01340F8C mov rax, [rsp] qword
    $01340F90 push rdi
    $01340F91 push rbx
    $01340F92 push rax
    $01340F93 push rdi
    $01340F94 push rbx
    $01340F95 jmp .S+10 ( $012BEA0A ) offset NEAR
    FORTH> see test.2
    Flags: ANSI
    $01340FC0 : test.2
    $01340FCA mov rbx, [rsp #16 +] qword
    $01340FCF mov rcx, rbx
    $01340FD2 mov rbx, [rsp 8 +] qword
    $01340FD7 push rcx
    $01340FD8 mov rcx, rbx
    $01340FDB mov rbx, [rsp 8 +] qword
    $01340FE0 push rcx
    $01340FE1 push rbx
    $01340FE2 jmp .S+10 ( $012BEA0A ) offset NEAR
    FORTH> see test.3
    Flags: ANSI
    $01341000 : test.3
    $0134100A pop rbx
    $0134100B pop rdi
    $0134100C mov rax, [rsp] qword
    $01341010 push rdi
    $01341011 push rbx
    $01341012 push rax
    $01341013 push rdi
    $01341014 push rbx
    $01341015 jmp .S+10 ( $012BEA0A ) offset NEAR
    FORTH> see test.4
    Flags: ANSI
    $01341040 : test.4
    $0134104A pop rbx
    $0134104B pop rdi
    $0134104C mov rax, [rsp] qword
    $01341050 push rdi
    $01341051 push rbx
    $01341052 push rax
    $01341053 push rdi
    $01341054 push rbx
    $01341055 jmp .S+10 ( $012BEA0A ) offset NEAR

    FORTH> ( and finally all together ... )
    FORTH> : test.5 1 2 3 3dup.1 3dup.2 3dup.3 3dup.4 .S ;

    FORTH> see test.5
    Flags: ANSI
    $01341080 : test.4
    $0134108A push 1 b#
    $0134108C push 2 b#
    $0134108E push 3 b#
    $01341090 push 1 b#
    $01341092 push 2 b#
    $01341094 push 3 b#
    $01341096 mov rbx, [rsp #16 +] qword
    $0134109B mov rcx, rbx
    $0134109E mov rbx, [rsp 8 +] qword
    $013410A3 push rcx
    $013410A4 mov rcx, rbx
    $013410A7 mov rbx, [rsp 8 +] qword
    $013410AC mov rdi, [rsp] qword
    $013410B0 push rcx
    $013410B1 push rbx
    $013410B2 push rdi
    $013410B3 push rcx
    $013410B4 push rbx
    $013410B5 push rdi
    $013410B6 push rcx
    $013410B7 push rbx
    $013410B8 jmp .S+10 ( $012BEA0A ) offset NEAR

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to mhx on Wed Apr 10 20:59:49 2024
    mhx wrote:

    The local version loses track of constants.

    Interesting aspect. Within the test words all 3dup variants had been
    inlined, I assume.
    However inlining semanthically identic words with locals is
    "not commutative", to grossly abuse the algebraic expression. Hmm...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to dxforth@gmail.com on Thu Apr 11 11:49:02 2024
    In article <66174655$1@news.ausics.net>, dxf <dxforth@gmail.com> wrote:
    On 10/04/2024 8:38 pm, minforth wrote:
    dxf wrote:

    On 10/04/2024 5:34 pm, minforth wrote:
    mhx wrote:

    Locals are completely OK if you want to get stuff done. I use them for >>>>> the boring parts.
    ...
    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed, >>>> if any, is simply too small.

    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness.  It begs the question why use forth at all >>> if one is merely going to toy with it.

    Yes, of course it's entirely up to you if you don't fully utilise the
    available possibilities of your tools.

    In calling locals 'a tool' or 'used for the boring stuff' I'd have the >problem of explaining to colleagues how it is I can use a language in a
    way that its creator has dismissed as antithetical.

    But I also suspect that your Forth
    system has no locals at all.

    DX-Forth offers locals as a loadable option. They're implemented
    efficiently so as to be credible. I'm not aware of any user that has
    availed themselves of it. Indeed I find they come to forth looking
    for something that's honestly unique and challenging. To discover
    Moore still resolute after all these years provides their role model.


    ciforth offers LOCAL as a loadable option.
    They are implemented without regard of efficiency.
    Also they cannot be used recursively.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to minforth on Sat Apr 13 17:27:15 2024
    minforth@gmx.net (minforth) writes:
    mhx wrote:

    The local version loses track of constants.

    Interesting aspect. Within the test words all 3dup variants had been
    inlined, I assume.
    However inlining semanthically identic words with locals is
    "not commutative", to grossly abuse the algebraic expression. Hmm...

    I think mathematicians have a word for what you mean, but it's not "commutative".

    Anyway, it's not specific to locals. Everything that loses
    information will affect everything that comes afterwards. E.g., in
    Gforth we have a literal stack that only represents literals on the
    data stack. So anything that moves values from the data stack (e.g.,
    to the return stack) will lose the information about the constant.

    Gforth does not have automatic inlining, so I use manual inlining here
    (I also added some constant folding optimizations that are not built
    in):

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;

    : compile,-3dup.1 ( xt -- ) drop ]] >r 2dup r@ -rot r> [[ ;
    ' compile,-3dup.1 optimizes 3dup.1

    : compile,-3dup.2 ( xt -- ) drop ]] 2 pick 2 pick 2 pick [[ ;
    ' compile,-3dup.2 optimizes 3dup.2

    : fold3-4 ( xt -- ) 3 ['] 3lits> ['] >3lits ['] >4lits fold-constants ;
    ' fold3-4 optimizes third

    : fold2-4 ( xt -- ) 2 ['] 2lits> ['] >2lits ['] >4lits fold-constants ;
    ' fold2-4 optimizes 2dup

    : foo12 1 2 3 3dup.1 3dup.2 ;
    : foo21 1 2 3 3dup.2 3dup.1 ;

    Looking at the resulting code for FOO12 and FOO21, we see:

    see foo12
    : foo12 #1 #2 #3
    >r 2dup i -rot r> third third third ; ok
    see foo21
    : foo21 #1 #2 #3 #1 #2 #3
    >r 2dup i -rot r> ; ok

    So 3DUP.1 loses information by involving the return stack, while
    3DUP.2 only uses the data stack, and manages to work on the constants
    when it is COMPILE,d (but I had to add an optimization rule for THIRD
    to achieve that; doing the same with 2DUP did not help 3DUP.1,
    though).

    Having a literal-r-stack and a way to track literals through locals
    would make all 3DUP variants work the same way when all parameters are literals.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to Anton Ertl on Sun Apr 14 13:08:25 2024
    In article <2024Apr13.192715@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    minforth@gmx.net (minforth) writes:
    mhx wrote:

    The local version loses track of constants.

    Interesting aspect. Within the test words all 3dup variants had been >>inlined, I assume.
    However inlining semanthically identic words with locals is
    "not commutative", to grossly abuse the algebraic expression. Hmm...

    I think mathematicians have a word for what you mean, but it's not >"commutative".

    Anyway, it's not specific to locals. Everything that loses
    information will affect everything that comes afterwards. E.g., in
    Gforth we have a literal stack that only represents literals on the
    data stack. So anything that moves values from the data stack (e.g.,
    to the return stack) will lose the information about the constant.

    To cite Andrew Tannenbaum:
    "global optimisation and symbolic debugging are each others
    arch enemies"

    - anton
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Rubin@21:1/5 to dxf on Mon Apr 15 12:34:41 2024
    dxf <dxforth@gmail.com> writes:
    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness. It begs the question why use forth at all
    if one is merely going to toy with it.

    If Forth's greatness comes from its extensibility, why resist using that extensibility to have locals? Forth is a philosophy that has many
    attractions, but its stack is an implementation artifact.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Rubin@21:1/5 to dxf on Mon Apr 15 12:41:51 2024
    dxf <dxforth@gmail.com> writes:
    Moore and Fox discussed that. Making a non-optimal approach more
    efficient isn't the answer.

    Moore and Fox made cpu's that saved a lot of hardware by not having
    registers. If you're on a register machine though, why not use it?
    Doing the opposite is far from optimal.

    Locals may not have been practical on Moore's stack machines, but
    neither was ROT, it seems to me. So no locals and not that much
    stackrobatics. He instead used VARIABLEs all over the place, which
    burnt extra storage since they often weren't all alive at the same time,
    plus gave rise to naming troubles and loss of reentrancy, both sort of
    odd for a stack language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)