• Integer roots game

    From minforth@arcor.de@21:1/5 to All on Tue Dec 6 06:40:34 2022
    Inspired by another thread I came up with the iterative code shown below.
    The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;

    CR 20 usqrt
    CR 200 usqrt
    CR 2000 usqrt
    CR 20000 usqrt
    CR 200000 usqrt
    CR 2000000 usqrt
    CR 20000000 usqrt
    CR 200000000 usqrt

    \ results:
    include usqrt.fth
    root:4 iterations: 2
    root:14 iterations: 4
    root:44 iterations: 6
    root:141 iterations: 8
    root:447 iterations: 10
    root:1414 iterations: 12
    root:4472 iterations: 14
    root:14142 iterations: 16 ok

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to minf...@arcor.de on Tue Dec 6 07:52:06 2022
    On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
    Inspired by another thread I came up with the iterative code shown below.
    The challenge: find a shorter AND faster version of USQRT!

    Done:
    4 3
    14 4
    44 6
    141 8
    447 9
    1414 11
    4472 13
    14142 14

    : sqrt-rem ( n -- sqrt rem)
    >r 0 1 begin dup r@ > except 4 * repeat
    begin \ find a power of 4 greater than TORS
    dup 1 > \ compute while greater than unity
    while
    2/ 2/ swap over over + negate r@ + \ integer divide by 4
    dup 0< if drop 2/ else rdrop >r 2/ over + then swap
    repeat drop r> ( sqrt rem)
    ;

    BTW - less iterations doesn't always mean "faster". I got another pretty fast one, but it always requires 31 iterations on a 64 bit architecture.

    Hans Bezemer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to the.bee...@gmail.com on Tue Dec 6 08:28:20 2022
    the.bee...@gmail.com schrieb am Dienstag, 6. Dezember 2022 um 16:52:08 UTC+1:
    On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
    Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
    Done:
    4 3
    14 4
    44 6
    141 8
    447 9
    1414 11
    4472 13
    14142 14

    : sqrt-rem ( n -- sqrt rem)
    r 0 1 begin dup r@ > except 4 * repeat
    begin \ find a power of 4 greater than TORS
    dup 1 > \ compute while greater than unity
    while
    2/ 2/ swap over over + negate r@ + \ integer divide by 4
    dup 0< if drop 2/ else rdrop >r 2/ over + then swap
    repeat drop r> ( sqrt rem)
    ;

    BTW - less iterations doesn't always mean "faster". I got another pretty fast one, but it always requires 31 iterations on a 64 bit architecture.

    Nice, although it has more loops and more operations in the main loop,
    but which doesn't spoil the cake.
    BTW what is that non-standard word: EXCEPT (guessing: 0= WHILE) ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to minf...@arcor.de on Tue Dec 6 08:37:05 2022
    On Tuesday, December 6, 2022 at 5:28:21 PM UTC+1, minf...@arcor.de wrote:
    BTW what is that non-standard word: EXCEPT (guessing: 0= WHILE) ?
    You guessed correctly, Sir. I always hated "0= WHILE" and "0= IF", so I made myself a few alternatives.

    Hans Bezemer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to minf...@arcor.de on Tue Dec 6 08:25:09 2022
    On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
    Inspired by another thread I came up with the iterative code shown below.
    The challenge: find a shorter AND faster version of USQRT!
    I'll give you that: it is FAST! However, this one is even faster (on my system):

    : usqrt
    >r 0 r@ 2/ ( ct x0 R: u)
    begin
    r@ over / over + 2/ swap over over < ( ct x1 x0 f)
    while
    drop swap 1+ swap
    repeat rdrop nip nip
    ;
    time pp4th -x usqrt.4th

    Yours:
    real 0m0,762s
    user 0m0,761s
    sys 0m0,000s

    Mine:
    real 0m0,682s
    user 0m0,682s
    sys 0m0,000s

    Hans Bezemer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to the.bee...@gmail.com on Tue Dec 6 08:46:23 2022
    the.bee...@gmail.com schrieb am Dienstag, 6. Dezember 2022 um 17:25:11 UTC+1:
    On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
    Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
    I'll give you that: it is FAST! However, this one is even faster (on my system):

    : usqrt
    r 0 r@ 2/ ( ct x0 R: u)
    begin
    r@ over / over + 2/ swap over over < ( ct x1 x0 f)
    while
    drop swap 1+ swap
    repeat rdrop nip nip
    ;
    time pp4th -x usqrt.4th

    Yours:
    real 0m0,762s
    user 0m0,761s
    sys 0m0,000s

    Mine:
    real 0m0,682s
    user 0m0,682s
    sys 0m0,000s

    Even nicer, also faster here (I have implemented only TOS-caching, but not register locals).
    BTW over over is 2dup. For perfectionists: u2/ u/ and u< could double the range.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to minf...@arcor.de on Wed Dec 7 03:53:27 2022
    On 7/12/2022 3:28 am, minf...@arcor.de wrote:
    ...
    Nice, although it has more loops and more operations in the main loop,

    This one duplicates your logic.

    : SQRT ( +n -- +root )
    0 over 2/ >r
    BEGIN
    over r@ / r@ + 2/
    dup r@ <
    WHILE
    rdrop >r 1+
    REPEAT drop
    ." root:" r> . ." iterations: " . drop ;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to dxforth on Tue Dec 6 09:24:39 2022
    On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
    Nice, although it has more loops and more operations in the main loop,
    This one duplicates your logic.
    That one shaves another 8/100 of a second from my version. ;-)

    Hans Bezemer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to Anton Ertl on Tue Dec 6 18:38:41 2022
    In article <2022Dec6.175959@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    "minf...@arcor.de" <minforth@arcor.de> writes:
    Inspired by another thread I came up with the iterative code shown below. >>The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;

    This is the specialization of the Newton-Raphson method for the square
    root, see ><https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>

    For the initial value I would use

    1 bits/cell u clz - 2/ lshift

    where clz ( x -- u ) is the count-leading-zeros function that many >architectures have built-in. Newton-Raphson profits a lot from a good >initial value.

    Where `clz is available you should use it. In this context I consider it
    as cheating.
    If clz is not available in hardware, it takes up as much iterations
    (or more) than you save on the Newton iteration.


    - anton

    Groetjes Albert
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to minf...@arcor.de on Tue Dec 6 18:35:20 2022
    In article <3c089374-8f18-4d6b-a31e-7322ed66b9f9n@googlegroups.com>, minf...@arcor.de <minforth@arcor.de> wrote:
    Inspired by another thread I came up with the iterative code shown below.
    The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;

    CR 20 usqrt
    CR 200 usqrt
    CR 2000 usqrt
    CR 20000 usqrt
    CR 200000 usqrt
    CR 2000000 usqrt
    CR 20000000 usqrt
    CR 200000000 usqrt

    \ results:
    include usqrt.fth
    root:4 iterations: 2
    root:14 iterations: 4
    root:44 iterations: 6
    root:141 iterations: 8
    root:447 iterations: 10
    root:1414 iterations: 12
    root:4472 iterations: 14
    root:14142 iterations: 16 ok
    My take on this.

    Newton is only crazy fast if the start value is good.
    The following takes 10 iterations away:

    \ For N return FLOOR of the square root of n.
    : SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
    BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
    NIP REPEAT DROP RDROP ;

    S[ ] OK -1 1 RSHIFT
    S[ 9223372036854775807 ] OK SQRT
    S[ 3037000499 ] OK

    (that is MAX-INT , if you had not noticed.

    Groetjes Albert
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to minf...@arcor.de on Tue Dec 6 16:59:59 2022
    "minf...@arcor.de" <minforth@arcor.de> writes:
    Inspired by another thread I came up with the iterative code shown below.
    The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;

    This is the specialization of the Newton-Raphson method for the square
    root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>

    For the initial value I would use

    1 bits/cell u clz - 2/ lshift

    where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good
    initial value.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to none albert on Tue Dec 6 10:01:50 2022
    On Tuesday, December 6, 2022 at 6:35:24 PM UTC+1, none albert wrote:
    [..]
    Newton is only crazy fast if the start value is good.
    The following takes 10 iterations away:

    \ For N return FLOOR of the square root of n.
    : SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
    BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
    NIP REPEAT DROP RDROP ;

    S[ ] OK -1 1 RSHIFT
    S[ 9223372036854775807 ] OK SQRT
    S[ 3037000499 ] OK

    (that is MAX-INT , if you had not noticed.
    [..]

    FORTH> : SQRT DUP >R 10 RSHIFT 1024 MAX BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE NIP REPEAT DROP -R ;
    Redefining `SQRT` ok
    FORTH> : test timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
    FORTH> test 3037000499000000000 53.520 seconds elapsed. ok
    FORTH> forget sqrt : test2 timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
    FORTH> test2 3037000499000000000 0.435 seconds elapsed. ok

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to Anton Ertl on Tue Dec 6 10:05:12 2022
    Anton Ertl schrieb am Dienstag, 6. Dezember 2022 um 18:21:48 UTC+1:
    "minf...@arcor.de" <minf...@arcor.de> writes:
    Inspired by another thread I came up with the iterative code shown below. >The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;
    This is the specialization of the Newton-Raphson method for the square
    root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>


    Correct. Another algorithm would include tabulation and interpolation. Probably faster
    but eats more space.

    If anyone is interested, there is a practical story behind the game: flow measurement
    through orifices where dp behaves about quadratic with the flow. For instance the basic
    principles are explained here: https://www.efunda.com/formulae/fluids/calc_orifice_flowmeter.cfm

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to Marcel Hendrix on Tue Dec 6 13:05:10 2022
    Marcel Hendrix schrieb am Dienstag, 6. Dezember 2022 um 19:01:52 UTC+1:
    On Tuesday, December 6, 2022 at 6:35:24 PM UTC+1, none albert wrote:
    [..]
    Newton is only crazy fast if the start value is good.
    The following takes 10 iterations away:

    \ For N return FLOOR of the square root of n.
    : SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
    BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
    NIP REPEAT DROP RDROP ;

    S[ ] OK -1 1 RSHIFT
    S[ 9223372036854775807 ] OK SQRT
    S[ 3037000499 ] OK

    (that is MAX-INT , if you had not noticed.
    [..]

    FORTH> : SQRT DUP >R 10 RSHIFT 1024 MAX BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE NIP REPEAT DROP -R ;
    Redefining `SQRT` ok
    FORTH> : test timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
    FORTH> test 3037000499000000000 53.520 seconds elapsed. ok
    FORTH> forget sqrt : test2 timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
    FORTH> test2 3037000499000000000 0.435 seconds elapsed. ok


    Cute. But all those algorithms use slow division in the loop.

    Here is a version ( in pseudo-Forth ;-) } using only shifts and additions:
    : USQRT {: u | T H B S -- uroot32 :}
    0 to H $8000 to B 15 to S
    BEGIN
    H 2* B + S lshift to T
    T u < IF
    H B + to H
    u T - to u
    THEN
    S 1- to S
    B 2/ to B
    B 0=
    UNTIL
    ." root:" H . ;

    BTW (not shown above) for syntactic sugar I regularly use
    := += -/ *= /=
    with locals. Makes code more readable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to albert@cherry. on Tue Dec 6 22:01:25 2022
    albert@cherry.(none) (albert) writes:
    In article <2022Dec6.175959@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    "minf...@arcor.de" <minforth@arcor.de> writes:
    Inspired by another thread I came up with the iterative code shown below. >>>The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;

    This is the specialization of the Newton-Raphson method for the square >>root, see >><https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>

    For the initial value I would use

    1 bits/cell u clz - 2/ lshift

    where clz ( x -- u ) is the count-leading-zeros function that many >>architectures have built-in. Newton-Raphson profits a lot from a good >>initial value.

    Where `clz is available you should use it. In this context I consider it
    as cheating.
    If clz is not available in hardware, it takes up as much iterations
    (or more) than you save on the Newton iteration.

    Here's a LOG2 definition (originally by Wil Baden <wilbadenE5oFJA.57w@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):

    : log2 ( n -- log2 )
    -1 swap ( log2 n)
    dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
    dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
    dup $000000000000ff00 and if swap 8 + swap 8 rshift then
    dup $00000000000000f0 and if swap 4 + swap 4 rshift then
    dup $000000000000000c and if swap 2 + swap 2 rshift then
    dup $0000000000000002 and if swap 1 + swap 1 rshift then
    + ;

    With this function the initial value becomes:

    1 u log2 2/ lshift

    It will be interesting to see at what values the crossover is between saving iterations for Newton, and the additional cost of log2.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Hans Bezemer on Wed Dec 7 11:09:21 2022
    On 7/12/2022 4:24 am, Hans Bezemer wrote:
    On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
    Nice, although it has more loops and more operations in the main loop,
    This one duplicates your logic.
    That one shaves another 8/100 of a second from my version. ;-)

    Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to dxforth on Tue Dec 6 16:49:47 2022
    dxforth schrieb am Mittwoch, 7. Dezember 2022 um 01:09:23 UTC+1:
    On 7/12/2022 4:24 am, Hans Bezemer wrote:
    On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
    Nice, although it has more loops and more operations in the main loop,
    This one duplicates your logic.
    That one shaves another 8/100 of a second from my version. ;-)
    Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.

    Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
    is that he was born before Charles Moore .. about 2000 years ... :o)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to minf...@arcor.de on Wed Dec 7 12:22:10 2022
    On 7/12/2022 11:49 am, minf...@arcor.de wrote:
    dxforth schrieb am Mittwoch, 7. Dezember 2022 um 01:09:23 UTC+1:
    On 7/12/2022 4:24 am, Hans Bezemer wrote:
    On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
    Nice, although it has more loops and more operations in the main loop, >>>> This one duplicates your logic.
    That one shaves another 8/100 of a second from my version. ;-)
    Locals are supposed to reduce 'stack juggling' and simplify code but in this >> instance it was the juggling between variables that proved extraneous.

    Lol!! Tell that Heron of Alexandria, it is his root-finding method.

    Maybe but it was your implementation. So did the non-locals version
    prove to be "shorter AND faster"? That was your challenge - no?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to dxforth on Wed Dec 7 01:20:44 2022
    dxforth schrieb am Mittwoch, 7. Dezember 2022 um 02:22:16 UTC+1:
    On 7/12/2022 11:49 am, minf...@arcor.de wrote:
    dxforth schrieb am Mittwoch, 7. Dezember 2022 um 01:09:23 UTC+1:
    On 7/12/2022 4:24 am, Hans Bezemer wrote:
    On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
    Nice, although it has more loops and more operations in the main loop, >>>> This one duplicates your logic.
    That one shaves another 8/100 of a second from my version. ;-)
    Locals are supposed to reduce 'stack juggling' and simplify code but in this
    instance it was the juggling between variables that proved extraneous.

    Lol!! Tell that Heron of Alexandria, it is his root-finding method.
    Maybe but it was your implementation. So did the non-locals version
    prove to be "shorter AND faster"? That was your challenge - no?

    Admitted that, I won't yawn at that old "shun locals" talk again. If the proposal
    just duplicates Heron's logic, it is just equivalent IMO. If speed depends on the
    used compiler, it is not better really - lxf seems even capable to generate the same
    code for both versions.

    OTOH Marcel, Albert and Anton accelerated the simple algorithm by better start values.

    Yesterday I found a different algorithm in the net that avoids slow divisions and should
    therefore be even faster. The ARM code at the end of the document is a gem: http://www.azillionmonkeys.com/qed/ulerysqroot.pdf

    I didn't do timings but the other code that I presented in the thread was a linear translation
    of the C version to Forth. I checked it with gforth.

    P.S. pray forgive that while doing this I did not kick out those nasty locals. ;o)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to minf...@arcor.de on Wed Dec 7 11:08:37 2022
    "minf...@arcor.de" <minforth@arcor.de> writes:
    Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
    is that he was born before Charles Moore .. about 2000 years ... :o)

    But the question is how Heron described it. Modern descriptions of
    course use medern notation and terminology, but the original notation
    and terminology is often quite different.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to minf...@arcor.de on Wed Dec 7 21:47:28 2022
    On 7/12/2022 8:20 pm, minf...@arcor.de wrote:
    dxforth schrieb am Mittwoch, 7. Dezember 2022 um 02:22:16 UTC+1:
    On 7/12/2022 11:49 am, minf...@arcor.de wrote:
    dxforth schrieb am Mittwoch, 7. Dezember 2022 um 01:09:23 UTC+1:
    On 7/12/2022 4:24 am, Hans Bezemer wrote:
    On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
    Nice, although it has more loops and more operations in the main loop, >>>>>> This one duplicates your logic.
    That one shaves another 8/100 of a second from my version. ;-)
    Locals are supposed to reduce 'stack juggling' and simplify code but in this
    instance it was the juggling between variables that proved extraneous.

    Lol!! Tell that Heron of Alexandria, it is his root-finding method.
    Maybe but it was your implementation. So did the non-locals version
    prove to be "shorter AND faster"? That was your challenge - no?

    Admitted that, I won't yawn at that old "shun locals" talk again. If the proposal
    just duplicates Heron's logic, it is just equivalent IMO. If speed depends on the
    used compiler, it is not better really -

    If the improvement is consistent across compilers, how is it not better?
    You can say you don't care - but that's a different argument.

    lxf seems even capable to generate the same
    code for both versions.

    According to Jeff Fox, Moore considered optimizing locals a backwards solution to the problem of the user failing to optimize the stack.

    https://groups.google.com/g/comp.lang.forth/c/liQl9h9OzgY/m/PfO98kkYEXYJ

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to Anton Ertl on Wed Dec 7 04:07:16 2022
    Anton Ertl schrieb am Mittwoch, 7. Dezember 2022 um 12:13:57 UTC+1:
    "minf...@arcor.de" <minf...@arcor.de> writes:
    Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
    is that he was born before Charles Moore .. about 2000 years ... :o)
    But the question is how Heron described it. Modern descriptions of
    course use medern notation and terminology, but the original notation
    and terminology is often quite different.

    To my amateurish knowledge, mathematics at that time (Pythagoreans put aside) was mostly influenced by physics (i.e. mechanics, astronomy, etc.) and geometry (i.e. in architecture).

    Therefore I guess that Heron's recursion (probably invented much earlier in Egypt)
    had been described as algorithmic recipe using geometric elements and short text
    annotations in Greek.

    Certainly from our modern notations a form using variables comes closer to the original
    than stack machine code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gerry Jackson@21:1/5 to minf...@arcor.de on Wed Dec 7 11:36:21 2022
    On 06/12/2022 14:40, minf...@arcor.de wrote:
    Inspired by another thread I came up with the iterative code shown below.
    The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;

    CR 20 usqrt
    CR 200 usqrt
    CR 2000 usqrt
    CR 20000 usqrt
    CR 200000 usqrt
    CR 2000000 usqrt
    CR 20000000 usqrt
    CR 200000000 usqrt

    \ results:
    include usqrt.fth
    root:4 iterations: 2
    root:14 iterations: 4
    root:44 iterations: 6
    root:141 iterations: 8
    root:447 iterations: 10
    root:1414 iterations: 12
    root:4472 iterations: 14
    root:14142 iterations: 16 ok

    In ANS/2012 Forth, solutions that use 2/ on an unsigned number with the
    ms bit set won't work for unsigned square root as 2/ leaves the ms bit unchanged.

    Here's a recursive solution I wrote a few months ago based on: https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations

    I've put stack comments in so it can be related to the original in
    Wikipedia.

    : usqrt ( u -- u1 )
    dup 2 u< if exit then
    dup >r 2 rshift recurse
    2* ( -- u sm )
    1+ dup ( -- u la la )
    dup * ( -- u la la^2 )
    r> u> if 1- then ( -- u1 )
    ;

    Worked OK for me, I wasn't concerned about speed so don't know how it
    compares in that respect. Incidentally this provides an example of using
    the stack vs locals.

    --
    Gerry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to Gerry Jackson on Wed Dec 7 04:13:44 2022
    Gerry Jackson schrieb am Mittwoch, 7. Dezember 2022 um 12:36:24 UTC+1:
    Here's a recursive solution I wrote a few months ago based on: https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations

    I've put stack comments in so it can be related to the original in
    Wikipedia.

    : usqrt ( u -- u1 )
    dup 2 u< if exit then
    dup >r 2 rshift recurse
    2* ( -- u sm )
    1+ dup ( -- u la la )
    dup * ( -- u la la^2 )
    if 1- then ( -- u1 )
    ;

    Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using
    the stack vs locals.

    Cool ! I really like this one ! :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to the.bee...@gmail.com on Wed Dec 7 05:01:04 2022
    the.bee...@gmail.com schrieb am Mittwoch, 7. Dezember 2022 um 13:33:02 UTC+1:
    On Wednesday, December 7, 2022 at 10:20:46 AM UTC+1, minf...@arcor.de wrote:
    Yesterday I found a different algorithm in the net that avoids slow divisions and should
    therefore be even faster. The ARM code at the end of the document is a gem: http://www.azillionmonkeys.com/qed/ulerysqroot.pdf
    Quick hack:

    cell-bits 2 / 1- constant (bshift)

    : sqrt
    r (bshift) 1 over lshift >r 0 ( bs nh R: v b )
    begin
    tuck 1 lshift r@ + over lshift ( nh bs t R: v b)
    over r@ > ( nh bs t b f R: v)
    if nip rot else swap negate r> + >r rot over + then ( bs b nh R: v)
    rot 1- rot 1 rshift >r swap r@ 0= ( bs nh R: v b)
    until rdrop rdrop nip
    ;

    Works on both 32- and 64-bit. Isn't surprisingly fast. As a matter of fact: I've seen implementations of binary algorithms that were faster.

    Yes, it is a mixed thing in Forth. Lots of time spent in bringing arguments to the top.
    The ultra-compact ARM code used registers.

    Readability: algorithm and stack movements are like mashed carrots and potatoes, difficult to separate.
    In my Forth

    : USQRT32 {: u | T H B:$8000 S:15 == H :}
    BEGIN
    H 2* B + S lshift := T
    T u < IF B += H T -= u THEN
    1 -= S B 2/ := B
    B ~UNTIL ;

    CR 200_000_000 USQRT32 .( 200mil-root: ) .

    Yes, this is slower and thus a game loser. ;-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to minf...@arcor.de on Wed Dec 7 04:33:00 2022
    On Wednesday, December 7, 2022 at 10:20:46 AM UTC+1, minf...@arcor.de wrote:
    Yesterday I found a different algorithm in the net that avoids slow divisions and should
    therefore be even faster. The ARM code at the end of the document is a gem: http://www.azillionmonkeys.com/qed/ulerysqroot.pdf
    Quick hack:

    cell-bits 2 / 1- constant (bshift)

    : sqrt
    >r (bshift) 1 over lshift >r 0 ( bs nh R: v b )
    begin
    tuck 1 lshift r@ + over lshift ( nh bs t R: v b)
    r> over r@ > ( nh bs t b f R: v)
    if nip rot else swap negate r> + >r rot over + then ( bs b nh R: v)
    rot 1- rot 1 rshift >r swap r@ 0= ( bs nh R: v b)
    until rdrop rdrop nip
    ;

    Works on both 32- and 64-bit. Isn't surprisingly fast. As a matter of fact: I've seen implementations of binary algorithms that were faster.

    Hans Bezemer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to Gerry Jackson on Wed Dec 7 05:44:58 2022
    On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
    Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using
    the stack vs locals.
    The algorithm that started this discussion did 1M iterations in about 0.76s. Yours is 0.2s faster on my combo - even with all the function call overhead.

    HB

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to minf...@arcor.de on Wed Dec 7 05:24:49 2022
    On Wednesday, December 7, 2022 at 2:01:06 PM UTC+1, minf...@arcor.de wrote:
    : USQRT32 {: u | T H B:$8000 S:15 == H :}
    Oops - 32-bit only! It's actually two loops - one to figure out the first candidate for a binary square
    and the second one to figure out the square itself. That's where S and B come from - if you offer
    MAX-N the first possible candidate (in 32 bits = SQRT(MAX-) ~ 2^15). So S and B are related.

    Therefore, you can derive B from S at the start of the program - and may be one could eradicate
    a lot of these vars from the algorithm - with some speed penalty of course. But I haven't delved
    that deep (yet).

    S seems to be equal to half the number of bits of MAX-N, rounded down. On this basis you can
    calculate the number of bits for S for a specific architecture.

    BTW, this might be an interesting one for this discussion. https://sourceforge.net/p/forth-4th/wiki/Inside%20Zen%20FSQRT/

    It explains how (on 32 bit) an FSQRT is made by:
    1. Cranking up the mantissa to the max;
    2. Halving the exponent;
    3. Calculating a good approximation now the (integer) mantissa is close to MAX-N.

    In 32-bit it required about 3-4 integer iterations and 3-4 floating point iterations - if I remember
    correctly.

    HB

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gerry Jackson@21:1/5 to Hans Bezemer on Wed Dec 7 15:30:16 2022
    On 07/12/2022 13:44, Hans Bezemer wrote:
    On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
    Worked OK for me, I wasn't concerned about speed so don't know how it
    compares in that respect. Incidentally this provides an example of using
    the stack vs locals.
    The algorithm that started this discussion did 1M iterations in about 0.76s. Yours is 0.2s faster on my combo - even with all the function call overhead.

    HB

    I've just realised it can be made slightly quicker using the well formed
    flag trick in the last line to eliminate the IF ... THEN

    : usqrt ( u -- u1 )
    dup 2 u< if exit then
    dup >r 2 rshift recurse
    2* ( -- u sm )
    1+ dup ( -- u la la )
    dup * ( -- u la la^2 )
    \ r> u> if 1- then ( -- u1 )
    r> u> + ( -- u1 )
    ;

    I believe 4th uses 1 for TRUE so the last line for 4th would be:
    r> u> -

    I'm a bit dubious about whether a marginal saving like that is worthwhile.


    --
    Gerry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to Gerry Jackson on Wed Dec 7 10:12:09 2022
    Gerry Jackson schrieb am Mittwoch, 7. Dezember 2022 um 16:30:20 UTC+1:
    On 07/12/2022 13:44, Hans Bezemer wrote:
    On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
    Worked OK for me, I wasn't concerned about speed so don't know how it
    compares in that respect. Incidentally this provides an example of using >> the stack vs locals.
    The algorithm that started this discussion did 1M iterations in about 0.76s.
    Yours is 0.2s faster on my combo - even with all the function call overhead.

    HB
    I've just realised it can be made slightly quicker using the well formed
    flag trick in the last line to eliminate the IF ... THEN
    : usqrt ( u -- u1 )
    dup 2 u< if exit then
    dup >r 2 rshift recurse
    2* ( -- u sm )
    1+ dup ( -- u la la )
    dup * ( -- u la la^2 )
    \ r> u> if 1- then ( -- u1 )
    + ( -- u1 )
    ;

    I believe 4th uses 1 for TRUE so the last line for 4th would be:
    -

    I'm a bit dubious about whether a marginal saving like that is worthwhile.

    \ with microscopic change & tested
    : USQRT
    dup 2 u< IF EXIT THEN
    dup >r 2 rshift recurse
    2* dup 1+ dup * r> u< - ;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gerry Jackson@21:1/5 to minf...@arcor.de on Wed Dec 7 21:07:28 2022
    On 07/12/2022 18:12, minf...@arcor.de wrote:
    Gerry Jackson schrieb am Mittwoch, 7. Dezember 2022 um 16:30:20 UTC+1:
    On 07/12/2022 13:44, Hans Bezemer wrote:
    On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote: >>>> Worked OK for me, I wasn't concerned about speed so don't know how it
    compares in that respect. Incidentally this provides an example of using >>>> the stack vs locals.
    The algorithm that started this discussion did 1M iterations in about 0.76s.
    Yours is 0.2s faster on my combo - even with all the function call overhead.

    HB
    I've just realised it can be made slightly quicker using the well formed
    flag trick in the last line to eliminate the IF ... THEN
    : usqrt ( u -- u1 )
    dup 2 u< if exit then
    dup >r 2 rshift recurse
    2* ( -- u sm )
    1+ dup ( -- u la la )
    dup * ( -- u la la^2 )
    \ r> u> if 1- then ( -- u1 )
    + ( -- u1 )
    ;

    I believe 4th uses 1 for TRUE so the last line for 4th would be:
    -

    I'm a bit dubious about whether a marginal saving like that is worthwhile.

    \ with microscopic change & tested
    : USQRT
    dup 2 u< IF EXIT THEN
    dup >r 2 rshift recurse
    2* dup 1+ dup * r> u< - ;


    Sorry but that doesn't seem to work for all cases e.g. using your USQRT

    25 usqrt . \ displays 4

    Using this test program on GForth
    : x ( n -- ) \ get square roots from 0 to n-1
    -1 swap 0
    do
    i usqrt 2dup <
    if cr i 3 .r ." : " nip dup then \ Display next square number
    . \ Followed by square roots of integers up to next square
    loop drop
    ;

    \ With my USQRT
    101 x
    0: 0
    1: 1 1 1
    4: 2 2 2 2 2
    9: 3 3 3 3 3 3 3
    16: 4 4 4 4 4 4 4 4 4
    25: 5 5 5 5 5 5 5 5 5 5 5
    36: 6 6 6 6 6 6 6 6 6 6 6 6 6
    49: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
    64: 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
    81: 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
    100: 10 ok

    Whereas using your version we get

    101 x
    0: 0
    1: 1 1 1
    4: 2 2 2 2 2 2
    10: 3 3 3 3 3 3
    16: 4 4 4 4 4 4 4 4 4 4
    26: 5 5 5 5 5 5 5 5 5 5 5 5 5 5
    40: 6 6 6 6 6 6 6 6 6 6
    50: 7 7 7 7 7 7 7 7 7 7 7 7 7 7
    64: 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
    82: 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ok

    I'm tired and it's too late to start thinking about it tonight.

    --
    Gerry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to dxforth on Wed Dec 7 22:06:01 2022
    dxforth <dxforth@gmail.com> writes:
    Locals are supposed to reduce 'stack juggling' and simplify code but in this >instance it was the juggling between variables that proved extraneous.

    I have collected some of the versions posted here into

    http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th

    I then created some versions from it that have the stack effect ( +n1
    -- n2 ) and that do not count the iterations:

    \ derived from original locals version
    : USQRT1 {: u | x0 x1 -- uroot ; positive integer square root :}
    u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    REPEAT
    x0 ;

    \ derived from dxforths locals-less version
    : USQRT4 ( +n -- +root )
    dup 2/ >r
    BEGIN ( +n r:x0 )
    dup r@ / r@ + 2/ ( +n x1 R:x0 )
    dup r@ <
    WHILE
    r> drop >r
    REPEAT
    2drop r> ;

    \ An attempt to write code for VFX's current limitations: Try to do
    \ everything on the data stack with only one stack item at basic block
    \ bounaries. Keep u on the return stack, so we only do a read from
    \ there. Unfortunately, the data stack depth changes here.
    : usqrt9 ( u -- root )
    dup >r 2/ begin ( x0 R:u )
    r@ over / over + 2/ ( x0 x1 R:u )
    2dup > while
    nip repeat
    r> 2drop ;

    \ Maybe we can do better by not changing the stack depth; however, we
    \ use 2 stack items in the basic blocks in the loop.
    : usqrta ( u -- root )
    dup >r 2/ r@ begin ( x0 u R:u )
    over / over + 2/ ( x0 x1 R:u )
    2dup > while
    nip r@ repeat
    r> 2drop ;

    Let's first look at the inner loops as produced by VFX64:

    usqrt1 usqrt4 usqrt9 usqrta
    MOV RAX, [RDI+10] MOV RCX, [RSP] MOV RDX, [RSP] MOV RAX, RBX
    CQO MOV RAX, RBX MOV RAX, RDX CQO
    IDIV QWord [RDI+-08] CQO CQO IDIV QWord [RBP] ADD RAX, [RDI+-08] IDIV RCX IDIV RBX ADD RAX, [RBP] SAR RAX, # 1 MOV RDX, [RSP] ADD RAX, RBX SAR RAX, # 1
    MOV [RDI+-10], RAX ADD RDX, RAX SAR RAX, # 1 CMP RAX, [RBP] MOV RDX, [RDI+-10] SAR RDX, # 1 CMP RBX, RAX MOV RBX, RAX
    CMP RDX, [RDI+-08] MOV RCX, [RSP] LEA RBP, [RBP+-08] JNL 004E410D
    JNL 004E3F00 CMP RDX, RCX MOV [RBP], RBX MOV RDX, [RSP] MOV RDX, [RDI+-10] LEA RBP, [RBP+-08] MOV RBX, RAX MOV [RBP], RBX MOV [RDI+-08], RDX MOV [RBP], RBX JLE/N004E4090 MOV RBX, RDX
    JMP 004E3ED3 MOV RBX, RDX LEA RBP, [RBP+08] JMP 004E40E3
    JNL 004E4028 JMP 004E4064
    LEA RSP, [RSP+08]
    PUSH RBX
    MOV RBX, [RBP]
    LEA RBP, [RBP+08]
    JMP 004E3FEA

    Here the locals code looks quite good, but it has a lot of memory
    accesses. 9 and a look indeed better than 4. Let's see how they
    perform:

    for i in 1 4 9 a; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
    187.7 257.1 191.3 179.7 [h0:~/pub/forth/programs:86099] for i in 1 4 9 a; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'

    1 4 9 a
    187.7 257.1 191.3 179.7 instructions per invocation (average from u=2..1e7) 165.9 141.2 136.3 161.9 cycles per invocation on Rocket Lake
    169.5 105.0 156.8 163.0 cycles per invocation on Zen3

    The Zen3 result is quite surprising: while usqrt4 executes the most instructions (as expected), it takes far fewer cycles on Zen3 than all
    the other versions.

    The Rocket Lake result is not as great for usqrt4, and usqrt9 beats it
    by a little.

    I currently have no explanation why the cycles results are like this.

    - anton









































    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Gerry Jackson on Thu Dec 8 11:09:55 2022
    On 8/12/2022 2:30 am, Gerry Jackson wrote:
    ...
    I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
    ...
    I'm a bit dubious about whether a marginal saving like that is worthwhile.

    Especially when an optimizing compiler is compelled to generate the 'well-formed'
    flag. I don't think it was ever a given that 'calculate' was better than 'test and branch'. Each situation according to its merits.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Anton Ertl on Thu Dec 8 12:24:48 2022
    On 8/12/2022 9:06 am, Anton Ertl wrote:
    dxforth <dxforth@gmail.com> writes:
    Locals are supposed to reduce 'stack juggling' and simplify code but in this >> instance it was the juggling between variables that proved extraneous.

    I have collected some of the versions posted here into

    http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th

    I then created some versions from it that have the stack effect ( +n1
    -- n2 ) and that do not count the iterations:

    ...

    \ derived from dxforths locals-less version
    : USQRT4 ( +n -- +root )
    dup 2/ >r
    BEGIN ( +n r:x0 )
    dup r@ / r@ + 2/ ( +n x1 R:x0 )
    dup r@ <
    WHILE
    r> drop >r
    REPEAT
    2drop r> ;

    With no count to maintain, it can be simplified to:

    : USQRT4 ( +n -- +root )
    dup 2/ dup
    BEGIN drop
    2dup / over + 2/
    2dup >
    WHILE
    swap
    REPEAT
    rot 2drop ;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From SpainHackForth@21:1/5 to dxforth on Thu Dec 8 02:09:39 2022
    On Thursday, December 8, 2022 at 2:24:55 AM UTC+1, dxforth wrote:
    On 8/12/2022 9:06 am, Anton Ertl wrote:
    dxforth <dxf...@gmail.com> writes:
    Locals are supposed to reduce 'stack juggling' and simplify code but in this
    instance it was the juggling between variables that proved extraneous.

    I have collected some of the versions posted here into

    http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th

    I then created some versions from it that have the stack effect ( +n1
    -- n2 ) and that do not count the iterations:

    ...

    \ derived from dxforths locals-less version
    : USQRT4 ( +n -- +root )
    dup 2/ >r
    BEGIN ( +n r:x0 )
    dup r@ / r@ + 2/ ( +n x1 R:x0 )
    dup r@ <
    WHILE
    drop >r
    REPEAT
    2drop r> ;
    With no count to maintain, it can be simplified to:
    : USQRT4 ( +n -- +root )
    dup 2/ dup
    BEGIN drop
    2dup / over + 2/
    2dup >
    WHILE
    swap
    REPEAT
    rot 2drop ;

    I wrote a assembler word for Mecrisp passed on this article… I think it was number number 4… I tried number 9 and it was not very precise… https://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Heinrich Hohl@21:1/5 to Anton Ertl on Thu Dec 8 02:12:34 2022
    On Tuesday, December 6, 2022 at 11:14:13 PM UTC+1, Anton Ertl wrote:
    Here's a LOG2 definition (originally by Wil Baden
    <wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):

    : log2 ( n -- log2 )
    -1 swap ( log2 n)
    dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
    dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
    dup $000000000000ff00 and if swap 8 + swap 8 rshift then
    dup $00000000000000f0 and if swap 4 + swap 4 rshift then
    dup $000000000000000c and if swap 2 + swap 2 rshift then
    dup $0000000000000002 and if swap 1 + swap 1 rshift then
    + ;

    Hello Anton,

    did Wil Baden explain in detail how the algorithm works?

    The routine works properly and the code is clear, but the algorithm
    behind it is not easy to understand. I would be grateful if you have
    any reference with more information on this Log2 algorithm.

    Henry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to Heinrich Hohl on Thu Dec 8 09:12:19 2022
    On Thursday, December 8, 2022 at 11:12:36 AM UTC+1, Heinrich Hohl wrote:
    did Wil Baden explain in detail how the algorithm works?

    The routine works properly and the code is clear, but the algorithm
    behind it is not easy to understand. I would be grateful if you have
    any reference with more information on this Log2 algorithm.

    Maybe this helps:
    decimal $ffffffff00000000 2 base ! u. 1111111111111111111111111111111100000000000000000000000000000000 ok
    decimal $00000000ffff0000 2 base ! u. 11111111111111110000000000000000 ok decimal $000000000000ff00 2 base ! u. 1111111100000000 ok
    decimal $00000000000000f0 2 base ! u. 11110000 ok
    decimal $000000000000000c 2 base ! u. 1100 ok
    decimal $0000000000000002 2 base ! u. 10

    If the first one checks out (AND returns a non-zero value) we know it is in the top 32-bits.
    So we update out count (it's bits 0 - 63, so we start off with -1) by adding 32 to it, shift
    the whole shebang 32 bits to the right and test the first half of the last 32 bits.

    If AND had returned zero, it's not in the first 32 bits - so we leave it be and continue with the
    first half of the 32 bits. Same thing here - if zero, we leave it be, if not we add 16 to the count
    and shift the whole darn thing 16 bits to the right - and repeat the entire procedure for the
    first half of the 16 bits (which is 8 bits).

    The last bit left is either 0 or 1. So we add it to our count. Let's assume we did "*0001" then
    that bit is added to -1 - and it returns that bit zero is set. If we did "*0000" then zero is added
    to -1 - signalling that NO bit is set at all.

    It's quite simple and elegant when you think of it.

    Hans Bezemer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From P Falth@21:1/5 to dxforth on Thu Dec 8 09:54:14 2022
    On Thursday, 8 December 2022 at 01:10:06 UTC+1, dxforth wrote:
    On 8/12/2022 2:30 am, Gerry Jackson wrote:
    ...
    I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
    ...
    I'm a bit dubious about whether a marginal saving like that is worthwhile.
    Especially when an optimizing compiler is compelled to generate the 'well-formed'
    flag. I don't think it was ever a given that 'calculate' was better than 'test
    and branch'. Each situation according to its merits.

    This is exactly what happens with my codegenerator. The "optimized" version is 5% larger (but just 3 bytes) and more surprisingly have 30% longer runtime! .

    Gerry's original recursive version is anyway 5 times faster then your version!

    Could be the division is slow on my processor, an E5-4657L v2

    BR
    Peter

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Heinrich Hohl@21:1/5 to the.bee...@gmail.com on Thu Dec 8 09:44:34 2022
    On Thursday, December 8, 2022 at 6:12:21 PM UTC+1, the.bee...@gmail.com wrote:
    It's quite simple and elegant when you think of it.

    Hans Bezemer

    Hello Hans,

    thank you for the explanations. Yes, the algorithm is tricky but also elegant. This is basically a binary search for the topmost nonzero bit.

    I had a look at "Bit Twiddling Hacks" and found similar algorithms. https://graphics.stanford.edu/~seander/bithacks.html

    "Count the consecutive zero bits (trailing) on the right by binary search"
    is almost the same algorithm, except that it searches for the lowermost nonzero bit.

    Henry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Heinrich Hohl on Thu Dec 8 17:32:24 2022
    Heinrich Hohl <hheinrich.hohl@gmail.com> writes:
    On Tuesday, December 6, 2022 at 11:14:13 PM UTC+1, Anton Ertl wrote:
    Here's a LOG2 definition (originally by Wil Baden
    <wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):

    : log2 ( n -- log2 )
    -1 swap ( log2 n)
    dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
    dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
    dup $000000000000ff00 and if swap 8 + swap 8 rshift then
    dup $00000000000000f0 and if swap 4 + swap 4 rshift then
    dup $000000000000000c and if swap 2 + swap 2 rshift then
    dup $0000000000000002 and if swap 1 + swap 1 rshift then
    + ;

    Hello Anton,

    did Wil Baden explain in detail how the algorithm works?

    I only have found it in a reply by me, and what I kept from his
    posting does not include an explanation. You could take the
    Message-Id <wilbadenE5oFJA.57w@netcom.com> and look it up.
    al.howardknight.net does not have it (probably too old); maybe Google
    Groups can find it (but the last few times I tried to find Message-Ids
    with Google Groups, I was unsuccessful).

    The routine works properly and the code is clear, but the algorithm
    behind it is not easy to understand. I would be grateful if you have
    any reference with more information on this Log2 algorithm.

    Hans Bezemer explains it.

    A similar way to do it, but without ifs (derived from the C code in
    Gforth):

    : log2a ( n -- log2 )
    dup 0= swap ( log2' n )
    dup $ffffffff u> 32 and rot over + -rot rshift
    dup $0000ffff u> 16 and rot over + -rot rshift
    dup $000000ff u> 8 and rot over + -rot rshift
    dup $0000000f u> 4 and rot over + -rot rshift
    dup $00000003 u> 2 and rot over + -rot rshift
    $00000001 u> 1 and + ;

    \ checked against Gforth's builtin log2
    : check
    100000000 0 do i log2 i log2a <> if cr i hex. i log2 . i log2a . then loop ;

    VFX64 translates this into 47 instructions (163 bytes) without a
    single memory reference.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to dxforth on Thu Dec 8 18:41:36 2022
    dxforth <dxforth@gmail.com> writes:
    With no count to maintain, it can be simplified to:

    : USQRT4 ( +n -- +root )
    dup 2/ dup
    BEGIN drop
    2dup / over + 2/
    2dup >
    WHILE
    swap
    REPEAT
    rot 2drop ;

    The version above is called USQRTB (or B) below. It inspired me to:

    : usqrtc ( u -- root )
    dup 2/ dup
    begin ( u x0 x1 )
    nip ( u x1 )
    2dup / over + 2/ ( u x1 x2 )
    2dup <= until
    drop nip ;

    The idea here is to never move a value on the critical path to memory
    in VFX (i.e., the critical path value should be on the TOS at the
    basic block boundary).

    Here are the numbers, I'll show the code in a later posting:

    for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null
    for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'

    1 4 9 a b c
    187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
    166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake
    169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3

    The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
    174 cycles, but here I show a good result.

    Anyway, the idea behind USQRTC seems to work well: It wins in the
    number of instructions, and the number of cycles on both CPUs.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From P Falth@21:1/5 to Anton Ertl on Thu Dec 8 12:04:25 2022
    On Thursday, 8 December 2022 at 19:52:02 UTC+1, Anton Ertl wrote:
    dxforth <dxf...@gmail.com> writes:
    With no count to maintain, it can be simplified to:

    : USQRT4 ( +n -- +root )
    dup 2/ dup
    BEGIN drop
    2dup / over + 2/
    2dup >
    WHILE
    swap
    REPEAT
    rot 2drop ;
    The version above is called USQRTB (or B) below. It inspired me to:

    : usqrtc ( u -- root )
    dup 2/ dup
    begin ( u x0 x1 )
    nip ( u x1 )
    2dup / over + 2/ ( u x1 x2 )
    2dup <= until
    drop nip ;

    The idea here is to never move a value on the critical path to memory
    in VFX (i.e., the critical path value should be on the TOS at the
    basic block boundary).

    Here are the numbers, I'll show the code in a later posting:

    for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null
    for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'

    1 4 9 a b c
    187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
    166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake 169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3

    The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
    174 cycles, but here I show a good result.

    Anyway, the idea behind USQRTC seems to work well: It wins in the
    number of instructions, and the number of cycles on both CPUs.

    But its name is wrong. It is not unsigned. It requires a signed positive number as seen also by the original stack comment .

    try -1 usqrtc

    BR
    Peter

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to P Falth on Thu Dec 8 12:35:37 2022
    P Falth schrieb am Donnerstag, 8. Dezember 2022 um 18:54:15 UTC+1:
    On Thursday, 8 December 2022 at 01:10:06 UTC+1, dxforth wrote:
    On 8/12/2022 2:30 am, Gerry Jackson wrote:
    ...
    I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
    ...
    I'm a bit dubious about whether a marginal saving like that is worthwhile.
    Especially when an optimizing compiler is compelled to generate the 'well-formed'
    flag. I don't think it was ever a given that 'calculate' was better than 'test
    and branch'. Each situation according to its merits.
    This is exactly what happens with my codegenerator. The "optimized" version is
    5% larger (but just 3 bytes) and more surprisingly have 30% longer runtime! .

    Gerry's original recursive version is anyway 5 times faster then your version!

    Could be the division is slow on my processor, an E5-4657L v2

    IMO 5 times speed difference is too much to be explained by a single division. Memory and CPU cache effects seem more plausible. Recursion retakes
    items from the stack that is probably still cached, because those items were stored only recently there in a preceding cycle.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to All on Thu Dec 8 13:22:28 2022
    On Thursday, December 8, 2022 at 7:52:02 PM UTC+1, Anton Ertl wrote:
    [..]
    FORTH> in ( #100,000,000 iterations )
    303700049900000000 7.753 seconds elapsed. minforth
    303700049900000000 7.039 seconds elapsed. dxforth #1
    303700049900000000 4.424 seconds elapsed. Albert
    303700049900000000 0.681 seconds elapsed. Anton #1
    303700049900000000 6.229 seconds elapsed. Gerry
    303700049900000000 6.540 seconds elapsed. Anton #2
    303700049900000000 6.208 seconds elapsed. Anton #3
    303700049900000000 6.554 seconds elapsed. dxforth #2
    303700049900000000 6.497 seconds elapsed. Anton #3 ok
    303700049900000000 0.036 seconds elapsed. iForth ok
    FORTH> .TICKER-INFO
    AMD Ryzen 7 5800X 8-Core Processor, TICKS-GET uses iTSC at 4192MHz

    Locals cost something, but a smart algorithm obviously wins.

    The last result is because some inlined words evaluate their arguments.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to P Falth on Thu Dec 8 22:26:50 2022
    P Falth <peter.m.falth@gmail.com> writes:
    But its name is wrong. It is not unsigned. It requires a signed positive number
    as seen also by the original stack comment .

    Yes, that can be fixed with

    : usqrtd ( u -- root )
    dup 1 rshift dup
    begin ( u x0 x1 )
    nip ( u x1 )
    2dup u/ over + 1 rshift ( u x1 x2 )
    2dup u<= until
    drop nip ;

    but not every Forth system has U/ and U<= (and VFX64 has U/, but does
    not inline it).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Thu Dec 8 22:33:15 2022
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    dxforth <dxforth@gmail.com> writes:
    With no count to maintain, it can be simplified to:

    : USQRT4 ( +n -- +root )
    dup 2/ dup
    BEGIN drop
    2dup / over + 2/
    2dup >
    WHILE
    swap
    REPEAT
    rot 2drop ;

    The version above is called USQRTB (or B) below. It inspired me to:

    : usqrtc ( u -- root )
    dup 2/ dup
    begin ( u x0 x1 )
    nip ( u x1 )
    2dup / over + 2/ ( u x1 x2 )
    2dup <= until
    drop nip ;

    The idea here is to never move a value on the critical path to memory
    in VFX (i.e., the critical path value should be on the TOS at the
    basic block boundary).

    Here are the numbers, I'll show the code in a later posting:

    for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null
    for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'

    1 4 9 a b c
    187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
    166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake >169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3

    The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
    174 cycles, but here I show a good result.

    Anyway, the idea behind USQRTC seems to work well: It wins in the
    number of instructions, and the number of cycles on both CPUs.

    Here is the promised code for the inner loop on VFX64:

    usqrt4 usqrt9 usqrtb usqrtc
    MOV RCX, [RSP] MOV RDX, [RSP] MOV RAX, [RBP+08] MOV RAX, [RBP+08] MOV RAX, RBX MOV RAX, RDX CQO CQO
    CQO CQO IDIV QWord [RBP] IDIV RBX
    IDIV RCX IDIV RBX ADD RAX, [RBP] ADD RAX, RBX
    MOV RDX, [RSP] ADD RAX, RBX SAR RAX, # 1 SAR RAX, # 1
    ADD RDX, RAX SAR RAX, # 1 CMP RAX, [RBP] CMP RBX, RAX
    SAR RDX, # 1 CMP RBX, RAX MOV RBX, RAX MOV [RBP], RBX
    MOV RCX, [RSP] LEA RBP, [RBP+-08] JNL 004E418D MOV RBX, RAX
    CMP RDX, RCX MOV [RBP], RBX MOV RDX, RBX JNLE 004E41E2
    LEA RBP, [RBP+-08] MOV RBX, RAX MOV RBX, [RBP]
    MOV [RBP], RBX JLE/N004E4090 MOV [RBP], RDX
    MOV RBX, RDX LEA RBP, [RBP+08] JMP 004E4162
    JNL 004E4028 JMP 004E4064
    LEA RSP, [RSP+08]
    PUSH RBX
    MOV RBX, [RBP]
    LEA RBP, [RBP+08]
    JMP 004E3FEA

    For space reasons I left USQRT1 and USQRTA (relatively slow versions
    away here, you can look at them in an earlier posting.

    The memory reference in USQRTC are for different memory locations, so
    one is only read (the value U), and another is only written (to a
    different location, which is then not used in the next iteration; it's
    only relevant when leaving the loop.

    USQRT9 also seems to have this property, yet is signifiantly slower;
    the reasons for that are unclear to me.

    USQRT4 PUSHes a value (x1) to the return stack towards the end of an
    iteration, and loads it (as x0) at the start of the next iteration.
    This does not seem to hurt much, though, and I don't know why.

    USQRTB moves a value to data stack RAM in the penultimate instruction
    of the iteration and loads it again in the IDIV instruction.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to Marcel Hendrix on Thu Dec 8 14:21:37 2022
    Marcel Hendrix schrieb am Donnerstag, 8. Dezember 2022 um 22:22:30 UTC+1:
    On Thursday, December 8, 2022 at 7:52:02 PM UTC+1, Anton Ertl wrote:
    [..]
    FORTH> in ( #100,000,000 iterations )
    303700049900000000 7.753 seconds elapsed. minforth
    303700049900000000 7.039 seconds elapsed. dxforth #1
    303700049900000000 4.424 seconds elapsed. Albert
    303700049900000000 0.681 seconds elapsed. Anton #1
    303700049900000000 6.229 seconds elapsed. Gerry
    303700049900000000 6.540 seconds elapsed. Anton #2
    303700049900000000 6.208 seconds elapsed. Anton #3
    303700049900000000 6.554 seconds elapsed. dxforth #2
    303700049900000000 6.497 seconds elapsed. Anton #3 ok
    303700049900000000 0.036 seconds elapsed. iForth ok
    FORTH> .TICKER-INFO
    AMD Ryzen 7 5800X 8-Core Processor, TICKS-GET uses iTSC at 4192MHz

    Locals cost something, but a smart algorithm obviously wins.

    I think many Forths do some form of register caching for top stack element(s). But keeping locals in registers seems still unpopular in Forth. This could explain
    those many memory accesses by VFX Forth for the locals version, and the incurred speed penalty.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Anton Ertl on Fri Dec 9 11:09:43 2022
    On 9/12/2022 5:41 am, Anton Ertl wrote:
    dxforth <dxforth@gmail.com> writes:
    With no count to maintain, it can be simplified to:

    : USQRT4 ( +n -- +root )
    dup 2/ dup
    BEGIN drop
    2dup / over + 2/
    2dup >
    WHILE
    swap
    REPEAT
    rot 2drop ;

    The version above is called USQRTB (or B) below. It inspired me to:

    : usqrtc ( u -- root )
    dup 2/ dup
    begin ( u x0 x1 )
    nip ( u x1 )
    2dup / over + 2/ ( u x1 x2 )
    2dup <= until
    drop nip ;

    One less branch. Neat.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Fri Dec 9 09:58:27 2022
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    dxforth <dxforth@gmail.com> writes:
    Locals are supposed to reduce 'stack juggling' and simplify code but in this >>instance it was the juggling between variables that proved extraneous.

    I have collected some of the versions posted here into

    http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th

    After establishing that USQRTD has the fastest loop, I now add the
    better initial value. I benchmark albert@cherry.(none)'s cheap, but
    less accurate initial value as well as my more precise, but more
    expensive initial value. The values tested are for 2..10^7 (like in
    the other tests).

    First, the results:

    c d 5 e
    141.1 104.5 77.0 63.8 instructions/invocation
    136.3 46.7 46.0 45.9 cycles/invocation Rocket Lake
    95.8 40.0 32.5 32.5 cycles/invocation Zen3

    c uses the original initial value: u 2/
    d uses my log2-based initial value (see below)
    5 and e use albert@cherry.(none)'s initial value;
    5 is albert@cherry.(none)'s code
    e uses his initial value with the rest from c.

    Concerning usqrtd, I could not use exactly, what I outlined earlier, but
    had to adapt it slightly to guarantee an initial value that's larger
    than the root (required by the terminating condition of the loop):

    : log2 ( n -- log2 )
    dup 0= swap ( log2' n )
    dup $ffffffff u> 32 and rot over + -rot rshift
    dup $0000ffff u> 16 and rot over + -rot rshift
    dup $000000ff u> 8 and rot over + -rot rshift
    dup $0000000f u> 4 and rot over + -rot rshift
    dup $00000003 u> 2 and rot over + -rot rshift
    $00000001 u> 1 and + ;

    : usqrtd ( u -- root )
    2 over log2 2/ lshift
    dup begin ( u x0 x1 )
    nip ( u x1 )
    2dup / over + 2/ ( u x1 x2 )
    2dup <= until
    drop nip ;

    : uSQRT5 DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
    BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
    NIP REPEAT DROP R> DROP ;

    : usqrte ( u -- root )
    dup 10 rshift 1024 max
    dup begin ( u x0 x1 )
    nip ( u x1 )
    2dup / over + 2/ ( u x1 x2 )
    2dup <= until
    drop nip ;

    It's obvious that, for the range of numbers benchmarked, the increased precision of USQRTD does not compensate for its increased cost. I
    also benchmarked with u=1000000000002..1000010000000, with the
    following results:

    c d 5 e
    226.9 102.7 201.4 149.9 instructions/invocation
    270.7 45.3 142.2 142.6 cycles/invocation Rocket Lake
    170.5 39.5 102.1 101.8 cycles/invocation Zen3

    So for larger u, the higher precision of the log2 approach pays off.
    OTOH, if you knew that u is in that range, you could use the following
    cheap initial value:

    DUP 20 RSHIFT $100000 MAX

    and win over SQRTD again.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Anton Ertl on Fri Dec 9 10:09:48 2022
    On Friday, December 9, 2022 at 11:33:55 AM UTC+1, Anton Ertl wrote:
    Concerning usqrtd, I could not use exactly, what I outlined earlier, but
    had to adapt it slightly to guarantee an initial value that's larger
    than the root (required by the terminating condition of the loop):

    iForth has the log2 method implemented as a table (it is very
    frequently used in my type of applications). For your version of
    the benchmark implementation, I didn't want to mention that
    I had to use "log2 1+".

    Therefore the sqrt with log2 initial value wins by several
    streetlengths. (Although the SQRT based on FSQRT might
    still be faster).

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Fri Dec 9 18:36:07 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    iForth has the log2 method implemented as a table (it is very
    frequently used in my type of applications).

    Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
    BMI1 or ABM extensions are present) that allow a fast implementation
    of log2. In (development) Gforth it's

    55CD95A2B62A: test r8,r8
    55CD95A2B62D: jz $55CD95A2B645
    55CD95A2B62F: bsr r8,r8
    55CD95A2B633: xor r8,$3F
    55CD95A2B637: movsx rax,r8
    55CD95A2B63A: mov r8d,$3F
    55CD95A2B640: sub r8,rax
    55CD95A2B643: jmp $55CD95A2B64C
    55CD95A2B645: mov r8,$FFFFFFFF
    55CD95A2B64C: add r13,$08
    55CD95A2B650: mov rcx,-$08[r13]
    55CD95A2B654: jmp ecx

    but I can see several ways to improve on this gcc output (bsr itself
    is almost there).

    Given that Gforth has LOG2 and iforth has LOG2, there is a slight
    chance of standardization.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From lehs@21:1/5 to All on Fri Dec 9 14:11:49 2022
    tisdag 6 december 2022 kl. 18:21:48 UTC+1 skrev Anton Ertl:
    "minf...@arcor.de" <minf...@arcor.de> writes:
    Inspired by another thread I came up with the iterative code shown below. >The challenge: find a shorter AND faster version of USQRT!
    Have fun!
    ( please don't short-circuit through FSQRT or such ;-)

    \ code:
    : USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
    0 to ct u 2/ to x0
    BEGIN u x0 / x0 + 2/ to x1
    x1 x0 <
    WHILE x1 to x0
    ct 1+ to ct
    REPEAT
    ." root:" x0 . ." iterations: " ct . ;
    This is the specialization of the Newton-Raphson method for the square
    root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>

    For the initial value I would use

    1 bits/cell u clz - 2/ lshift

    where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good
    initial value.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    In android gforth lg2 is slightly faster than log2, especially for smaller u. The idea of lg2 is to asume that u is exponential distributed rather than rectangular.

    : lg2 \ u -- n
    locals| u |
    0 -1
    begin 2* dup u and
    while 1 under+
    repeat drop ;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Anton Ertl on Fri Dec 9 14:12:43 2022
    On Friday, December 9, 2022 at 7:48:45 PM UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    iForth has the log2 method implemented as a table (it is very
    frequently used in my type of applications).

    Sorry, I mixed log2 up with 2^x ( u -- 2^u ) ...

    Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
    BMI1 or ABM extensions are present) that allow a fast implementation
    of log2.

    FORTH> : test 7 log2 . ; ok
    FORTH> see test
    Flags: ANSI
    $0133DC40 : test
    $0133DC4A mov rbx, 7 d#
    $0133DC51 mov rax, $FFFFFFFF:FFFFFFFF q#
    $0133DC5B bsr rax, rbx
    $0133DC5F mov rbx, rax
    $0133DC62 push rbx
    $0133DC63 jmp .+10 ( $0124A102 ) offset NEAR
    $0133DC68 ;

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Fri Dec 9 22:24:29 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    On Friday, December 9, 2022 at 7:48:45 PM UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    iForth has the log2 method implemented as a table (it is very
    frequently used in my type of applications).

    Sorry, I mixed log2 up with 2^x ( u -- 2^u ) ...

    Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
    BMI1 or ABM extensions are present) that allow a fast implementation
    of log2.

    FORTH> : test 7 log2 . ; ok
    FORTH> see test
    Flags: ANSI
    $0133DC40 : test
    $0133DC4A mov rbx, 7 d#
    $0133DC51 mov rax, $FFFFFFFF:FFFFFFFF q#
    $0133DC5B bsr rax, rbx
    $0133DC5F mov rbx, rax
    $0133DC62 push rbx
    $0133DC63 jmp .+10 ( $0124A102 ) offset NEAR
    $0133DC68 ;

    bsr's result is not defined if the source=0, so I would write
    something like

    bsr dest, source
    bne nonzero
    mov dest, -1
    nonzero:
    # do something with dest

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Anton Ertl on Fri Dec 9 15:32:46 2022
    On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
    [..]
    bsr's result is not defined if the source=0, so I would write
    something like

    bsr dest, source
    bne nonzero
    mov dest, -1
    nonzero:
    # do something with dest

    The log2 of 0 is undefined anyway, so I prefer to let the user
    of log2 decide how it should be handled...

    Or do you mean that bsr with 0 does not work the same on all
    architectures/chip generations?

    FORTH> : test 0 log2 . ; ok
    FORTH> see test ( on AMD )
    Flags: ANSI
    $0133DC40 : test
    $0133DC4A xor rbx, rbx
    $0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
    $0133DC57 bsr rax, rbx
    $0133DC5B mov rbx, rax
    $0133DC5E push rbx
    $0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
    FORTH> test -1 ok

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Marcel Hendrix on Sat Dec 10 14:05:05 2022
    On 10/12/2022 10:32 am, Marcel Hendrix wrote:
    On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
    [..]
    bsr's result is not defined if the source=0, so I would write
    something like

    bsr dest, source
    bne nonzero
    mov dest, -1
    nonzero:
    # do something with dest

    The log2 of 0 is undefined anyway, so I prefer to let the user
    of log2 decide how it should be handled...

    The implementer you mean? Several options spring to mind:

    a) THROW - clumsy but gets user's attention; always works
    b) 0 - can be useful
    c) -1 - throwing a spanner in the works in the hope user notices

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Sat Dec 10 15:37:23 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
    [..]
    bsr's result is not defined if the source=0, so I would write
    something like

    bsr dest, source
    bne nonzero
    mov dest, -1
    nonzero:
    # do something with dest

    The log2 of 0 is undefined anyway, so I prefer to let the user
    of log2 decide how it should be handled...

    What happens in programming is that some user happens to write a
    program that just uses the result as-is, and happens to work as
    intended even for input 0. And if you want to provide a high-quality experience, your programming system should better be consistent in
    what it returns in that case.

    Gforth returns -1 for 0 LOG2.

    One benefit of returning -1 is that you can get a count of leading
    zeros with

    ( x ) bits/cell 1- swap log2 -

    Or do you mean that bsr with 0 does not work the same on all >architectures/chip generations?

    Intel certainly reserves the right to do it that way by saying that
    the result is undefined in that case. However, for the same reason as discussed above, they (and AMD) better stick with the behaviour of
    their original implementation, or their shiny new processor might get
    a reputation for incompatibility (yes, Intel may blame the
    application, but the customer does not care who is to blame).

    So they might just as well document the actual behaviour.

    FORTH> : test 0 log2 . ; ok
    FORTH> see test ( on AMD )
    Flags: ANSI
    $0133DC40 : test
    $0133DC4A xor rbx, rbx
    $0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
    $0133DC57 bsr rax, rbx
    $0133DC5B mov rbx, rax
    $0133DC5E push rbx
    $0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
    FORTH> test -1 ok

    It seems that on your hardware (Zen3 IIRC) bsr leaves the destination
    register unchanged when the source is 0, and iForth is one application
    that relies on this behaviour in order to return -1 (otherwise the
    instruction at $0133DC4D would be superfluous); I just tried in on a
    Skylake, and unsurprisingly it behaves the same way. If you don't
    want to rely on Intel and AMD sticking to the same behaviour, you can
    change the central instructions as shown above or as

    mov rax, -1
    bsr rbx,rbx
    cmove rbx, rax

    Interstingly, AMD has added an instruction lzcnt with the ABM
    extension since the K10 (2007), and Intel has picked it up in the BMI1 extension since the Haswell (2013). lzcnt delivers a complementary
    result (e.g., all-bits-set produces 0, and 0 produces 64), but in
    particular it is also defined for 0. It is faster than bsr on the AMD
    CPUs listed on https://gmplib.org/~tege/x86-timing.pdf (e.g., 1 cycle
    latency on Zen2 and 4 lzcnt/cycle throughput, vs. 4 cycle latency for
    bsr and 1 bsr every 4 cycles throughput. On the Intel CPUs listed
    they both have 3 cycles latency, and 1 instruction/cycle throughput.
    I wonder what makes bsr so hard to implement on the AMD CPUs that they
    added lzcnt.

    Of course for log2 you would need something like:

    lzcnt rax, rbx
    mov rbx, 63
    sub rbx, rax

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From P Falth@21:1/5 to Anton Ertl on Sat Dec 10 14:37:33 2022
    On Saturday, 10 December 2022 at 17:49:17 UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
    [..]
    bsr's result is not defined if the source=0, so I would write
    something like

    bsr dest, source
    bne nonzero
    mov dest, -1
    nonzero:
    # do something with dest

    The log2 of 0 is undefined anyway, so I prefer to let the user
    of log2 decide how it should be handled...
    What happens in programming is that some user happens to write a
    program that just uses the result as-is, and happens to work as
    intended even for input 0. And if you want to provide a high-quality experience, your programming system should better be consistent in
    what it returns in that case.

    Gforth returns -1 for 0 LOG2.

    One benefit of returning -1 is that you can get a count of leading
    zeros with

    ( x ) bits/cell 1- swap log2 -
    Or do you mean that bsr with 0 does not work the same on all >architectures/chip generations?
    Intel certainly reserves the right to do it that way by saying that
    the result is undefined in that case. However, for the same reason as discussed above, they (and AMD) better stick with the behaviour of
    their original implementation, or their shiny new processor might get
    a reputation for incompatibility (yes, Intel may blame the
    application, but the customer does not care who is to blame).

    So they might just as well document the actual behaviour.
    FORTH> : test 0 log2 . ; ok
    FORTH> see test ( on AMD )
    Flags: ANSI
    $0133DC40 : test
    $0133DC4A xor rbx, rbx
    $0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
    $0133DC57 bsr rax, rbx
    $0133DC5B mov rbx, rax
    $0133DC5E push rbx
    $0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
    FORTH> test -1 ok
    It seems that on your hardware (Zen3 IIRC) bsr leaves the destination register unchanged when the source is 0, and iForth is one application
    that relies on this behaviour in order to return -1 (otherwise the instruction at $0133DC4D would be superfluous); I just tried in on a
    Skylake, and unsurprisingly it behaves the same way. If you don't
    want to rely on Intel and AMD sticking to the same behaviour, you can
    change the central instructions as shown above or as

    mov rax, -1
    bsr rbx,rbx
    cmove rbx, rax

    Interstingly, AMD has added an instruction lzcnt with the ABM
    extension since the K10 (2007), and Intel has picked it up in the BMI1 extension since the Haswell (2013). lzcnt delivers a complementary
    result (e.g., all-bits-set produces 0, and 0 produces 64), but in
    particular it is also defined for 0. It is faster than bsr on the AMD
    CPUs listed on https://gmplib.org/~tege/x86-timing.pdf (e.g., 1 cycle
    latency on Zen2 and 4 lzcnt/cycle throughput, vs. 4 cycle latency for
    bsr and 1 bsr every 4 cycles throughput. On the Intel CPUs listed
    they both have 3 cycles latency, and 1 instruction/cycle throughput.
    I wonder what makes bsr so hard to implement on the AMD CPUs that they
    added lzcnt.

    Of course for log2 you would need something like:

    lzcnt rax, rbx
    mov rbx, 63
    sub rbx, rax

    My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode. But it does not flag an error if executed, instead it executes bsr! and give a wrong
    result. This is also stated in the Intel manual.

    BR
    Peter

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Anton Ertl on Sun Dec 11 10:56:48 2022
    On 11/12/2022 2:37 am, Anton Ertl wrote:

    Gforth returns -1 for 0 LOG2.

    One benefit of returning -1 is that you can get a count of leading
    zeros with

    ( x ) bits/cell 1- swap log2 -

    OTOH simplest implementations return 0:

    : LOG2 ( u1 -- u2 ) \ Rick C.
    0 begin swap u2/ dup while swap 1+ repeat drop ;

    code LOG2 ( u1 -- u2 )
    bx pop ax ax xor 1 $: bx shr 2 $ jz ax inc 1 $ ju 2 $: 1push end-code

    To return something else requires it be special-cased - which can be
    left to the application.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Heinrich Hohl@21:1/5 to dxforth on Sun Dec 11 00:34:13 2022
    On Sunday, December 11, 2022 at 12:56:52 AM UTC+1, dxforth wrote:
    OTOH simplest implementations return 0:

    : LOG2 ( u1 -- u2 ) \ Rick C.
    0 begin swap u2/ dup while swap 1+ repeat drop ;
    To return something else requires it be special-cased - which can be
    left to the application.

    You can rewrite this as:

    : LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;

    Same number of operations, but u1 = 0 returns u2 = -1.

    For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
    is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
    As Anton stated, -1 has certain advantages. It also has the advantage that we can use
    0< to check whether the output is correct or illegal.

    u2 = 0 should allow us to conclude that u1 = 1.

    Henry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to P Falth on Sun Dec 11 09:28:07 2022
    P Falth <peter.m.falth@gmail.com> writes:
    My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
    But it does not flag an error if executed, instead it executes bsr! and give a wrong
    result. This is also stated in the Intel manual.

    Interesting. Looking at the encoding, lzcount is encoded as repne
    bsr, and the repne prefix normally is ignored except for the string instructions. Apparently nobody encoded bsr with a repne prefix,
    otherwise they could not have encoded lzcnt in this way.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From lehs@21:1/5 to All on Sun Dec 11 01:37:54 2022
    I can't prove it, but it seems that beginning with a value greater than the square root and terminate when Xn - Xn+1 < 2 always gives the correct floor function SQRTF.

    : sqrtf \ u -- n
    locals| u |
    u sqrtc~
    begin dup u over / + 2/
    tuck - 2 <
    until ;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to All on Sun Dec 11 02:38:59 2022
    Since steam went out of square roots, here's what I have for cube roots.
    Seldom used for volumetric calculations. Optimization ideas?

    : UCBRT64 {: x | Y B -- cbrt ; integer cube root :}
    0 to Y
    63 BEGIN dup 0 >=
    WHILE
    Y 2* to Y
    Y 1+ Y * 3 * 1+ to B
    x over rshift B >= IF
    B over lshift negate x + to x
    Y 1+ to Y
    THEN 3 -
    REPEAT drop
    Y ;

    cr 0 ucbrt64 . .( 0? )
    cr 1 ucbrt64 . .( 1? )
    cr 2 ucbrt64 . .( 1? )
    cr 2 ucbrt64 . .( 1? )
    cr 7 ucbrt64 . .( 1? )
    cr 8 ucbrt64 . .( 2? )
    cr 9 ucbrt64 . .( 2? )
    cr 999 ucbrt64 . .( 9? )
    cr 1000 ucbrt64 . .( 10? )
    cr 1001 ucbrt64 . .( 10? )

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kerr-Mudd, John@21:1/5 to Anton Ertl on Sun Dec 11 11:48:30 2022
    XPost: at.lang.asm

    On Sun, 11 Dec 2022 09:28:07 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    [log2 to give a first estimate for a sqrt program]

    P Falth <peter.m.falth@gmail.com> writes:
    My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
    But it does not flag an error if executed, instead it executes bsr! and give a wrong
    result. This is also stated in the Intel manual.

    Interesting. Looking at the encoding, lzcount is encoded as repne
    bsr, and the repne prefix normally is ignored except for the string instructions. Apparently nobody encoded bsr with a repne prefix,
    otherwise they could not have encoded lzcnt in this way.


    I just tried a leading zero count in asm, trying "repnz shr ax,1"; the assembler (nasm) was OK encoding it but on execution it dropped out
    after one shift, even though the flags register shows 'nz'. This is on a
    20 year old Pentium.
    Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
    so I can't really complain.
    I rarely use rep lods[x] though!

    xpost to ala, as it's more about asm.


    --
    Bah, and indeed Humbug.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kerr-Mudd, John@21:1/5 to John" on Sun Dec 11 11:55:27 2022
    XPost: alt.lang.asm

    On Sun, 11 Dec 2022 11:48:30 +0000
    "Kerr-Mudd, John" <admin@127.0.0.1> wrote:

    On Sun, 11 Dec 2022 09:28:07 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    [log2 to give a first estimate for a sqrt program]

    P Falth <peter.m.falth@gmail.com> writes:
    My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
    But it does not flag an error if executed, instead it executes bsr! and give a wrong
    result. This is also stated in the Intel manual.

    Interesting. Looking at the encoding, lzcount is encoded as repne
    bsr, and the repne prefix normally is ignored except for the string instructions. Apparently nobody encoded bsr with a repne prefix,
    otherwise they could not have encoded lzcnt in this way.


    I just tried a leading zero count in asm, trying "repnz shr ax,1"; the assembler (nasm) was OK encoding it but on execution it dropped out
    after one shift, even though the flags register shows 'nz'. This is on a
    20 year old Pentium.
    Intel's Instruction Reference (ASDM) says rep[] only applies to string ops, so I can't really complain.
    I rarely use rep lods[x] though!

    xpost to ala, as it's more about asm.

    /alt/.lang.asm
    doh!

    --
    Bah, and indeed Humbug.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Heinrich Hohl on Mon Dec 12 13:35:14 2022
    On 11/12/2022 7:34 pm, Heinrich Hohl wrote:
    On Sunday, December 11, 2022 at 12:56:52 AM UTC+1, dxforth wrote:
    OTOH simplest implementations return 0:

    : LOG2 ( u1 -- u2 ) \ Rick C.
    0 begin swap u2/ dup while swap 1+ repeat drop ;
    To return something else requires it be special-cased - which can be
    left to the application.

    You can rewrite this as:

    : LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;

    Same number of operations, but u1 = 0 returns u2 = -1.

    On VFX it's even smaller (12 instr vs. 17) - though this may be a compiler quirk. I doubt I could re-code the 8086 version to give a -1 result in the same number of instructions/bytes let alone make it smaller.

    For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
    is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
    As Anton stated, -1 has certain advantages. It also has the advantage that we can use
    0< to check whether the output is correct or illegal.

    u2 = 0 should allow us to conclude that u1 = 1.

    I would be inclined to test the input rather than the output. Testing
    the output is a good indicator as any that it wasn't 'the best solution'.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Heinrich Hohl@21:1/5 to dxforth on Mon Dec 12 03:04:57 2022
    On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
    I would be inclined to test the input rather than the output. Testing
    the output is a good indicator as any that it wasn't 'the best solution'.

    Absolutely correct. What I want to say is that for u1=0 the output should either be undefined, or it should be u2=-1. Not any other value.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kerr-Mudd, John@21:1/5 to Steve on Mon Dec 12 12:57:36 2022
    XPost: alt.lang.asm

    On Mon, 12 Dec 2022 12:43:34 -0000 (UTC)
    Bogus@Embarq.com (Steve) wrote:

    In <20221211115527.bf15c41eb56b7d462eedf5a0@127.0.0.1>, "Kerr-Mudd, John" <admin@127.0.0.1> writes:
    On Sun, 11 Dec 2022 11:48:30 +0000
    "Kerr-Mudd, John" <admin@127.0.0.1> wrote:

    On Sun, 11 Dec 2022 09:28:07 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    [log2 to give a first estimate for a sqrt program]

    P Falth <peter.m.falth@gmail.com> writes:
    My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
    But it does not flag an error if executed, instead it executes bsr! and give a wrong
    result. This is also stated in the Intel manual.

    Interesting. Looking at the encoding, lzcount is encoded as repne
    bsr, and the repne prefix normally is ignored except for the string
    instructions. Apparently nobody encoded bsr with a repne prefix,
    otherwise they could not have encoded lzcnt in this way.


    I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
    assembler (nasm) was OK encoding it but on execution it dropped out
    after one shift, even though the flags register shows 'nz'. This is on a >> 20 year old Pentium.
    Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
    so I can't really complain.
    I rarely use rep lods[x] though!

    xpost to ala, as it's more about asm.


    Hi,

    It seems strange to use such an encoding?


    Maybe so, but it would've saved a loop.

    This was my poor attempt:

    Log2b: ; ax<-ln2(ax) ; uses cx [14]
    mov cx,16 ; bits per word
    push cx
    repnz shr ax,1
    inc cx ; 1 too many
    pop ax
    sub ax,cx ;
    ; ln(0)=-1 fix
    jnz .ln0nofix
    dec ax
    .ln0nofix:
    ret


    Ok, we can use 'bsr' to find the leading bit these days.

    Log2c: ; ax<-ln2(ax) ; uses bx [7]

    mov bx,-1
    xchg ax,bx
    bsr ax,bx ; cpu 386




    for rep lods, perhaps size coders use
    rep lodsb
    rather than
    mov bx,cx
    mov al,[si+bx]






    --
    Bah, and indeed Humbug.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Steve@21:1/5 to John" on Mon Dec 12 12:43:34 2022
    XPost: alt.lang.asm

    In <20221211115527.bf15c41eb56b7d462eedf5a0@127.0.0.1>, "Kerr-Mudd, John" <admin@127.0.0.1> writes:
    On Sun, 11 Dec 2022 11:48:30 +0000
    "Kerr-Mudd, John" <admin@127.0.0.1> wrote:

    On Sun, 11 Dec 2022 09:28:07 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    [log2 to give a first estimate for a sqrt program]

    P Falth <peter.m.falth@gmail.com> writes:
    My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
    But it does not flag an error if executed, instead it executes bsr! and give a wrong
    result. This is also stated in the Intel manual.

    Interesting. Looking at the encoding, lzcount is encoded as repne
    bsr, and the repne prefix normally is ignored except for the string
    instructions. Apparently nobody encoded bsr with a repne prefix,
    otherwise they could not have encoded lzcnt in this way.


    I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
    assembler (nasm) was OK encoding it but on execution it dropped out
    after one shift, even though the flags register shows 'nz'. This is on a
    20 year old Pentium.
    Intel's Instruction Reference (ASDM) says rep[] only applies to string ops, >> so I can't really complain.
    I rarely use rep lods[x] though!

    xpost to ala, as it's more about asm.

    /alt/.lang.asm
    doh!

    --
    Bah, and indeed Humbug.

    Hi,

    It seems strange to use such an encoding?

    Cheers,

    Steve N,

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From lehs@21:1/5 to All on Mon Dec 12 08:03:19 2022
    söndag 11 december 2022 kl. 10:37:55 UTC+1 skrev lehs:
    I can't prove it, but it seems that beginning with a value greater than the square root and terminate when Xn - Xn+1 < 2 always gives the correct floor function SQRTF.

    : sqrtf \ u -- n
    locals| u |
    u sqrtc~
    begin dup u over / + 2/
    tuck - 2 <
    until ;

    No it don't work, but

    : lgf \ u -- n
    0 swap
    begin 1 rshift dup
    while 1 under+
    repeat drop ;

    : sqrtc~ \ u -- n 2^(1+½lg2 u)
    lgf 2/ 1+ 1 swap lshift ;

    : sqrtf \ u -- n almost as fast as float
    1- locals| u |
    u sqrtc~
    begin dup u over u/ + u2/ tuck u> 0=
    until ;

    false [if]
    : sqrtf \ u -- floor a few procent faster
    0 d>f fsqrt f>d drop ;
    [then]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Heinrich Hohl on Tue Dec 13 14:52:43 2022
    On 12/12/2022 10:04 pm, Heinrich Hohl wrote:
    On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
    I would be inclined to test the input rather than the output. Testing
    the output is a good indicator as any that it wasn't 'the best solution'.

    Absolutely correct. What I want to say is that for u1=0 the output should either be undefined, or it should be u2=-1. Not any other value.

    It should be u2=-1 only until such time as someone comes along and says
    u2=0 is useful. I don't have an integer example but I do have an f/p
    example:

    \ return the decimal exponent of a number
    100e flog . 2 ok
    10e flog . 1 ok
    1e flog . 0 ok
    0e flog . 0 ok
    0.1e flog . -1 ok
    0.01e flog . -2 ok

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to dxforth on Mon Dec 12 23:16:52 2022
    dxforth schrieb am Dienstag, 13. Dezember 2022 um 04:52:47 UTC+1:
    On 12/12/2022 10:04 pm, Heinrich Hohl wrote:
    On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
    I would be inclined to test the input rather than the output. Testing
    the output is a good indicator as any that it wasn't 'the best solution'.

    Absolutely correct. What I want to say is that for u1=0 the output should either be undefined, or it should be u2=-1. Not any other value.
    It should be u2=-1 only until such time as someone comes along and says
    u2=0 is useful. I don't have an integer example but I do have an f/p
    example:

    \ return the decimal exponent of a number
    100e flog . 2 ok
    10e flog . 1 ok
    1e flog . 0 ok
    0e flog . 0 ok
    0.1e flog . -1 ok
    0.01e flog . -2 ok

    0e flog has to be a NAN.
    If any, -INF as approximation, but never zero.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Heinrich Hohl@21:1/5 to All on Tue Dec 13 01:14:22 2022
    On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
    Looks correct at first sight, but at closer inspection we find:

    10^2 = 100
    10^1 = 10
    10^0 = 1
    10^-1 = 0.1
    10^-2 = 0.01

    The result 0 is not part of this series. 0 does not have an exponent.
    For x --> 0, the exponent grows towards - infinity.

    However, in a simple FP package it may be convenient to represent
    f=0.0 as significand = 0 and exponent = 0.

    \ return the decimal exponent of a number
    100e flog . 2 ok
    10e flog . 1 ok
    1e flog . 0 ok
    0e flog . 0 ok
    0.1e flog . -1 ok
    0.01e flog . -2 ok

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to Heinrich Hohl on Tue Dec 13 03:40:19 2022
    Heinrich Hohl schrieb am Dienstag, 13. Dezember 2022 um 10:14:23 UTC+1:
    On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
    Looks correct at first sight, but at closer inspection we find:

    10^2 = 100
    10^1 = 10
    10^0 = 1
    10^-1 = 0.1
    10^-2 = 0.01

    The result 0 is not part of this series. 0 does not have an exponent.
    For x --> 0, the exponent grows towards - infinity.

    However, in a simple FP package it may be convenient to represent
    f=0.0 as significand = 0 and exponent = 0.
    \ return the decimal exponent of a number
    100e flog . 2 ok
    10e flog . 1 ok
    1e flog . 0 ok
    0e flog . 0 ok
    0.1e flog . -1 ok
    0.01e flog . -2 ok

    Before starting with negative 0.0e shouldn't it suffice to say that any similarities between integer and fp logarithm stop for values below 1 ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Heinrich Hohl on Wed Dec 14 12:14:48 2022
    On 13/12/2022 8:14 pm, Heinrich Hohl wrote:
    On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
    Looks correct at first sight, but at closer inspection we find:

    10^2 = 100
    10^1 = 10
    10^0 = 1
    10^-1 = 0.1
    10^-2 = 0.01

    The result 0 is not part of this series. 0 does not have an exponent.
    For x --> 0, the exponent grows towards - infinity.

    However, in a simple FP package it may be convenient to represent
    f=0.0 as significand = 0 and exponent = 0.

    Indeed in every f/p package, since:

    0e fs. 0.E0 ok

    To print the exponent it has to be extracted and FLOG is handy for that.
    While FLOG isn't guaranteed to work on 0e, having it return 0 is better
    than some other value.

    In the case of LOG2 I don't see sufficient argument for it returning -1.
    If least effort and least consequences from misuse be one's aim, then
    returning 0 looks good to me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to minf...@arcor.de on Wed Dec 14 12:22:33 2022
    On 13/12/2022 6:16 pm, minf...@arcor.de wrote:

    0e flog has to be a NAN.
    If any, -INF as approximation, but never zero.

    Only in the cult of IEEE. Mathematics simply calls it 'undefined'.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)