Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Done:
4 3
14 4
44 6
141 8
447 9
1414 11
4472 13
14142 14
: sqrt-rem ( n -- sqrt rem)
>r 0 1 begin dup r@ > except 4 * repeat \ find a power of 4 greater than TORS
begin
dup 1 > \ compute while greater than unity
while
2/ 2/ swap over over + negate r@ + \ integer divide by 4
dup 0< if drop 2/ else rdrop >r 2/ over + then swap
repeat drop r> ( sqrt rem)
;
BTW - fewer iterations doesn't always mean "faster". I got another pretty fast one, but it always requires 31 iterations on a 64-bit architecture.
BTW what is that non-standard word: EXCEPT (guessing: 0= WHILE)?
You guessed correctly, Sir. I always hated "0= WHILE" and "0= IF", so I made myself a few alternatives.
On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
I'll give you that: it is FAST! However, this one is even faster (on my system):
: usqrt
>r 0 r@ 2/ ( ct x0 R: u)
begin
r@ over / over + 2/ swap over over < ( ct x1 x0 f)
while
drop swap 1+ swap
repeat rdrop nip nip
;
time pp4th -x usqrt.4th
Yours:
real 0m0,762s
user 0m0,761s
sys 0m0,000s
Mine:
real 0m0,682s
user 0m0,682s
sys 0m0,000s
...
Nice, although it has more loops and more operations in the main loop, this one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
"minf...@arcor.de" <minforth@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good initial value.
- anton
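As a word, Anton's expression could look like this (a sketch only, not from his post; CLZ ( x -- u ) and BITS/CELL are assumed to exist, neither is a standard word). For u > 0 the result is within a factor of two of the true root, which is the kind of start value Newton-Raphson likes:
\ sketch, assuming CLZ ( x -- u ) and BITS/CELL; u > 0
: x0-estimate ( u -- x0 )  clz bits/cell swap - 2/ 1 swap lshift ;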
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt
\ results:
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
My take on this.
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
Newton is only crazy fast if the start value is good.[..]
The following takes 10 iterations away:
\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP RDROP ;
S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK
(that is MAX-INT, if you had not noticed.)
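Why that start value is safe (my own note, not from the post): it can never fall below the integer root.
\ u <= 2^20 : 1024 MAX supplies 1024 >= floor(sqrt(u))
\ u >  2^20 : u 10 RSHIFT = floor(u/1024) >= floor(sqrt(u)), since sqrt(u) > 1024
9223372036854775807 10 RSHIFT .  \ 9007199254740991, far above the root 3037000499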
"minf...@arcor.de" <minf...@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
On Tuesday, December 6, 2022 at 6:35:24 PM UTC+1, none albert wrote:
[..]
Newton is only crazy fast if the start value is good.
The following takes 10 iterations away:
\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP RDROP ;
S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK
(that is MAX-INT, if you had not noticed.) [..]
FORTH> : SQRT DUP >R 10 RSHIFT 1024 MAX BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE NIP REPEAT DROP -R ;
Redefining `SQRT` ok
FORTH> : test timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test 3037000499000000000 53.520 seconds elapsed. ok
FORTH> forget sqrt : test2 timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test2 3037000499000000000 0.435 seconds elapsed. ok
In article <2022Dec6.175959@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
"minf...@arcor.de" <minforth@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good initial value.
Where clz is available you should use it. In this context I consider it as cheating.
If clz is not available in hardware, it takes as many iterations (or more) as you save on the Newton iteration.
On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
Nice, although it has more loops and more operations in the main loop, this one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
On 7/12/2022 4:24 am, Hans Bezemer wrote:
That one shaves another 8/100 of a second from my version. ;-)
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
dxforth wrote on Wednesday, 7 December 2022 at 01:09:23 UTC+1:
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
Lol!! Tell that Heron of Alexandria, it is his root-finding method.
On 7/12/2022 11:49 am, minf...@arcor.de wrote:
Lol!! Tell that Heron of Alexandria, it is his root-finding method.
Maybe but it was your implementation. So did the non-locals version prove to be "shorter AND faster"? That was your challenge - no?
Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
is that he was born before Charles Moore .. about 2000 years ... :o)
dxforth wrote on Wednesday, 7 December 2022 at 02:22:16 UTC+1:
Maybe but it was your implementation. So did the non-locals version prove to be "shorter AND faster"? That was your challenge - no?
Admitted that, I won't yawn at that old "shun locals" talk again. If the proposal just duplicates Heron's logic, it is just equivalent IMO. If speed depends on the compiler used, it is not really better - lxf even seems capable of generating the same code for both versions.
"minf...@arcor.de" <minf...@arcor.de> writes:
Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback is that he was born before Charles Moore .. about 2000 years ... :o)
But the question is how Heron described it. Modern descriptions of course use modern notation and terminology, but the original notation and terminology is often quite different.
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt
\ results:
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
Here's a recursive solution I wrote a few months ago based on: https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations
I've put stack comments in so it can be related to the original in
Wikipedia.
: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
r> u> if 1- then ( -- u1 )
;
Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using
the stack vs locals.
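For reference, a quick interactive check against the values from the first post (my own test; any standard system should do):
20 usqrt .         \ 4
2000000 usqrt .    \ 1414
200000000 usqrt .  \ 14142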
On Wednesday, December 7, 2022 at 10:20:46 AM UTC+1, minf...@arcor.de wrote:
Yesterday I found a different algorithm in the net that avoids slow divisions and should therefore be even faster. The ARM code at the end of the document is a gem: http://www.azillionmonkeys.com/qed/ulerysqroot.pdf
Quick hack:
cell-bits 2 / 1- constant (bshift)
: sqrt
>r (bshift) 1 over lshift >r 0 ( bs nh R: v b )
begin
tuck 1 lshift r@ + over lshift ( nh bs t R: v b)
r> over r@ > ( nh bs t b f R: v)
if nip rot else swap negate r> + >r rot over + then ( bs b nh R: v)
rot 1- rot 1 rshift >r swap r@ 0= ( bs nh R: v b)
until rdrop rdrop nip
;
Works on both 32- and 64-bit. Isn't surprisingly fast. As a matter of fact: I've seen implementations of binary algorithms that were faster.
: USQRT32 {: u | T H B:$8000 S:15 == H :}
Oops - 32-bit only! It's actually two loops - one to figure out the first candidate for a binary square
On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s. Yours is 0.2s faster on my combo - even with all the function call overhead.
HB
On 07/12/2022 13:44, Hans Bezemer wrote:
On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s. Yours is 0.2s faster on my combo - even with all the function call overhead.
HB
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
\ r> u> if 1- then ( -- u1 )
r> u> + ( -- u1 )
;
I believe 4th uses 1 for TRUE so the last line for 4th would be:
r> u> -
I'm a bit dubious about whether a marginal saving like that is worthwhile.
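For anyone puzzled by the trick, the arithmetic behind it (my own illustration, assuming a system where a true flag is all bits set, i.e. -1, as standard well-formed flags are):
10 9 u> .  \ -1  ( true: all bits set )
9 10 u> .  \ 0
\ so in  r> u> +  the candidate root is decremented exactly when large^2 > u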
Gerry Jackson wrote on Wednesday, 7 December 2022 at 16:30:20 UTC+1:
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN [..]
I'm a bit dubious about whether a marginal saving like that is worthwhile.
\ with microscopic change & tested
: USQRT
dup 2 u< IF EXIT THEN
dup >r 2 rshift recurse
2* dup 1+ dup * r> u< - ;
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
dxforth <dxforth@gmail.com> writes:
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
I then created some versions from it that have the stack effect ( +n1
-- n2 ) and that do not count the iterations:
...
\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
dup r@ / r@ + 2/ ( +n x1 R:x0 )
dup r@ <
WHILE
r> drop >r
REPEAT
2drop r> ;
On 8/12/2022 9:06 am, Anton Ertl wrote:
dxforth <dxf...@gmail.com> writes:
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
I then created some versions from it that have the stack effect ( +n1
-- n2 ) and that do not count the iterations:
...
\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
dup r@ / r@ + 2/ ( +n x1 R:x0 )
dup r@ <
WHILE
r> drop >r
REPEAT
2drop r> ;
With no count to maintain, it can be simplified to:
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
Here's a LOG2 definition (originally by Wil Baden
<wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):
: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;
did Wil Baden explain in detail how the algorithm works?
The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.
On 8/12/2022 2:30 am, Gerry Jackson wrote:
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
Especially when an optimizing compiler is compelled to generate the 'well-formed' flag. I don't think it was ever a given that 'calculate' was better than 'test and branch'. Each situation according to its merits.
It's quite simple and elegant when you think of it.
Hans Bezemer
On Tuesday, December 6, 2022 at 11:14:13 PM UTC+1, Anton Ertl wrote:
Here's a LOG2 definition (originally by Wil Baden
<wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):
: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;
Hello Anton,
did Wil Baden explain in detail how the algorithm works?
The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.
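Not Wil Baden's own explanation, but one way to read it: log2 starts at -1; each test asks whether n still has a set bit at or above the next power-of-two boundary, and if so adds that shift count and discards those bits; the final + adds the one bit that may remain. A short trace on a 64-bit system:
5 log2 .  \ 2   ( the $c test adds 2 and shifts; the final + adds the remaining 1 )
1 log2 .  \ 0   ( no test fires; the final + adds the single bit )
0 log2 .  \ -1  ( nothing fires and nothing is added to the initial -1 )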
dxforth <dxf...@gmail.com> writes:
With no count to maintain, it can be simplified to:
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
The version above is called USQRTB (or B) below. It inspired me to:
: usqrtc ( u -- root )
dup 2/ dup
begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;
The idea here is to never move a value on the critical path to memory
in VFX (i.e., the critical path value should be on the TOS at the
basic block boundary).
Here are the numbers, I'll show the code in a later posting:
for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null
for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
1 4 9 a b c
187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake
169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3
The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
174 cycles, but here I show a good result.
Anyway, the idea behind USQRTC seems to work well: It wins in the
number of instructions, and the number of cycles on both CPUs.
On Thursday, 8 December 2022 at 01:10:06 UTC+1, dxforth wrote:
On 8/12/2022 2:30 am, Gerry Jackson wrote:
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
Especially when an optimizing compiler is compelled to generate the 'well-formed' flag. I don't think it was ever a given that 'calculate' was better than 'test and branch'. Each situation according to its merits.
This is exactly what happens with my code generator. The "optimized" version is 5% larger (but just 3 bytes) and, more surprisingly, has 30% longer runtime!
Gerry's original recursive version is anyway 5 times faster than your version! Could be the division is slow on my processor, an E5-4657L v2.
But its name is wrong. It is not unsigned. It requires a signed positive number, as seen also by the original stack comment.
On Thursday, December 8, 2022 at 7:52:02 PM UTC+1, Anton Ertl wrote:
[..]
FORTH> in ( #100,000,000 iterations )
303700049900000000 7.753 seconds elapsed. minforth
303700049900000000 7.039 seconds elapsed. dxforth #1
303700049900000000 4.424 seconds elapsed. Albert
303700049900000000 0.681 seconds elapsed. Anton #1
303700049900000000 6.229 seconds elapsed. Gerry
303700049900000000 6.540 seconds elapsed. Anton #2
303700049900000000 6.208 seconds elapsed. Anton #3
303700049900000000 6.554 seconds elapsed. dxforth #2
303700049900000000 6.497 seconds elapsed. Anton #3 ok
303700049900000000 0.036 seconds elapsed. iForth ok
FORTH> .TICKER-INFO
AMD Ryzen 7 5800X 8-Core Processor, TICKS-GET uses iTSC at 4192MHz
Locals cost something, but a smart algorithm obviously wins.
dxforth <dxforth@gmail.com> writes:
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
Concerning usqrtd, I could not use exactly what I outlined earlier, but
had to adapt it slightly to guarantee an initial value that's larger
than the root (required by the terminating condition of the loop):
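A sketch of what such an adaptation might look like (my own guess at the shape, not Anton's actual usqrtd; CLZ ( x -- u ) and BITS/CELL are again assumed words): rounding the shift count up makes the start value strictly larger than the real root, so the loop may safely stop as soon as an iteration no longer decreases.
\ sketch only - hypothetical names, CLZ and BITS/CELL assumed
: x0-above ( u -- x0 )  \ 2^(ceil(significant-bits/2)), always > sqrt(u)
  clz bits/cell swap - 1+ 2/ 1 swap lshift ;
: usqrt-sketch ( u -- root )
  dup 0= if exit then       \ avoid dividing by zero for u=0
  dup x0-above              ( u x0 )
  begin 2dup / over + 2/    ( u x0 x1 )
        2dup > while nip    \ still decreasing: continue with x1
  repeat drop nip ;         \ x1 >= x0: x0 is the floor of the root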
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
"minf...@arcor.de" <minf...@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good
initial value.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Marcel Hendrix <m...@iae.nl> writes:
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2.
On Friday, December 9, 2022 at 7:48:45 PM UTC+1, Anton Ertl wrote:
Marcel Hendrix <m...@iae.nl> writes:
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
Sorry, I mixed log2 up with 2^x ( u -- 2^u ) ...
Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2.
FORTH> : test 7 log2 . ; ok
FORTH> see test
Flags: ANSI
$0133DC40 : test
$0133DC4A mov rbx, 7 d#
$0133DC51 mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC5B bsr rax, rbx
$0133DC5F mov rbx, rax
$0133DC62 push rbx
$0133DC63 jmp .+10 ( $0124A102 ) offset NEAR
$0133DC68 ;
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest
On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
[..]
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user
of log2 decide how it should be handled...
Or do you mean that bsr with 0 does not work the same on all architectures/chip generations?
FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok
Marcel Hendrix <m...@iae.nl> writes:
On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
[..]
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user of log2 decide how it should be handled...
What happens in programming is that some user happens to write a
program that just uses the result as-is, and happens to work as
intended even for input 0. And if you want to provide a high-quality experience, your programming system should better be consistent in
what it returns in that case.
Gforth returns -1 for 0 LOG2.
One benefit of returning -1 is that you can get a count of leading
zeros with
( x ) bits/cell 1- swap log2 -
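A quick check of that expression (my own, assuming Gforth's LOG2 and BITS/CELL on a 64-bit system):
6 log2 .                 \ 2
0 log2 .                 \ -1
bits/cell 1- 6 log2 - .  \ 61 leading zeros
bits/cell 1- 0 log2 - .  \ 64 leading zeros for input 0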
Or do you mean that bsr with 0 does not work the same on all architectures/chip generations?
Intel certainly reserves the right to do it that way by saying that
the result is undefined in that case. However, for the same reason as discussed above, they (and AMD) better stick with the behaviour of
their original implementation, or their shiny new processor might get
a reputation for incompatibility (yes, Intel may blame the
application, but the customer does not care who is to blame).
So they might just as well document the actual behaviour.
FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok
It seems that on your hardware (Zen3 IIRC) bsr leaves the destination register unchanged when the source is 0, and iForth is one application that relies on this behaviour in order to return -1 (otherwise the instruction at $0133DC4D would be superfluous); I just tried it on a
Skylake, and unsurprisingly it behaves the same way. If you don't
want to rely on Intel and AMD sticking to the same behaviour, you can
change the central instructions as shown above or as
mov rax, -1
bsr rbx,rbx
cmove rbx, rax
Interestingly, AMD has added an instruction lzcnt with the ABM
extension since the K10 (2007), and Intel has picked it up in the BMI1 extension since the Haswell (2013). lzcnt delivers a complementary
result (e.g., all-bits-set produces 0, and 0 produces 64), but in
particular it is also defined for 0. It is faster than bsr on the AMD
CPUs listed on https://gmplib.org/~tege/x86-timing.pdf (e.g., 1 cycle
latency on Zen2 and 4 lzcnt/cycle throughput, vs. 4 cycle latency for
bsr and 1 bsr every 4 cycles throughput). On the Intel CPUs listed
they both have 3 cycles latency, and 1 instruction/cycle throughput.
I wonder what makes bsr so hard to implement on the AMD CPUs that they
added lzcnt.
Of course for log2 you would need something like:
lzcnt rax, rbx
mov rbx, 63
sub rbx, rax
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Gforth returns -1 for 0 LOG2.
One benefit of returning -1 is that you can get a count of leading
zeros with
( x ) bits/cell 1- swap log2 -
OTOH simplest implementations return 0:
: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;
To return something else requires it be special-cased - which can be
left to the application.
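For comparison, what the simple version above answers (my own check; any standard system should agree):
1 log2 .     \ 0
1024 log2 .  \ 10
0 log2 .     \ 0  ( the special case left to the application )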
My E5 v2 cpu is the generation before Haswell and does not have the lzcnt opcode.
But it does not flag an error if executed; instead it executes bsr! and gives a wrong
result. This is also stated in the Intel manual.
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
On Sun, 11 Dec 2022 09:28:07 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[log2 to give a first estimate for a sqrt program]
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops, so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
On Sunday, December 11, 2022 at 12:56:52 AM UTC+1, dxforth wrote:
OTOH simplest implementations return 0:
: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;
To return something else requires it be special-cased - which can be
left to the application.
You can rewrite this as:
: LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;
Same number of operations, but u1 = 0 returns u2 = -1.
For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
As Anton stated, -1 has certain advantages. It also has the advantage that we can use
0< to check whether the output is correct or illegal.
u2 = 0 should allow us to conclude that u1 = 1.
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
In <20221211115527.bf15c41eb56b7d462eedf5a0@127.0.0.1>, "Kerr-Mudd, John" <admin@127.0.0.1> writes:
On Sun, 11 Dec 2022 11:48:30 +0000
"Kerr-Mudd, John" <admin@127.0.0.1> wrote:
On Sun, 11 Dec 2022 09:28:07 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[log2 to give a first estimate for a sqrt program]
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a 20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
Hi,
It seems strange to use such an encoding?
On Sun, 11 Dec 2022 11:48:30 +0000
"Kerr-Mudd, John" <admin@127.0.0.1> wrote:
On Sun, 11 Dec 2022 09:28:07 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[log2 to give a first estimate for a sqrt program]
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops, so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
doh!
--
Bah, and indeed Humbug.
I can't prove it, but it seems that beginning with a value greater than the square root and terminating when Xn - Xn+1 < 2 always gives the correct floor function SQRTF.
: sqrtf \ u -- n
locals| u |
u sqrtc~
begin dup u over / + 2/
tuck - 2 <
until ;
On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Absolutely correct. What I want to say is that for u1=0 the output should either be undefined, or it should be u2=-1. Not any other value.
On 12/12/2022 10:04 pm, Heinrich Hohl wrote:
On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Absolutely correct. What I want to say is that for u1=0 the output should either be undefined, or it should be u2=-1. Not any other value.
It should be u2=-1 only until such time as someone comes along and says
u2=0 is useful. I don't have an integer example but I do have an f/p
example:
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
Looks correct at first sight, but at closer inspection we find:
10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01
The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.
However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
Looks correct at first sight, but at closer inspection we find:
10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01
The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.
However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.
0e flog has to be a NAN.
If any, -INF as approximation, but never zero.