Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Done:
4 3
14 4
44 6
141 8
447 9
1414 11
4472 13
14142 14
: sqrt-rem ( n -- sqrt rem)
>r 0 1 begin dup r@ > except 4 * repeat \ find a power of 4 greater than TORS
begin
dup 1 > \ compute while greater than unity
while
2/ 2/ swap over over + negate r@ + \ integer divide by 4
dup 0< if drop 2/ else rdrop >r 2/ over + then swap
repeat drop r> ( sqrt rem)
;
BTW - fewer iterations doesn't always mean "faster". I got another pretty fast one, but it always requires 31 iterations on a 64-bit architecture.
BTW what is that non-standard word: EXCEPT (guessing: 0= WHILE)?
You guessed correctly, Sir. I always hated "0= WHILE" and "0= IF", so I made myself a few alternatives.
On Tuesday, December 6, 2022 at 3:40:35 PM UTC+1, minf...@arcor.de wrote:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
I'll give you that: it is FAST! However, this one is even faster (on my system):
: usqrt
>r 0 r@ 2/ ( ct x0 R: u)
begin
r@ over / over + 2/ swap over over < ( ct x1 x0 f)
while
drop swap 1+ swap
repeat rdrop nip nip
;
time pp4th -x usqrt.4th
Yours:
real 0m0,762s
user 0m0,761s
sys 0m0,000s
Mine:
real 0m0,682s
user 0m0,682s
sys 0m0,000s
...
Nice, although it has more loops and more operations in the main loop, this one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
"minf...@arcor.de" <minforth@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good initial value.
- anton
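As a word, Anton's expression could look like this (a sketch only, not from his post; CLZ ( x -- u ) and BITS/CELL are assumed to exist, neither is a standard word). For u > 0 the result is within a factor of two of the true root, which is the kind of start value Newton-Raphson likes:
\ sketch, assuming CLZ ( x -- u ) and BITS/CELL; u > 0
: x0-estimate ( u -- x0 )  clz bits/cell swap - 2/ 1 swap lshift ;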
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt
\ results:
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
My take on this.
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
Newton is only crazy fast if the start value is good.[..]
The following takes 10 iterations away:
\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP RDROP ;
S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK
(that is MAX-INT, if you had not noticed.)
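Why that start value is safe (my own note, not from the post): it can never fall below the integer root.
\ u <= 2^20 : 1024 MAX supplies 1024 >= floor(sqrt(u))
\ u >  2^20 : u 10 RSHIFT = floor(u/1024) >= floor(sqrt(u)), since sqrt(u) > 1024
9223372036854775807 10 RSHIFT .  \ 9007199254740991, far above the root 3037000499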
"minf...@arcor.de" <minf...@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
On Tuesday, December 6, 2022 at 6:35:24 PM UTC+1, none albert wrote:
[..]
Newton is only crazy fast if the start value is good.
The following takes 10 iterations away:
\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP RDROP ;
S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK
(that is MAX-INT, if you had not noticed.) [..]
FORTH> : SQRT DUP >R 10 RSHIFT 1024 MAX BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE NIP REPEAT DROP -R ;
Redefining `SQRT` ok
FORTH> : test timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test 3037000499000000000 53.520 seconds elapsed. ok
FORTH> forget sqrt : test2 timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test2 3037000499000000000 0.435 seconds elapsed. ok
In article <2022Dec6.175959@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
"minf...@arcor.de" <minforth@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good initial value.
Where clz is available you should use it. In this context I consider it as cheating.
If clz is not available in hardware, it takes as many iterations (or more) as you save on the Newton iteration.
On Tuesday, December 6, 2022 at 5:53:29 PM UTC+1, dxforth wrote:
Nice, although it has more loops and more operations in the main loop, this one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
On 7/12/2022 4:24 am, Hans Bezemer wrote:
That one shaves another 8/100 of a second from my version. ;-)
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
dxforth wrote on Wednesday, 7 December 2022 at 01:09:23 UTC+1:
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
Lol!! Tell that Heron of Alexandria, it is his root-finding method.
On 7/12/2022 11:49 am, minf...@arcor.de wrote:
Lol!! Tell that Heron of Alexandria, it is his root-finding method.
Maybe but it was your implementation. So did the non-locals version prove to be "shorter AND faster"? That was your challenge - no?
Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
is that he was born before Charles Moore .. about 2000 years ... :o)
dxforth wrote on Wednesday, 7 December 2022 at 02:22:16 UTC+1:
Maybe but it was your implementation. So did the non-locals version prove to be "shorter AND faster"? That was your challenge - no?
Admitted that, I won't yawn at that old "shun locals" talk again. If the proposal just duplicates Heron's logic, it is just equivalent IMO. If speed depends on the compiler used, it is not really better - lxf even seems capable of generating the same code for both versions.
"minf...@arcor.de" <minf...@arcor.de> writes:
Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback is that he was born before Charles Moore .. about 2000 years ... :o)
But the question is how Heron described it. Modern descriptions of course use modern notation and terminology, but the original notation and terminology is often quite different.
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt
\ results:
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
Here's a recursive solution I wrote a few months ago based on: https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations
I've put stack comments in so it can be related to the original in
Wikipedia.
: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
r> u> if 1- then ( -- u1 )
;
Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using
the stack vs locals.
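For reference, a quick interactive check against the values from the first post (my own test; any standard system should do):
20 usqrt .         \ 4
2000000 usqrt .    \ 1414
200000000 usqrt .  \ 14142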
On Wednesday, December 7, 2022 at 10:20:46 AM UTC+1, minf...@arcor.de wrote:
Yesterday I found a different algorithm in the net that avoids slow divisions and should therefore be even faster. The ARM code at the end of the document is a gem: http://www.azillionmonkeys.com/qed/ulerysqroot.pdf
Quick hack:
cell-bits 2 / 1- constant (bshift)
: sqrt
>r (bshift) 1 over lshift >r 0 ( bs nh R: v b )
begin
tuck 1 lshift r@ + over lshift ( nh bs t R: v b)
r> over r@ > ( nh bs t b f R: v)
if nip rot else swap negate r> + >r rot over + then ( bs b nh R: v)
rot 1- rot 1 rshift >r swap r@ 0= ( bs nh R: v b)
until rdrop rdrop nip
;
Works on both 32- and 64-bit. Isn't surprisingly fast. As a matter of fact: I've seen implementations of binary algorithms that were faster.
: USQRT32 {: u | T H B:$8000 S:15 == H :}
Oops - 32-bit only! It's actually two loops - one to figure out the first candidate for a binary square
On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s. Yours is 0.2s faster on my combo - even with all the function call overhead.
HB
On 07/12/2022 13:44, Hans Bezemer wrote:
On Wednesday, December 7, 2022 at 12:36:24 PM UTC+1, Gerry Jackson wrote:
Worked OK for me, I wasn't concerned about speed so don't know how it compares in that respect. Incidentally this provides an example of using the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s. Yours is 0.2s faster on my combo - even with all the function call overhead.
HB
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
\ r> u> if 1- then ( -- u1 )
r> u> + ( -- u1 )
;
I believe 4th uses 1 for TRUE so the last line for 4th would be:
r> u> -
I'm a bit dubious about whether a marginal saving like that is worthwhile.
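For anyone puzzled by the trick, the arithmetic behind it (my own illustration, assuming a system where a true flag is all bits set, i.e. -1, as standard well-formed flags are):
10 9 u> .  \ -1  ( true: all bits set )
9 10 u> .  \ 0
\ so in  r> u> +  the candidate root is decremented exactly when large^2 > u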
Gerry Jackson wrote on Wednesday, 7 December 2022 at 16:30:20 UTC+1:
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN [..]
I'm a bit dubious about whether a marginal saving like that is worthwhile.
\ with microscopic change & tested
: USQRT
dup 2 u< IF EXIT THEN
dup >r 2 rshift recurse
2* dup 1+ dup * r> u< - ;
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
dxforth <dxforth@gmail.com> writes:
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
I then created some versions from it that have the stack effect ( +n1
-- n2 ) and that do not count the iterations:
...
\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
dup r@ / r@ + 2/ ( +n x1 R:x0 )
dup r@ <
WHILE
r> drop >r
REPEAT
2drop r> ;
On 8/12/2022 9:06 am, Anton Ertl wrote:
dxforth <dxf...@gmail.com> writes:
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
I then created some versions from it that have the stack effect ( +n1
-- n2 ) and that do not count the iterations:
...
\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
dup r@ / r@ + 2/ ( +n x1 R:x0 )
dup r@ <
WHILE
r> drop >r
REPEAT
2drop r> ;
With no count to maintain, it can be simplified to:
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
Here's a LOG2 definition (originally by Wil Baden
<wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):
: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;
did Wil Baden explain in detail how the algorithm works?
The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.
On 8/12/2022 2:30 am, Gerry Jackson wrote:
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
Especially when an optimizing compiler is compelled to generate the 'well-formed' flag. I don't think it was ever a given that 'calculate' was better than 'test and branch'. Each situation according to its merits.
It's quite simple and elegant when you think of it.
Hans Bezemer
On Tuesday, December 6, 2022 at 11:14:13 PM UTC+1, Anton Ertl wrote:
Here's a LOG2 definition (originally by Wil Baden
<wilbadenE...@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):
: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;
Hello Anton,
did Wil Baden explain in detail how the algorithm works?
The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.
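Not Wil Baden's own explanation, but one way to read it: log2 starts at -1; each test asks whether n still has a set bit at or above the next power-of-two boundary, and if so adds that shift count and discards those bits; the final + adds the one bit that may remain. A short trace on a 64-bit system:
5 log2 .  \ 2   ( the $c test adds 2 and shifts; the final + adds the remaining 1 )
1 log2 .  \ 0   ( no test fires; the final + adds the single bit )
0 log2 .  \ -1  ( nothing fires and nothing is added to the initial -1 )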
dxforth <dxf...@gmail.com> writes:
With no count to maintain, it can be simplified to:
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
The version above is called USQRTB (or B) below. It inspired me to:
: usqrtc ( u -- root )
dup 2/ dup
begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;
The idea here is to never move a value on the critical path to memory
in VFX (i.e., the critical path value should be on the TOS at the
basic block boundary).
Here are the numbers, I'll show the code in a later posting:
for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null
for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
1 4 9 a b c
187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake
169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3
The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
174 cycles, but here I show a good result.
Anyway, the idea behind USQRTC seems to work well: It wins in the
number of instructions, and the number of cycles on both CPUs.
On Thursday, 8 December 2022 at 01:10:06 UTC+1, dxforth wrote:
On 8/12/2022 2:30 am, Gerry Jackson wrote:
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
Especially when an optimizing compiler is compelled to generate the 'well-formed' flag. I don't think it was ever a given that 'calculate' was better than 'test and branch'. Each situation according to its merits.
This is exactly what happens with my code generator. The "optimized" version is 5% larger (but just 3 bytes) and, more surprisingly, has 30% longer runtime!
Gerry's original recursive version is anyway 5 times faster than your version! Could be the division is slow on my processor, an E5-4657L v2.
But its name is wrong. It is not unsigned. It requires a signed positive number, as seen also by the original stack comment.
On Thursday, December 8, 2022 at 7:52:02 PM UTC+1, Anton Ertl wrote:
[..]
FORTH> in ( #100,000,000 iterations )
303700049900000000 7.753 seconds elapsed. minforth
303700049900000000 7.039 seconds elapsed. dxforth #1
303700049900000000 4.424 seconds elapsed. Albert
303700049900000000 0.681 seconds elapsed. Anton #1
303700049900000000 6.229 seconds elapsed. Gerry
303700049900000000 6.540 seconds elapsed. Anton #2
303700049900000000 6.208 seconds elapsed. Anton #3
303700049900000000 6.554 seconds elapsed. dxforth #2
303700049900000000 6.497 seconds elapsed. Anton #3 ok
303700049900000000 0.036 seconds elapsed. iForth ok
FORTH> .TICKER-INFO
AMD Ryzen 7 5800X 8-Core Processor, TICKS-GET uses iTSC at 4192MHz
Locals cost something, but a smart algorithm obviously wins.
dxforth <dxforth@gmail.com> writes:
Locals are supposed to reduce 'stack juggling' and simplify code but in this instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
Concerning usqrtd, I could not use exactly what I outlined earlier, but
had to adapt it slightly to guarantee an initial value that's larger
than the root (required by the terminating condition of the loop):
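A sketch of what such an adaptation might look like (my own guess at the shape, not Anton's actual usqrtd; CLZ ( x -- u ) and BITS/CELL are again assumed words): rounding the shift count up makes the start value strictly larger than the real root, so the loop may safely stop as soon as an iteration no longer decreases.
\ sketch only - hypothetical names, CLZ and BITS/CELL assumed
: x0-above ( u -- x0 )  \ 2^(ceil(significant-bits/2)), always > sqrt(u)
  clz bits/cell swap - 1+ 2/ 1 swap lshift ;
: usqrt-sketch ( u -- root )
  dup 0= if exit then       \ avoid dividing by zero for u=0
  dup x0-above              ( u x0 )
  begin 2dup / over + 2/    ( u x0 x1 )
        2dup > while nip    \ still decreasing: continue with x1
  repeat drop nip ;         \ x1 >= x0: x0 is the floor of the root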
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
"minf...@arcor.de" <minf...@arcor.de> writes:
Inspired by another thread I came up with the iterative code shown below. The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square root, see <https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many architectures have built-in. Newton-Raphson profits a lot from a good
initial value.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Marcel Hendrix <m...@iae.nl> writes:
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2.
On Friday, December 9, 2022 at 7:48:45 PM UTC+1, Anton Ertl wrote:
Marcel Hendrix <m...@iae.nl> writes:
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
Sorry, I mixed log2 up with 2^x ( u -- 2^u ) ...
Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2.
FORTH> : test 7 log2 . ; ok
FORTH> see test
Flags: ANSI
$0133DC40 : test
$0133DC4A mov rbx, 7 d#
$0133DC51 mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC5B bsr rax, rbx
$0133DC5F mov rbx, rax
$0133DC62 push rbx
$0133DC63 jmp .+10 ( $0124A102 ) offset NEAR
$0133DC68 ;
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest
On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
[..]
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user
of log2 decide how it should be handled...
Or do you mean that bsr with 0 does not work the same on all architectures/chip generations?
FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok
Marcel Hendrix <m...@iae.nl> writes:
On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
[..]
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user of log2 decide how it should be handled...
What happens in programming is that some user happens to write a
program that just uses the result as-is, and happens to work as
intended even for input 0. And if you want to provide a high-quality experience, your programming system should better be consistent in
what it returns in that case.
Gforth returns -1 for 0 LOG2.
One benefit of returning -1 is that you can get a count of leading
zeros with
( x ) bits/cell 1- swap log2 -
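A quick check of that expression (my own, assuming Gforth's LOG2 and BITS/CELL on a 64-bit system):
6 log2 .                 \ 2
0 log2 .                 \ -1
bits/cell 1- 6 log2 - .  \ 61 leading zeros
bits/cell 1- 0 log2 - .  \ 64 leading zeros for input 0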
Or do you mean that bsr with 0 does not work the same on all architectures/chip generations?
Intel certainly reserves the right to do it that way by saying that
the result is undefined in that case. However, for the same reason as discussed above, they (and AMD) better stick with the behaviour of
their original implementation, or their shiny new processor might get
a reputation for incompatibility (yes, Intel may blame the
application, but the customer does not care who is to blame).
So they might just as well document the actual behaviour.
FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok
It seems that on your hardware (Zen3 IIRC) bsr leaves the destination register unchanged when the source is 0, and iForth is one application that relies on this behaviour in order to return -1 (otherwise the instruction at $0133DC4D would be superfluous); I just tried it on a
Skylake, and unsurprisingly it behaves the same way. If you don't
want to rely on Intel and AMD sticking to the same behaviour, you can
change the central instructions as shown above or as
mov rax, -1
bsr rbx,rbx
cmove rbx, rax
Interestingly, AMD has added an instruction lzcnt with the ABM
extension since the K10 (2007), and Intel has picked it up in the BMI1 extension since the Haswell (2013). lzcnt delivers a complementary
result (e.g., all-bits-set produces 0, and 0 produces 64), but in
particular it is also defined for 0. It is faster than bsr on the AMD
CPUs listed on https://gmplib.org/~tege/x86-timing.pdf (e.g., 1 cycle
latency on Zen2 and 4 lzcnt/cycle throughput, vs. 4 cycle latency for
bsr and 1 bsr every 4 cycles throughput). On the Intel CPUs listed
they both have 3 cycles latency, and 1 instruction/cycle throughput.
I wonder what makes bsr so hard to implement on the AMD CPUs that they
added lzcnt.
Of course for log2 you would need something like:
lzcnt rax, rbx
mov rbx, 63
sub rbx, rax
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Gforth returns -1 for 0 LOG2.
One benefit of returning -1 is that you can get a count of leading
zeros with
( x ) bits/cell 1- swap log2 -
OTOH simplest implementations return 0:
: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;
To return something else requires it be special-cased - which can be
left to the application.
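For comparison, what the simple version above answers (my own check; any standard system should agree):
1 log2 .     \ 0
1024 log2 .  \ 10
0 log2 .     \ 0  ( the special case left to the application )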
My E5 v2 cpu is the generation before Haswell and does not have the lzcnt opcode.
But it does not flag an error if executed; instead it executes bsr! and gives a wrong
result. This is also stated in the Intel manual.
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
On Sun, 11 Dec 2022 09:28:07 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[log2 to give a first estimate for a sqrt program]
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops, so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
On Sunday, December 11, 2022 at 12:56:52 AM UTC+1, dxforth wrote:
OTOH simplest implementations return 0:
: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;
To return something else requires it be special-cased - which can be
left to the application.
You can rewrite this as:
: LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;
Same number of operations, but u1 = 0 returns u2 = -1.
For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
As Anton stated, -1 has certain advantages. It also has the advantage that we can use
0< to check whether the output is correct or illegal.
u2 = 0 should allow us to conclude that u1 = 1.
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
In <20221211115527.bf15c41eb56b7d462eedf5a0@127.0.0.1>, "Kerr-Mudd, John" <admin@127.0.0.1> writes:
On Sun, 11 Dec 2022 11:48:30 +0000
"Kerr-Mudd, John" <admin@127.0.0.1> wrote:
On Sun, 11 Dec 2022 09:28:07 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[log2 to give a first estimate for a sqrt program]
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a 20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
Hi,
It seems strange to use such an encoding?
On Sun, 11 Dec 2022 11:48:30 +0000
"Kerr-Mudd, John" <admin@127.0.0.1> wrote:
On Sun, 11 Dec 2022 09:28:07 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[log2 to give a first estimate for a sqrt program]
P Falth <peter.m.falth@gmail.com> writes:
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops, so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
doh!
--
Bah, and indeed Humbug.
I can't prove it, but it seems that beginning with a value greater than the square root and terminating when Xn - Xn+1 < 2 always gives the correct floor function SQRTF.
: sqrtf \ u -- n
locals| u |
u sqrtc~
begin dup u over / + 2/
tuck - 2 <
until ;
On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Absolutely correct. What I want to say is that for u1=0 the output should either be undefined, or it should be u2=-1. Not any other value.
On 12/12/2022 10:04 pm, Heinrich Hohl wrote:
On Monday, December 12, 2022 at 3:35:16 AM UTC+1, dxforth wrote:
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Absolutely correct. What I want to say is that for u1=0 the output should either be undefined, or it should be u2=-1. Not any other value.
It should be u2=-1 only until such time as someone comes along and says
u2=0 is useful. I don't have an integer example but I do have an f/p
example:
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
Looks correct at first sight, but at closer inspection we find:
10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01
The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.
However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
Looks correct at first sight, but at closer inspection we find:
10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01
The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.
However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.
0e flog has to be a NAN.
If any, -INF as approximation, but never zero.