• Eliminating FDUP F*

    From Krishna Myneni@21:1/5 to All on Wed Nov 29 07:29:04 2023
Use of the sequence "FDUP F*" is ubiquitous in Forth scientific code,
for lack of a common word which squares an fp number. The sequence is
not only less readable but also conveys less meaning to anyone reading
the code.

I've updated the FSL modules in kForth (32, Win32, and 64) to replace
all instances of "FDUP F*" with the (built-in) word FSQUARE. Some FSL
modules provided definitions of FSQR for the same function (by MHX),
and I replaced these instances with FSQUARE, which I find more readable
and less error-prone given the proximity of FSQR to FSQRT.
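
For systems which do not provide it, a minimal fallback definition is
trivial (a sketch; kForth's FSQUARE is built in):

\ Fallback for systems without a built-in FSQUARE:
: FSQUARE ( F: r -- r^2 )  FDUP F* ;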

    --
    Krishna Myneni

  • From minforth@21:1/5 to All on Wed Nov 29 14:29:09 2023
    Thanks.

In my apps I added for convenience (naive sketches below):

FINV     alias 1/F
F2*  F2/
FHYPOT   sqrt(a^2+b^2)
FMA      Horner step, a*b+c
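
Naive reference definitions, as a sketch only (native versions would
map to FPU or libm primitives; a robust FHYPOT also needs the
overflow-safe scaling discussed below, and a true FMA rounds only
once, which this sketch does not):

: FINV   ( F: r -- 1/r )        1e FSWAP F/ ;
: F2*    ( F: r -- r*2 )        2e F* ;
: F2/    ( F: r -- r/2 )        2e F/ ;
: FHYPOT ( F: a b -- r )        FDUP F* FSWAP FDUP F* F+ FSQRT ;
: FMA    ( F: a b c -- a*b+c )  FROT FROT F* F+ ;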

  • From mhx@21:1/5 to minforth on Wed Nov 29 18:05:25 2023
    minforth wrote:

    FHYPOT sqrt(a^2+b^2)

    This is a nice one (that iForth does
    not have) because FHYPOT is not only
    more efficient but also documents a
    tricky numerical problem.
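
The tricky problem is presumably overflow and underflow in the naive
formula: for IEEE doubles, squaring any magnitude above roughly
1.3e154 overflows even when the true hypotenuse is representable. A
scaled sketch which avoids this:

\ Overflow-safe FHYPOT sketch: square only the ratio min/max,
\ which is at most 1.
: FHYPOT ( F: a b -- r )
   FABS FSWAP FABS                        \ F: |b| |a|
   FOVER FOVER FMIN FROT FROT FMAX        \ F: min max
   FDUP F0= IF FDROP FDROP 0e EXIT THEN   \ both zero
   FSWAP FOVER F/                         \ F: max t=min/max
   FDUP F* 1e F+ FSQRT F* ;               \ max*sqrt(1+t^2)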

    -marcel

  • From minforth@21:1/5 to mhx on Wed Nov 29 20:45:46 2023
    mhx wrote:
    FHYPOT sqrt(a^2+b^2)

    This is a nice one (that iForth does
    not have) because FHYPOT is not only
    more efficient but also documents a
    tricky numerical problem.

Perhaps as described here?
    https://arxiv.org/pdf/1904.09481.pdf

  • From Krishna Myneni@21:1/5 to minforth on Wed Nov 29 19:02:45 2023
    On 11/29/23 08:29, minforth wrote:
    Thanks.

    In my apps I added for convenience
    FINV alias 1/F
    F2* F2/
    FHYPOT    sqrt(a^2+b^2)
    FMA    horner a*b+c

    FINV is also a commonly needed word, instead of writing

    "1.0E0 FSWAP F/".

    The other most useful word for vector/matrix code is F+!, which also
    improves the efficiency, readability, and compactness of code. Use of
    F+! can be found in the FSL modules.

    F+! has common usage and is easily comprehensible so it may be time to
    enter it formally into the Forth floating point lexicon.
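
For reference, the usual definition (a sketch; some systems provide
it as a primitive):

\ Add r to the float stored at f-addr.
: F+! ( f-addr -- ) ( F: r -- )  DUP F@ F+ F! ;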

    --
    KM

  • From minforth@21:1/5 to Krishna Myneni on Thu Nov 30 08:22:58 2023
    Krishna Myneni wrote:

    On 11/29/23 08:29, minforth wrote:
    Thanks.

    In my apps I added for convenience
    FINV alias 1/F
    F2* F2/
    FHYPOT    sqrt(a^2+b^2)
    FMA    horner a*b+c

    FINV is also a commonly needed word, instead of writing

    "1.0E0 FSWAP F/".

    The other most useful word for vector/matrix code is F+!, which also
    improves the efficiency, readability, and compactness of code. Use of
    F+! can be found in the FSL modules.

    F+! has common usage and is easily comprehensible so it may be time to
    enter it formally into the Forth floating point lexicon.

May I add F*! for scalar operations on vector/matrix elements?
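
A sketch, by analogy with F+! (not a standard word):

\ Multiply the float stored at f-addr by r.
: F*! ( f-addr -- ) ( F: r -- )  DUP F@ F* F! ;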

  • From none) (albert@21:1/5 to mhx on Thu Nov 30 10:51:59 2023
    In article <104d31d0da99003ed0dc323134d1243c@news.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    minforth wrote:

    FHYPOT sqrt(a^2+b^2)

    This is a nice one (that iForth does
    not have) because FHYPOT is not only
    more efficient but also documents a
    tricky numerical problem.

The hyp calculation is stable as hell; I can't think
of any numerical problem.
It is also useful. I added a `` HYPOs '' as a separate
screen to my fixed point screen.
(Using DSQRT, it is not particularly difficult to
implement.)

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Krishna Myneni@21:1/5 to minforth on Thu Nov 30 15:26:32 2023
    On 11/30/23 02:22, minforth wrote:
    Krishna Myneni wrote:

    ...
    The other most useful word for vector/matrix code is F+!, which also
    improves the efficiency, readability, and compactness of code. Use of
    F+! can be found in the FSL modules.

    F+! has common usage and is easily comprehensible so it may be time to
    enter it formally into the Forth floating point lexicon.

    May I add F*! for scalar operations on vector/matrix elements

It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

    --
    Krishna

  • From minforth@21:1/5 to Krishna Myneni on Fri Dec 1 12:32:03 2023
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example. In other numerical matrix algorithms,
pivoting is rather common, which involves scalar column or row
multiplication. Most occurrences in my code involve shifting and
scaling of vectors.

  • From Anton Ertl@21:1/5 to minforth on Sat Dec 2 07:06:51 2023
    minforth@gmx.net (minforth) writes:
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example.

Not in my experience. Matrix multiplication always multiplies one
element of one matrix with one element of the other matrix. Since you
still need both matrices, you do not want to use F*! for that. Matrix
multiplication sums the products of these multiplications; e.g., a
1000x1000 matrix multiply sums up 1000 products to produce one element
of the target matrix. F+! can be used for that.
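
As a sketch of that inner update, with a hypothetical helper
ACC-PRODUCT (a-addr, b-addr, c-addr are the addresses of the current
A, B, and C elements; F+! as sketched earlier):

\ c := c + a*b for one element triple.
: ACC-PRODUCT ( a-addr b-addr c-addr -- )
   >R F@ F@ F* R> F+! ;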

But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
the naive variant that uses a dot product, the speedup exceeds a
factor of 25 (slide 78). Even when the auto-vectorization of gcc
kicks in (with -O3), the result is still >5 times slower than
OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to Anton Ertl on Sat Dec 2 09:01:53 2023
    Anton Ertl wrote:
Matrix multiplication (if not available as a primitive or from an
external library) is an example.

Not in my experience. Matrix multiplication always multiplies one
element of one matrix with one element of the other matrix. Since you
still need both matrices, you do not want to use F*! for that. Matrix
multiplication sums the products of these multiplications; e.g., a
1000x1000 matrix multiply sums up 1000 products to produce one element
of the target matrix. F+! can be used for that.

But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
    the naive variant that uses a dot product, the speedup exceeds a
    factor of 25 (slide 78). Even when the auto-vectorization of gcc
    kicks in (with -O3), the result is still >5 times slower than
    OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

    Yes. On desktop systems, it makes little sense not to use numerical maths libraries for such problems. Large matrices are usually decomposed into
    blocks, and sparse matrices require special techniques. It would be quite tedious to reinvent all the wheels and program them by hand in Forth code,
    let alone debug and optimise your creation.

    Things are different, however, if you don't have the space to hold fat
    library files. In resource-constrained systems, you'll prefer in-place algorithms wherever possible. If you can do the calculations in background tasks, speed is not important. And LU decomposition helps a lot, but that
    is no surprise.

  • From none) (albert@21:1/5 to Anton Ertl on Sat Dec 2 14:56:03 2023
    In article <2023Dec2.080651@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    minforth@gmx.net (minforth) writes:
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example.

Not in my experience. Matrix multiplication always multiplies one
element of one matrix with one element of the other matrix. Since you
still need both matrices, you do not want to use F*! for that. Matrix
multiplication sums the products of these multiplications; e.g., a
1000x1000 matrix multiply sums up 1000 products to produce one element
of the target matrix. F+! can be used for that.

But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
the naive variant that uses a dot product, the speedup exceeds a
factor of 25 (slide 78). Even when the auto-vectorization of gcc
kicks in (with -O3), the result is still >5 times slower than
OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

This is an excellent opportunity to introduce a single assembler
routine that provides a huge speedup: roughly, a vector-times-vector
multiplication with specified start addresses, specified strides, and
a length.



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Anton Ertl@21:1/5 to albert@cherry. on Sat Dec 2 16:44:33 2023
    albert@cherry.(none) (albert) writes:
    In article <2023Dec2.080651@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
the naive variant that uses a dot product, the speedup exceeds a
factor of 25 (slide 78). Even when the auto-vectorization of gcc
kicks in (with -O3), the result is still >5 times slower than
OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

This is an excellent opportunity to introduce a single assembler
routine that provides a huge speedup: roughly, a vector-times-vector
multiplication with specified start addresses, specified strides, and
a length.

    You mean something like:

    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
    next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
    ucount elements.
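
A plain Forth reference definition matching that glossary entry might
look like this (a sketch with a separate FP stack; strides are in
address units, and gforth's actual V* is a primitive):

: v* ( f-addr1 nstride1 f-addr2 nstride2 ucount -- ) ( F: -- r )
   0e
   0 ?do
      2over drop f@                          \ F: acc x
      2dup drop f@ f* f+                     \ F: acc + x*y
      2swap tuck + swap 2swap tuck + swap    \ advance both pointers
   loop
   2drop 2drop ;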

    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite
    a bit faster (Factor 13 slower than OpenBLAS) and can be implemented
    with

    'faxpy' ( ra f-x nstridex f-y nstridey ucount -- ) gforth-0.5 "faxpy"
    vy=ra*vx+vy

    FAXPY can be implemented in a way that selects a vectorized
    implementation if nstridex=nstridey=1 FLOATS. The result would be
    slower than OpenBLAS by a factor of 5 (all numbers for 1000x1000
    matrix multiplication).
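
A scalar reference definition matching that FAXPY glossary entry (a
sketch; the vectorized stride-1 case described above is not shown):

: faxpy ( f-x nstridex f-y nstridey ucount -- ) ( F: ra -- )
   0 ?do
      fdup 2over drop f@ f*                  \ F: ra ra*x[i]
      over dup f@ f+ f!                      \ y[i] := y[i] + ra*x[i]
      2swap tuck + swap 2swap tuck + swap    \ advance both pointers
   loop
   2drop 2drop fdrop ;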

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to Anton Ertl on Sun Dec 3 08:21:49 2023
    Anton Ertl wrote:
    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
    next at f_addr1+nstride1 and so on (similar for v2). Both vectors have ucount elements.

    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite
    a bit faster (Factor 13 slower than OpenBLAS) and can be implemented
    with

    It is not only about speed, but also about minimising calculation errors.

    For example, naive dot product summation in a single loop, which is unfortunately what gforth does, is prone to accumulating rounding errors.

    Nothing to blame here, but library functions are often "very smart".

  • From none) (albert@21:1/5 to Anton Ertl on Sun Dec 3 12:57:30 2023
    In article <2023Dec2.174433@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
    This is an excellent opportunity to introduce a single assembler
    routine that does a huge speed up.
    Approximately a vector times vector multiplication with
    specified start addresses, specified strides, and a length.

    You mean something like:

    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
ucount elements.


    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite

Losing that much while using all 8 registers of the 8087 stack
would be astonishing, if V* really is implemented in assembler.

If you do a more sophisticated version with at least 8 fp registers
available, you can easily prefetch 2 fp numbers in advance for
each stride.



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From mhx@21:1/5 to none on Sun Dec 3 13:26:56 2023
    none wrote:

    In article <2023Dec2.174433@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
    [..]
Losing that much while using all 8 registers of the 8087 stack
would be astonishing, if V* really is implemented in assembler.

That is because OpenBLAS uses AVX2 with all cores working
in parallel. Memory access patterns are accounted for, as is
every cycle possibly lost at the start and end of a loop.
It is of course possible to beat it with application-specific
tricks (the most obvious and effective is exploiting sparseness).

iForth's DAXPY is SSE2-based but uses only 1 core.
I have a lot to learn.

    CLK 4192 MHz ( 8 core machine )
    60x60 mm - normal algorithm 2.03 GFlops, 2.05 ticks/flop, 0.211 ms
    60x60 mm - blocking, factor of 20 1.02 GFlops, 4.09 ticks/flop, 0.422 ms
    60x60 mm - transposed B matrix 8.58 GFlops, 0.48 ticks/flop, 50.000 us
    60x60 mm - transposed B matrix #2 8.43 GFlops, 0.49 ticks/flop, 51.000 us
    60x60 mm - Robert's algorithm 9.36 GFlops, 0.44 ticks/flop, 46.000 us
    60x60 mm - T. Maeno's algorithm, subarray 20x20 1.06 GFlops, 3.91 ticks/flop, 0.403 ms
    60x60 mm - D. Warner's algorithm, subarray 20x20 1.02 GFlops, 4.07 ticks/flop, 0.419 ms
    60x60 mm - generic mat* 30.27 GFlops, 0.13 ticks/flop, 14.000 us
    60x60 mm - iForth DGEMM1 54.61 GFlops, 0.07 ticks/flop, 7.000 us
    60x60 mm - iForth SMMD* 54.89 GFlops, 0.07 ticks/flop, 7.000 us
    60x60 mm - iForth DAXPY based 7.76 GFlops, 0.53 ticks/flop, 55.000 us

    120x120 mm - normal algorithm 3.36 GFlops, 1.24 ticks/flop, 1.027 ms
    120x120 mm - blocking, factor of 20 0.99 GFlops, 4.19 ticks/flop, 3.461 ms
    120x120 mm - transposed B matrix 12.07 GFlops, 0.34 ticks/flop, 0.286 ms
    120x120 mm - transposed B matrix #2 11.97 GFlops, 0.35 ticks/flop, 0.288 ms
    120x120 mm - Robert's algorithm 13.01 GFlops, 0.32 ticks/flop, 0.265 ms
    120x120 mm - T. Maeno's algorithm, subarray 20x20 1.07 GFlops, 3.89 ticks/flop, 3.210 ms
    120x120 mm - D. Warner's algorithm, subarray 20x20 1.03 GFlops, 4.04 ticks/flop, 3.335 ms
    120x120 mm - generic mat* 111.25 GFlops, 0.03 ticks/flop, 31.000 us
    120x120 mm - iForth DGEMM1 120.47 GFlops, 0.03 ticks/flop, 28.000 us
    120x120 mm - iForth SMMD* 119.94 GFlops, 0.03 ticks/flop, 28.000 us
    120x120 mm - iForth DAXPY based 13.22 GFlops, 0.31 ticks/flop, 0.261 ms

    500x500 mm - normal algorithm 4.00 GFlops, 1.04 ticks/flop, 62.407 ms
    500x500 mm - blocking, factor of 20 1.04 GFlops, 4.02 ticks/flop, 0.240 s
    500x500 mm - transposed B matrix 16.75 GFlops, 0.25 ticks/flop, 14.919 ms
    500x500 mm - transposed B matrix #2 16.55 GFlops, 0.25 ticks/flop, 15.099 ms
    500x500 mm - Robert's algorithm 17.26 GFlops, 0.24 ticks/flop, 14.482 ms
    500x500 mm - T. Maeno's algorithm, subarray 20x20 1.08 GFlops, 3.87 ticks/flop, 0.231 s
    500x500 mm - D. Warner's algorithm, subarray 20x20 1.04 GFlops, 4.02 ticks/flop, 0.240 s
    500x500 mm - generic mat* 14.35 GFlops, 0.29 ticks/flop, 17.410 ms
    500x500 mm - iForth DGEMM1 67.18 GFlops, 0.06 ticks/flop, 3.721 ms
    500x500 mm - iForth SMMD* 67.45 GFlops, 0.06 ticks/flop, 3.706 ms
    500x500 mm - iForth DAXPY based 13.07 GFlops, 0.32 ticks/flop, 19.125 ms

    -marcel

  • From Anton Ertl@21:1/5 to minforth on Sun Dec 3 14:18:59 2023
    minforth@gmx.net (minforth) writes:
    Anton Ertl wrote:
'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
ucount elements.

    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite
    a bit faster (Factor 13 slower than OpenBLAS) and can be implemented
    with

    It is not only about speed, but also about minimising calculation errors.

For example, naive dot product summation in a single loop, which is
unfortunately what gforth does, is prone to accumulating rounding errors.

    Nothing to blame here, but library functions are often "very smart".

    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

    There are other libraries that are about accuracy, but not BLAS.
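
For reference, a minimal sketch of Kahan (compensated) summation in
Forth, assuming a packed array with a stride of 1 FLOATS (the
FVARIABLEs make it non-reentrant):

fvariable ksum   fvariable kcomp
: kahan-sum ( f-addr u -- ) ( F: -- r )
   0e ksum f!  0e kcomp f!
   0 ?do
      dup f@ kcomp f@ f-   \ F: y = x - comp
      ksum f@ fover f+     \ F: y t, where t = sum + y
      fdup ksum f@ f-      \ F: y t t-sum
      frot f-              \ F: t comp' = (t-sum) - y
      kcomp f!  ksum f!
      float+
   loop
   drop  ksum f@ ;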

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to Anton Ertl on Sun Dec 3 14:58:56 2023
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

    Kahan summation gives good results but can be very slow. As a good
    compromise, I prefer recursive summation of vector halves for dot products, until their size is small enough to fit into vector chunks ready for CPU-supported vector operations or intrinsics.

    Wikipedia has a small article on this called Pairwise Summation.
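
A recursive sketch of that idea (a hypothetical word; packed array
with a stride of 1 FLOATS, naive summation below a small cutoff):

: pairwise-sum ( f-addr u -- ) ( F: -- r )
   dup 8 <= if
      0e 0 ?do dup f@ f+ float+ loop drop
   else
      2dup 2/ recurse                      \ F: sum of first half
      dup 2/ rot over floats + rot rot -   \ address/count of 2nd half
      recurse f+
   then ;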

  • From Anton Ertl@21:1/5 to albert@cherry. on Sun Dec 3 13:54:03 2023
    albert@cherry.(none) (albert) writes:
    In article <2023Dec2.174433@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
This is an excellent opportunity to introduce a single assembler
routine that provides a huge speedup: roughly, a vector-times-vector
multiplication with specified start addresses, specified strides, and
a length.

    You mean something like:

    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
ucount elements.


    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite

Losing that much while using all 8 registers of the 8087 stack
would be astonishing, if V* really is implemented in assembler.

    It does not use the 8087 stack at all.

If you do a more sophisticated version with at least 8 fp registers
available, you can easily prefetch 2 fp numbers in advance for
each stride.

    That is irrelevant for the reasons given below, but it boils down to:
    The Tiger Lake on which I measured these speedups is a CPU with
    out-of-order execution (with 26 years of ancestry).

    The code in question is:

    0x000055ba08700990 <v_star+0>: pxor %xmm1,%xmm1
    0x000055ba08700994 <v_star+4>: test %r8,%r8
    0x000055ba08700997 <v_star+7>: je 0x55ba087009b8 <v_star+40>
    0x000055ba08700999 <v_star+9>: nopl 0x0(%rax)
    0x000055ba087009a0 <v_star+16>: movsd (%rdi),%xmm0
    0x000055ba087009a4 <v_star+20>: mulsd (%rdx),%xmm0
    0x000055ba087009a8 <v_star+24>: add %rsi,%rdi
    0x000055ba087009ab <v_star+27>: add %rcx,%rdx
    0x000055ba087009ae <v_star+30>: addsd %xmm0,%xmm1
    0x000055ba087009b2 <v_star+34>: sub $0x1,%r8
    0x000055ba087009b6 <v_star+38>: jne 0x55ba087009a0 <v_star+16>
    0x000055ba087009b8 <v_star+40>: movapd %xmm1,%xmm0
    0x000055ba087009bc <v_star+44>: ret

    with the inner loop from 0x55ba087009a0 <v_star+16> to
    0x000055ba087009b6 <v_star+38> (inclusive).

    The performance is determined by the dependence of the FP addition
    addsd on the result from the previous iteration. The latency of this
    FP addition is 4 cycles, and the whole matrix multiplication benchmark
    runs at 4.1 cycles per iteration of the inner loop (and the cost of
    the rest of the benchmark is spread over these cycles; that's the 0.1
    cycle).

    So what happens in the steady state is that all the other instructions
    are executed early (at around the same time as the addsd from 50
    iterations earlier; the Tiger Lake has a reorder buffer of 352
    instructions), so fetching two values into registers one iteration
earlier makes hardly any difference. Plus, the Tiger Lake contains
    hardware prefetchers that are very good at prefetching with constant
    stride, as in V*.

What could be done to make this faster is to add up, say, 4
intermediate sums in parallel, and finally compute the sum of these 4
intermediate sums (see the sketch below).
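
A sketch of that idea (a hypothetical V*4 for two packed vectors,
ucount a multiple of 4 for brevity; the four accumulators break the
FP-add dependency chain):

fvariable s0  fvariable s1  fvariable s2  fvariable s3
: v*4 ( f-addr1 f-addr2 ucount -- ) ( F: -- r )
   0e s0 f!  0e s1 f!  0e s2 f!  0e s3 f!
   2 rshift 0 ?do
      over f@ dup f@ f* s0 f@ f+ s0 f!  float+ swap float+ swap
      over f@ dup f@ f* s1 f@ f+ s1 f!  float+ swap float+ swap
      over f@ dup f@ f* s2 f@ f+ s2 f!  float+ swap float+ swap
      over f@ dup f@ f* s3 f@ f+ s3 f!  float+ swap float+ swap
   loop
   2drop  s0 f@ s1 f@ f+ s2 f@ f+ s3 f@ f+ ;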

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From none) (albert@21:1/5 to minforth on Sun Dec 3 16:31:13 2023
    In article <e93ff88202425b32916bae8123adf0b2@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

Kahan summation gives good results but can be very slow. As a good
compromise, I prefer recursive summation of vector halves for dot
products, until their size is small enough to fit into vector chunks
ready for CPU-supported vector operations or intrinsics.

    Wikipedia has a small article on this called Pairwise Summation.

Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.
Imagine a fly landing on the top of a church and a flea on top of
that. If you measure the height of the church precise to one mm,
the total height cannot be made more precise by reordering the
summands.
So I think it is mostly academic. The most precise calculation
I've done is 1/256 of an infrared wavelength over 60 m.
(That really required double precision floats. Chile, ESO telescopes.)
A more practical example is the thickness of steel pipelines on the
Brent oil rigs. You have to be content with 3 significant digits at
the very most.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From minforth@21:1/5 to none on Sun Dec 3 16:35:46 2023
    none wrote:

    In article <e93ff88202425b32916bae8123adf0b2@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

Kahan summation gives good results but can be very slow. As a good
compromise, I prefer recursive summation of vector halves for dot
products, until their size is small enough to fit into vector chunks
ready for CPU-supported vector operations or intrinsics.

    Wikipedia has a small article on this called Pairwise Summation.

Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.
Imagine a fly landing on the top of a church and a flea on top of
that. If you measure the height of the church precise to one mm,
the total height cannot be made more precise by reordering the
summands.
So I think it is mostly academic.

Well, we are not in the business of measuring academic belfry bugs ;-),
but signal vectors on the order of up to tens of thousands of samples.
There it is good engineering practice to keep an eye on error propagation.

You're right that under normal circumstances it doesn't matter. But
when you least expect it, it can ruin your day(s). Better be careful.

  • From Anton Ertl@21:1/5 to mhx on Sun Dec 3 17:58:07 2023
    mhx@iae.nl (mhx) writes:
    That is because OpenBLAS uses AVX2 with all cores working
    in parallel.

    I expect that it uses AVX-512 on the Tiger Lake which I measured. My measurements used only one core. Using more cores increases the CPU
    cycles needed (due to parallelization overhead), although it reduces
    the elapsed time.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to minforth on Sun Dec 3 18:02:08 2023
    minforth@gmx.net (minforth) writes:
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

Kahan summation gives good results but can be very slow. As a good
compromise, I prefer recursive summation of vector halves for dot
products, until their size is small enough to fit into vector chunks
ready for CPU-supported vector operations or intrinsics.

    For multiplying big matrices (and why would you care in case of small matrices?), the question is how to combine that with the memory access
    patterns that you want for efficiently using the memory subsystem for
    matrix multiplication, if it is possible at all. OpenBLAS certainly
    does not do that. The divide-and-conquer approach <https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Divide-and-conquer_algorithm>
    deals well with the memory subsystem, and may exhibit some of the
    properties you want, but at least in the implementation I did, I did
    not form intermediate matrices, but added the intermediate results to
    the appropriate elements in the target matrix C, so it does not have significantly better accuracy than the straightforward algorithm. If
    one stored intermediate results elsewhere for adding them pairwise,
    that would cost extra overhead. Maybe worth it, maybe not.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to albert@cherry. on Sun Dec 3 18:23:06 2023
    albert@cherry.(none) (albert) writes:
Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

produces 0 (false), even though with exact summation it would produce
true (-1). (The 1e is absorbed when added to 1e30: a double's 53-bit
significand carries only about 16 decimal digits.) Of course, you may
say that these numbers mean nothing to you, but you are not the only
one in the world.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From mhx@21:1/5 to Anton Ertl on Sun Dec 3 21:04:59 2023
    Anton Ertl wrote:

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...

    -marcel

  • From none) (albert@21:1/5 to Anton Ertl on Mon Dec 4 12:32:00 2023
    In article <2023Dec3.185807@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    mhx@iae.nl (mhx) writes:
    That is because OpenBLAS uses AVX2 with all cores working
    in parallel.

I expect that it uses AVX-512 on the Tiger Lake which I measured. My
measurements used only one core. Using more cores increases the CPU
    cycles needed (due to parallelization overhead), although it reduces
    the elapsed time.

I would be interested in a comparable time for these examples
run by OpenBLAS with one core.
I once tried to optimise on a transputer by starting a simple
loop at a cell boundary (a transputer has byte-based instructions).
The results were so puzzling that I kept away from trying.

    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From none) (albert@21:1/5 to Anton Ertl on Mon Dec 4 12:24:58 2023
    In article <2023Dec3.192306@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Try this with interval floats. A brief explanation:
a number 9.000 represents an interval between 8.9995 and 9.0005.
In this example the result is approximately
0 +/- 1E11 (with 19-digit-precision floats).

There are rules for propagating the intervals through multiplication,
addition, etc.


    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Krishna Myneni@21:1/5 to minforth on Mon Dec 4 06:55:21 2023
    On 12/1/23 06:32, minforth wrote:
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example. In other numerical matrix algorithms,
pivoting is rather common, which involves scalar column or row
multiplication. Most occurrences in my code involve shifting and
scaling of vectors.

    The example of matrix multiplication was not a good fit for F+!. We
    usually accumulate the sum on the stack and then store it at the
    destination in the matrix.

    --
    Krishna

  • From Anton Ertl@21:1/5 to mhx on Tue Dec 5 08:37:01 2023
    mhx@iae.nl (mhx) writes:
    Anton Ertl wrote:

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...

    So? Yes, it seems that the typical answer to issues of numerical
    errors has been to

    1) Replace fixed point with floating point, so you don't have to do
    analysis for scaling.

2) Use wider FP types, so you may be able to do without numerical
analysis (or, if you would still need it, you can hope to miss the
cases where you need it). I think that iForth uses 80-bit FP
numbers. Why?

    3) Use examples like the above to convince themselves that numerical
    analysis is not needed.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to albert@cherry. on Tue Dec 5 08:54:49 2023
    albert@cherry.(none) (albert) writes:
    I would be interested to have a comparable time with the examples
    done by OpenBlas with one core.

It's not clear what you want, but for 1000x1000 matrix multiplication
OpenBLAS uses 0.16 cycles per iteration of the inner loop of the
straightforward implementation when using one core (or 160M cycles for
the whole matrix multiplication).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to Krishna Myneni on Tue Dec 5 08:58:31 2023
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    The example of matrix multiplication was not a good fit for F+!. We
    usually accumulate the sum on the stack and then store it at the
    destination in the matrix.

    Who is "we"?

    Looking at
    <http://theforth.net/package/matmul/current-view/matmul.4th>, the
    fastest version on all systems that does not use a primitive FAXPY
    is version 2, and that spends most of its time in:

    : faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    dup >r 3 and 0 ?do
    fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    r> 2 rshift 0 ?do
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    \ better performance on gforth-fast:
    \ fdup swap dup f@ f* float+ swap dup f@ f+ dup f! float+
    loop
    2drop fdrop ;

    As you can see, it uses F+!.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From mhx@21:1/5 to Anton Ertl on Tue Dec 5 11:28:20 2023
    Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    Anton Ertl wrote:

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...

    So? Yes, it seems that the typical answer to issues of numerical
    errors has been to

    1) Replace fixed point with floating point, so you don't have to do
    analysis for scaling.

2) Use wider FP types, so you may be able to do without numerical
analysis (or, if you would still need it, you can hope to miss the
cases where you need it). I think that iForth uses 80-bit FP
numbers. Why?
    [..]

Because of (2), because some algorithms I care about are based
on doing selected steps in higher precision, and because the FPU
provides transcendental functions without needing libraries.

Nowadays I use double precision for speed (80-bit floats are about
2 to 3 times slower than 64-bit floats).

    3) Use examples like the above to convince themselves that numerical
    analysis is not needed.

    You misinterpret my posting. I find it illuminating when technical
    problems are visualized ( "2nm line-width means four Si atoms across" ).

    -marcel

  • From jan Coombs@21:1/5 to Anton Ertl on Tue Dec 5 12:08:01 2023
    On Tue, 05 Dec 2023 08:37:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    So? Yes, it seems that the typical answer to issues of numerical
    errors has been to

    1) Replace fixed point with floating point, so you don't have to do
    analysis for scaling.

    2) Use wider FP types, so you may be able to do without numerical
    analysis (or if you still would need it, you have the hope of missing
    the cases where you need it). I think that iForth uses 80-bit FP
    numbers. Why?

    3) Use examples like the above to convince themselves that numerical
    analysis is not needed.

    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    He also claims 50% processing power reduction for inherently compressed
    data, and less supervision of data due to all bits in data being valid,
    and none being lost by fixed-format constraints.

    Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

    Jan Coombs
    --

[1] The End of Error: Unum Computing, by John L. Gustafson
[sample chapters were available from publisher - ask privately]
https://www.taylorfrancis.com/books/mono/10.1201/9781315161532/end-error-john-gustafson

    [2] [was available from publisher - ask privately]

  • From Krishna Myneni@21:1/5 to Anton Ertl on Tue Dec 5 06:23:29 2023
    On 12/5/23 02:58, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    The example of matrix multiplication was not a good fit for F+!. We
    usually accumulate the sum on the stack and then store it at the
    destination in the matrix.

    Who is "we"?


    In the kForth double precision matrix multiply,

    https://github.com/mynenik/kForth-64/blob/master/forth-src/fsl/extras/mmul.4th

    The word DF_MUL_R1C2 obtains the sum of the products of a row from the
    first matrix with a column from the second matrix. The sum is
    accumulated on the stack.

    \ Multiply row of a1 with col of a2, element by element,
    \ and accumulate the sum.
    : df_mul_r1c2 ( row1 col2 -- ) ( F: -- rsum )
    df_r1c2>a1a2
    0e
    nc1 @ 0 DO
    2dup f@ f@ f* f+
    roffs2 @ +
    swap dfloat+ swap
    LOOP
    2drop ;

    The matrix multiplication word DF_MMUL subsequently stores the resulting
    rsum in the destination matrix, after the call to DF_MUL_R1C2.

    \ Multiply two double-precision matrices with data beginning at
    \ a1 and a2, and store at a3. Proper memory allocation is
    \ assumed, as are the dimensions for a2, i.e. nr2 = nc1 is
    \ assumed. This word assumes an integrated data/fp stack.
    : df_mmul ( a1 a2 a3 nr1 nc1 nc2 -- )
    set_mmul_params
    0 DO
    nc2 @ 0 DO
    J I df_mul_r1c2 dup f!
    dfloat+
    LOOP
    LOOP
    drop ;

Can the above be made faster with use of F+! within kForth? Possibly.

    --
    Krishna

  • From minforth@21:1/5 to jan Coombs on Tue Dec 5 12:52:36 2023
    jan Coombs wrote:

    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    He also claims 50% processing power reduction for inherently compressed
    data, and less supervision of data due to all bits in data being valid,
    and none being lost by fixed-format constraints.

Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

    Thanks for mentioning this. There is indeed a need for reduced, adaptable
    fp formats, especially in AI systems. See also the 'Motivation' section in https://github.com/stillwater-sc/universal

    There are already some experimental libraries using unum posits for various programming languages. Is there any Forth code that uses unums?

    But development will be slow as long as GPU hardware is cheap and readily available for faster time-to-market: https://www.windowscentral.com/microsoft/microsoft-to-spend-dollar32-billion-on-uks-ai-infrastructure-that-should-bring-more-than-20000-of-the-most-advanced-gpus-to-the-uk-by-2026

  • From none) (albert@21:1/5 to minforth on Tue Dec 5 14:07:05 2023
    In article <37830e4e5246f79b7d97247e4a973b1a@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    jan Coombs wrote:

    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    He also claims 50% processing power reduction for inherently compressed
    data, and less supervision of data due to all bits in data being valid,
    and none being lost by fixed-format constraints.

    Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

Thanks for mentioning this. There is indeed a need for reduced, adaptable
fp formats, especially in AI systems. See also the 'Motivation' section in
https://github.com/stillwater-sc/universal

There are already some experimental libraries using unum posits for various
programming languages. Is there any Forth code that uses unums?

But development will be slow as long as GPU hardware is cheap and readily
available for faster time-to-market:
https://www.windowscentral.com/microsoft/microsoft-to-spend-dollar32-billion-on-uks-ai-infrastructure-that-should-bring-more-than-20000-of-the-most-advanced-gpus-to-the-uk-by-2026

I doubt the necessity of fp formats in AI. 256 levels of uncertainty
must be plenty.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From minforth@21:1/5 to All on Tue Dec 5 13:32:55 2023
But development will be slow as long as GPU hardware is cheap and readily
available for faster time-to-market:
https://www.windowscentral.com/microsoft/microsoft-to-spend-dollar32-billion-on-uks-ai-infrastructure-that-should-bring-more-than-20000-of-the-most-advanced-gpus-to-the-uk-by-2026

I doubt the necessity of fp formats in AI. 256 levels of uncertainty
must be plenty.

    We probably won't live to see it, but embedded AI will be in every gadget
    of the distant future. A minimum of energy consumption will then be required. Unums promise an improvement here.

    Today's use of GPUs is only for big data centres. I am curious to know the
    peak power consumption of ChatGPT alone.

  • From Anton Ertl@21:1/5 to mhx on Tue Dec 5 15:14:13 2023
    mhx@iae.nl (mhx) writes:
Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...
    ...
    You misinterpret my posting. I find it illuminating when technical
    problems are visualized ( "2nm line-width means four Si atoms across" ).

I fail to visualize "the number of years since the big bang (14.5
billion years ago), square it and multiply by the height of Church
St. Spirit in meters". In particular, a squared timespan is pretty
unintuitive; I also don't know "Church St. Spirit" and its height. A
better way to visualize 10^30 is: the volume of Earth relative to the
volume of a 1.3mm sphere; or, alternatively, the ratio between the
weight of Earth and that of a grain of rice weighing 5.972mg.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to jan Coombs on Tue Dec 5 15:24:26 2023
    jan Coombs <jan4comp.lang.forth@murray-microft.co.uk> writes:
    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    The need for numerical analysis is such a problem that lots of people
    fall for snake-oil salesmen like Gustafson, but not enough that
anyone, not even gullible venture capitalists, invests significant
    money in it. By contrast, Mike Cowlishaw used similar arguments to
    convince IBM to implement his decimal FP nonsense in hardware (but for
    IBM, it may be a good way to convince gullible corporate managers to
    buy expensive IBM hardware, so it may be a win for them even though
    technically it is bullshit), and to convince IEEE to standardize it.

    Back to Gustafson: He apparently has seen that unums go nowhere, so in
    recent years he has switched to a new snake oil called posits. These
    are a variant of FP numbers, with the mantissa and exponent size
    depending on the exponent value. Which essentially would mean that
    you can throw all the numerical analysis up to now away and do it
    again. That's not going anywhere, either.

Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

    No. Variable-length data is always a pain. E.g., see strings in Forth.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to All on Wed Dec 6 08:00:36 2023
    Today, the predominant number format in algebraic computations
    within neural networks for deep learning (e.g. GEMMs for scaled
    matrix multiplication and addition) is bfloat16. Fixed (sub)tiles,
    e.g. 128x256, are directly supported by GPUs and can be computed
    in a single clock cycle. Although float32/64 can also be used,
    performance decreases more than quadratically with the element
    size. In return, quantisation and rounding errors are accepted,
    which are one of the causes of incorrect neural network outputs.
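
    To make the quantisation concrete, here is a toy bfloat16 truncation
    in Forth. It is only a sketch: real conversions round to nearest even
    rather than truncate, and a little-endian cell is assumed (as on
    x86); SF! and SF@ are the standard 32-bit float store/fetch.

    VARIABLE SBITS
    : >BF16 ( F: r -- r' )
      SBITS SF!                        \ store as IEEE single
      SBITS @ $FFFF0000 AND SBITS !    \ zero the 16 low mantissa bits
      SBITS SF@ ;                      \ reread: 8-bit exponent, 7-bit mantissa
    1.2345678e0 >BF16 F.   \ prints about 1.2344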

    So there is a real need for adaptive floats, and work is
    actually being done on their use. The term 'snake oil' is
    misleading (except perhaps in the old school Forth niche).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jan Coombs@21:1/5 to Anton Ertl on Wed Dec 6 10:28:23 2023
    On Tue, 05 Dec 2023 15:24:26 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    jan Coombs <jan4comp.lang.forth@murray-microft.co.uk> writes:
    [...]
    Might it be significantly simpler to implement variable-length data in
    hardware on a zero-operand processor than on a register-based one?

    No. Variable-length data is always a pain. E.g., see strings in Forth.

    Agreed, handling variable-length data, even in hardware, is much more
    complex than handling fixed-size integers. A processor doing this would
    need to have the current data set in cache, and preferably about twelve
    items per thread, in order to minimise fill and spill memory accesses.

    Strings could be handled as single stack items, and /mod used to split
    them, but where UTF-8 or other variable-length characters are used, a
    little extra hardware support would be needed to unpack, manipulate,
    and pack them. This may be useful, for example, to fetch an error
    message with a single (multi-cell) memory read and then forward it to
    the terminal.
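
    In software, that unpacking is a few shifts and masks per code point.
    A sketch of the xchar-style word XC@+ (standard $-prefixed hex
    literals; no validation of malformed sequences):

    : XC@+ ( a -- a' u )   \ fetch one UTF-8 code point, advance the address
      COUNT DUP $80 < IF EXIT THEN           \ ASCII fast path
      DUP $E0 < IF  $1F AND 1  ELSE          \ 110xxxxx : 1 more byte
      DUP $F0 < IF  $0F AND 2  ELSE          \ 1110xxxx : 2 more bytes
                    $07 AND 3  THEN THEN     \ 11110xxx : 3 more bytes
      0 ?DO  6 LSHIFT  OVER C@ $3F AND OR    \ shift in a continuation byte
             SWAP 1+ SWAP  LOOP ;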

    Jan Coombs
    --

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Krishna Myneni on Mon Dec 11 22:42:45 2023
    Krishna Myneni wrote:

    [...]

    Regarding code readability when no fp locals are available:

    Standard Forth only defines a reduced number of fp stack operations.
    I added

    FPICK (like PICK)
    FROLL (like ROLL)
    -FROLL (like ROLL reversed)
    F>R R>F (already discussed)

    Of course this only works if the FP stack is fully accessible, e.g.
    memory-mapped.
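
    With such a memory-mapped fp stack, FPICK itself is a one-liner. A
    sketch, where FP@ ( -- addr ) is a hypothetical system word returning
    the address of the fp top, with deeper items at higher addresses:

    : FPICK ( n -- ) ( F: i*r -- i*r rn )   \ 0 FPICK is FDUP
      FLOATS FP@ + F@ ;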

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to dxf on Mon Dec 11 19:49:10 2023
    On 12/11/23 19:07, dxf wrote:
    ...
    FSL has memory-mapped flocals. Can't be worse than reliance on FPICK and FROLL.

    ...

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the
    FSL implementation of flocals. Unfortunately, the standard fp stack
    operations in Forth 2012 prove insufficient; hence the consideration of
    words like FRISE.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to minforth on Mon Dec 11 19:44:48 2023
    On 12/11/23 16:42, minforth wrote:
    Krishna Myneni wrote:

    [...]

    Regarding code readability when no fp locals are available:

    Standard Forth only defines a reduced number of fp stack operations.
    I added

    FPICK (like PICK)
    FROLL (like ROLL)
    -FROLL (like ROLL reversed)
    F>R R>F (already discussed)

    Of course this only works if the FP stack is fully accessible, e.g.
    memory-mapped.

    Yes, I have added FPICK as an intrinsic word in kForth-64, and have
    source definitions of F>R and FR> (your R>F, which is actually a better
    name). But I think FRISE may reduce/eliminate the need for F>R etc. When
    the FP stack resides in memory and can be accessed using a pointer, it's
    easy to implement FRISE in source to assess its usefulness.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Krishna Myneni on Tue Dec 12 07:00:30 2023
    Krishna Myneni wrote:
    On 12/11/23 16:42, minforth wrote:
    [...]

    Yes, I have added FPICK as an intrinsic word in kForth-64, and have
    source definitions of F>R and FR> (your R>F, which is actually a better
    name). But I think FRISE may reduce/eliminate the need for F>R etc.
    When the FP stack resides in memory and can be accessed using a
    pointer, it's easy to implement FRISE in source to assess its
    usefulness.

    You have defined RISE as in
    2 RISE ( i*x a b c d -- i*x b a c d ) et cetera

    I don't really have an application where a position swap deep in the
    stack would fit, because Forth operations only ever use the top stack
    element(s).

    Then rather something like
    2 FLIP ( i*x a b c d -- d b c a )
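
    Taking the diagram at face value (exchange the fp top with the item
    n+1 deep), a sketch called FFLIP here, under the same memory-mapped
    assumption and with the hypothetical FP@ as above:

    : FFLIP ( n -- ) ( F: rk i*r r0 -- r0 i*r rk )
      1+ FLOATS FP@ +    \ address of the item n+1 deep
      DUP F@             \ copy it to the fp top
      FSWAP F! ;         \ and store the old top into its slot
    \ 2 FFLIP : a b c d -- d b c a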

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to Krishna Myneni on Tue Dec 12 08:49:57 2023
    Krishna Myneni wrote:

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the FSL implementation of flocals.

    I don't understand. This should be awkward, but ok?

    8 CONSTANT /flocals   \ fixed frame size: 8 fp slots

    : (frame) ( n -- ) FLOATS ALLOT ;   \ n>0 reserves, n<0 releases frame space

    : FRAME|
           0 >R
           BEGIN   BL WORD  COUNT  1 =        \ next word a single char,
                   SWAP C@  [CHAR] | =        \ and that char the closing | ?
                   AND 0=
           WHILE   POSTPONE F,  R> 1+ >R      \ compile F, : store one fp arg at HERE
           REPEAT
           /FLOCALS R> - DUP 0< ABORT" too many flocals"
           POSTPONE LITERAL  POSTPONE (frame) ; IMMEDIATE  \ allot unused slots

    : |FRAME ( -- ) [ /FLOCALS NEGATE ] LITERAL (FRAME) ;  \ release the frame

    : &h            HERE [ 1 FLOATS ] LITERAL - ;
    : &g            HERE [ 2 FLOATS ] LITERAL - ;
    : &f            HERE [ 3 FLOATS ] LITERAL - ;
    : &e            HERE [ 4 FLOATS ] LITERAL - ;
    : &d            HERE [ 5 FLOATS ] LITERAL - ;
    : &c            HERE [ 6 FLOATS ] LITERAL - ;
    : &b            HERE [ 7 FLOATS ] LITERAL - ;
    : &a            HERE [ 8 FLOATS ] LITERAL - ;  \ &a = first slot filled

    : a             &a F@ ;
    : b             &b F@ ;
    : c             &c F@ ;
    : d             &d F@ ;
    : e             &e F@ ;
    : f             &f F@ ;
    : g             &g F@ ;
    : h             &h F@ ;
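
    A hypothetical usage sketch -- the names written between FRAME| and |
    are documentation only; the accessors are always a, b, c, ..., with a
    holding the fp value that was on top at entry:

    : FSUMSQ ( F: x y -- x^2+y^2 )
      FRAME| y x |
      a FDUP F*  b FDUP F*  F+   \ or FSQUARE, as discussed upthread
      |FRAME ;
    1e0 2e0 FSUMSQ F.   \ prints 5.

    Each call allots its frame at HERE and releases it on exit, so
    recursive calls get fresh frames.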

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to minforth on Tue Dec 12 09:32:59 2023
    In article <743d2d729862d0b2fb9ff9ce314935dd@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    Krishna Myneni wrote:

    [...]

    Regarding code readability when no fp locals are available:

    Standard Forth only defines a reduced number of fp stack operations.
    I added

    FPICK (like PICK)
    FROLL (like ROLL)
    -FROLL (like ROLL reversed)
    F>R R>F (already discussed)

    Of course this only works if the FP stack is fully accessible, e.g.
    memory-mapped.

    If I wanted this, it shouldn't be too hard on the 8087's 8-register
    stack, which rotates. If the return stack is available as memory,
    storing FP stack items there is also doable.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to mhx on Tue Dec 12 12:54:59 2023
    In article <a23e7f246a3ab1c5d76d263804ebec69@news.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    Krishna Myneni wrote:

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the FSL
    implementation of flocals.

    I don't understand. This should be awkward, but ok?

    [FRAME| flocals code elided; quoted in full above]

    Re-entrant words come into play when the same code is concurrently
    executed in parallel threads (in Forth), or when the word is in a DLL
    or a resident library that can be accessed by several processes at the
    same time. I can see that you address that situation here.

    Not that I worry much about re-entrancy: I happily compile separate
    code for each parallel thread.

    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to mhx on Tue Dec 12 08:00:19 2023
    On 12/12/23 02:49, mhx wrote:
    Krishna Myneni wrote:

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the
    FSL implementation of flocals.

    I don't understand. This should be awkward, but ok?

    [FRAME| flocals code elided; quoted in full above]


    The above looks ok. It has been a while since I looked at this code.
    The implementation above allots space for new locals on each entry and
    frees it on exit -- I seem to be wrong about the FSL implementation
    killing re-entrancy.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to minforth on Tue Dec 12 19:30:10 2023
    On 12/12/23 01:00, minforth wrote:
    Krishna Myneni wrote:
    On 12/11/23 16:42, minforth wrote:
    [...]

    Yes, I have added FPICK as an intrinsic word in kForth-64, and have
    source definitions of F>R and FR> (your R>F, which is actually a
    better name). But I think FRISE may reduce/eliminate the need for F>R
    etc. When the FP stack resides in memory and can be accessed using a
    pointer, it's easy to implement FRISE in source to assess its usefulness.

    You have defined RISE as in
    2 RISE ( i*x a b c d -- i*x b a c d ) et cetera

    I don't really have an application where a position swap deep in the
    stack would fit, because Forth operations only ever use the top stack
    element(s).

    Then rather something like
    2 FLIP ( i*x a b c d -- d b c a )

    The depth 2 RISE/FRISE would provide the function I was originally
    asking for, but the general version is similar to FPICK. Admittedly,
    whether the general FRISE has application for other depths remains to be
    seen. Perhaps an on-fpstack sorting routine?
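
    For reference, the depth-n exchange is short when the fp stack is
    memory-mapped; a sketch with the hypothetical FP@ ( -- addr ) of the
    fp top, deeper items at higher addresses (n FRISE exchanges the items
    at depths n and n+1):

    : FRISE ( n -- )
      FLOATS FP@ +     \ address of the item at depth n
      DUP F@           \ copy item n to the top
      DUP FLOAT+ F@    \ copy item n+1 to the top
      DUP F!           \ item n+1 goes into slot n
      FLOAT+ F! ;      \ item n goes into slot n+1
    \ 2 FRISE : a b c d -- b a c d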

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Krishna Myneni on Wed Dec 13 08:43:35 2023
    Krishna Myneni wrote:
    The depth 2 RISE/FRISE would provide the function I was originally
    asking for, but the general version is similar to FPICK. Admittedly,
    whether the general FRISE has application for other depths remains to be seen.

    For similar reasons I used to have a word called PATCH as a counterpart
    to PICK: n PATCH overwrote the stack value at depth n, often handy for
    avoiding ROLLs. But in the end, such words are just crutches if you
    don't have locals.
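
    For the data stack, such a PATCH is a one-liner given a stack
    pointer. A sketch assuming SP@ returns the address of the top item
    and the stack grows toward lower addresses (true of e.g. Gforth, but
    not guaranteed by the standard):

    : PATCH ( x n -- )      \ overwrite the item at depth n (0 = top) with x
      2 + CELLS SP@ + ! ;   \ skip the offset cell, x, and n items
    10 20 30 99 1 PATCH .S  \ leaves 10 99 30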

    Perhaps an on-fpstack sorting routine?

    Yeah, brings order to chaos ;o)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jan Coombs@21:1/5 to minforth on Fri Dec 15 15:30:01 2023
    On Tue, 5 Dec 2023 12:52:36 +0000
    minforth@gmx.net (minforth) wrote:

    [about floats with improved performance]
    Thanks for mentioning this. There is indeed a need for reduced, adaptable
    fp formats, especially in AI systems. See also the 'Motivation' section in https://github.com/stillwater-sc/universal

    Thanks, have added that to my reference docs.

    There are already some experimental libraries using unum posits for various programming languages. Is there any Forth code that uses unums?

    A quick route would be to buy a processor with posits [1] built in and
    install Forth. I thought RISC-V ones were available, but looking now I
    only found a prototype [2], a product announcement [3], and available
    HW designs [4][5].

    Jan Coombs
    --

    [1] "Posits, a New Kind of Number, Improves the Math of AI: The first
    posit-based processor core gave a ten-thousandfold accuracy boost",
    https://spectrum.ieee.org/floating-point-numbers-posits-processor

    [2] "Researchers Build a RISC-V Chip That Calculates in Posits, Boosting
    Accuracy for ML Workloads",
    https://www.hackster.io/news/researchers-build-a-risc-v-chip-that-calculates-in-posits-boosting-accuracy-for-ml-workloads-086b985bf0c1

    [3] Marco Cococcioni, Federico Rossi, Emanuele Ruffaldi, and Sergio
    Saponara, "A Lightweight Posit Processing Unit for RISC-V Processors in
    Deep Neural Network Applications", IEEE Transactions on Emerging Topics
    in Computing,
    https://riscv.org/news/2021/10/a-lightweight-posit-processing-unit-for-risc-v-processors-in-deep-neural-network-applications-marco-cococcioni-federico-rossi-emanuele-ruffaldi-and-saponara-sergio-ieee-transactions-on-emerging/

    [4] "PERI: A Configurable Posit Enabled RISC-V Core",
    https://dl.acm.org/doi/fullHtml/10.1145/3446210

    [5] "PERCIVAL: Open-Source Posit RISC-V Core with Quire Capability",
    https://arxiv.org/abs/2111.15286

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)