• Eliminating FDUP F*

    From Krishna Myneni@21:1/5 to All on Wed Nov 29 07:29:04 2023
Use of the sequence "FDUP F*" is ubiquitous in Forth scientific code,
for lack of a common word which squares an fp number. The sequence is
not only less readable but also conveys less meaning to anyone reading
the code.

I've updated the FSL modules in kForth (32, Win32, and 64) to replace
all instances of "FDUP F*" with the (built-in) word FSQUARE. Some FSL
modules provided definitions of FSQR for the same function (by MHX),
and I replaced these instances with FSQUARE, which I find more readable
and less error-prone given the proximity of FSQR to FSQRT.
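
For systems which do not provide it, a minimal fallback definition is
trivial (a sketch; kForth's FSQUARE is built in):

\ Fallback for systems without a built-in FSQUARE:
: FSQUARE ( F: r -- r^2 )  FDUP F* ;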

    --
    Krishna Myneni

  • From minforth@21:1/5 to All on Wed Nov 29 14:29:09 2023
    Thanks.

In my apps I added for convenience (naive sketches below):

FINV     alias 1/F
F2*  F2/
FHYPOT   sqrt(a^2+b^2)
FMA      Horner step, a*b+c
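
Naive reference definitions, as a sketch only (native versions would
map to FPU or libm primitives; a robust FHYPOT also needs the
overflow-safe scaling discussed below, and a true FMA rounds only
once, which this sketch does not):

: FINV   ( F: r -- 1/r )        1e FSWAP F/ ;
: F2*    ( F: r -- r*2 )        2e F* ;
: F2/    ( F: r -- r/2 )        2e F/ ;
: FHYPOT ( F: a b -- r )        FDUP F* FSWAP FDUP F* F+ FSQRT ;
: FMA    ( F: a b c -- a*b+c )  FROT FROT F* F+ ;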

  • From mhx@21:1/5 to minforth on Wed Nov 29 18:05:25 2023
    minforth wrote:

    FHYPOT sqrt(a^2+b^2)

    This is a nice one (that iForth does
    not have) because FHYPOT is not only
    more efficient but also documents a
    tricky numerical problem.
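
The tricky problem is presumably overflow and underflow in the naive
formula: for IEEE doubles, squaring any magnitude above roughly
1.3e154 overflows even when the true hypotenuse is representable. A
scaled sketch which avoids this:

\ Overflow-safe FHYPOT sketch: square only the ratio min/max,
\ which is at most 1.
: FHYPOT ( F: a b -- r )
   FABS FSWAP FABS                        \ F: |b| |a|
   FOVER FOVER FMIN FROT FROT FMAX        \ F: min max
   FDUP F0= IF FDROP FDROP 0e EXIT THEN   \ both zero
   FSWAP FOVER F/                         \ F: max t=min/max
   FDUP F* 1e F+ FSQRT F* ;               \ max*sqrt(1+t^2)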

    -marcel

  • From minforth@21:1/5 to mhx on Wed Nov 29 20:45:46 2023
    mhx wrote:
    FHYPOT sqrt(a^2+b^2)

    This is a nice one (that iForth does
    not have) because FHYPOT is not only
    more efficient but also documents a
    tricky numerical problem.

Perhaps as described here?
    https://arxiv.org/pdf/1904.09481.pdf

  • From Krishna Myneni@21:1/5 to minforth on Wed Nov 29 19:02:45 2023
    On 11/29/23 08:29, minforth wrote:
    Thanks.

    In my apps I added for convenience
    FINV alias 1/F
    F2* F2/
    FHYPOT    sqrt(a^2+b^2)
    FMA    horner a*b+c

    FINV is also a commonly needed word, instead of writing

    "1.0E0 FSWAP F/".

    The other most useful word for vector/matrix code is F+!, which also
    improves the efficiency, readability, and compactness of code. Use of
    F+! can be found in the FSL modules.

    F+! has common usage and is easily comprehensible so it may be time to
    enter it formally into the Forth floating point lexicon.
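
For reference, the usual definition (a sketch; some systems provide
it as a primitive):

\ Add r to the float stored at f-addr.
: F+! ( f-addr -- ) ( F: r -- )  DUP F@ F+ F! ;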

    --
    KM

  • From minforth@21:1/5 to Krishna Myneni on Thu Nov 30 08:22:58 2023
    Krishna Myneni wrote:

    On 11/29/23 08:29, minforth wrote:
    Thanks.

    In my apps I added for convenience
    FINV alias 1/F
    F2* F2/
    FHYPOT    sqrt(a^2+b^2)
    FMA    horner a*b+c

    FINV is also a commonly needed word, instead of writing

    "1.0E0 FSWAP F/".

    The other most useful word for vector/matrix code is F+!, which also
    improves the efficiency, readability, and compactness of code. Use of
    F+! can be found in the FSL modules.

    F+! has common usage and is easily comprehensible so it may be time to
    enter it formally into the Forth floating point lexicon.

May I add F*! for scalar operations on vector/matrix elements?
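
A sketch, by analogy with F+! (not a standard word):

\ Multiply the float stored at f-addr by r.
: F*! ( f-addr -- ) ( F: r -- )  DUP F@ F* F! ;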

  • From none) (albert@21:1/5 to mhx on Thu Nov 30 10:51:59 2023
    In article <104d31d0da99003ed0dc323134d1243c@news.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    minforth wrote:

    FHYPOT sqrt(a^2+b^2)

    This is a nice one (that iForth does
    not have) because FHYPOT is not only
    more efficient but also documents a
    tricky numerical problem.

The hyp calculation is stable as hell; I can't think
of any numerical problem.
It is also useful. I added a `` HYPOs '' as a separate
screen to my fixed point screen.
(Using DSQRT, it is not particularly difficult to
implement.)

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Krishna Myneni@21:1/5 to minforth on Thu Nov 30 15:26:32 2023
    On 11/30/23 02:22, minforth wrote:
    Krishna Myneni wrote:

    ...
    The other most useful word for vector/matrix code is F+!, which also
    improves the efficiency, readability, and compactness of code. Use of
    F+! can be found in the FSL modules.

    F+! has common usage and is easily comprehensible so it may be time to
    enter it formally into the Forth floating point lexicon.

    May I add F*! for scalar operations on vector/matrix elements

It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

    --
    Krishna

  • From minforth@21:1/5 to Krishna Myneni on Fri Dec 1 12:32:03 2023
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example. In other numerical matrix algorithms,
pivoting is rather common, which involves scalar column or row
multiplication. Most occurrences in my code involve shifting and
scaling of vectors.

  • From Anton Ertl@21:1/5 to minforth on Sat Dec 2 07:06:51 2023
    minforth@gmx.net (minforth) writes:
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example.

Not in my experience. Matrix multiplication always multiplies one
element of one matrix with one element of the other matrix. Since you
still need both matrices, you do not want to use F*! for that. Matrix
multiplication sums the products of these multiplications; e.g., a
1000x1000 matrix multiply sums up 1000 products to produce one element
of the target matrix. F+! can be used for that.
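
As a sketch of that inner update, with a hypothetical helper
ACC-PRODUCT (a-addr, b-addr, c-addr are the addresses of the current
A, B, and C elements; F+! as sketched earlier):

\ c := c + a*b for one element triple.
: ACC-PRODUCT ( a-addr b-addr c-addr -- )
   >R F@ F@ F* R> F+! ;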

But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
the naive variant that uses a dot product, the speedup exceeds a
factor of 25 (slide 78). Even when the auto-vectorization of gcc
kicks in (with -O3), the result is still >5 times slower than
OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to Anton Ertl on Sat Dec 2 09:01:53 2023
    Anton Ertl wrote:
Matrix multiplication (if not available as a primitive or from an
external library) is an example.

Not in my experience. Matrix multiplication always multiplies one
element of one matrix with one element of the other matrix. Since you
still need both matrices, you do not want to use F*! for that. Matrix
multiplication sums the products of these multiplications; e.g., a
1000x1000 matrix multiply sums up 1000 products to produce one element
of the target matrix. F+! can be used for that.

But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
    the naive variant that uses a dot product, the speedup exceeds a
    factor of 25 (slide 78). Even when the auto-vectorization of gcc
    kicks in (with -O3), the result is still >5 times slower than
    OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

    Yes. On desktop systems, it makes little sense not to use numerical maths libraries for such problems. Large matrices are usually decomposed into
    blocks, and sparse matrices require special techniques. It would be quite tedious to reinvent all the wheels and program them by hand in Forth code,
    let alone debug and optimise your creation.

    Things are different, however, if you don't have the space to hold fat
    library files. In resource-constrained systems, you'll prefer in-place algorithms wherever possible. If you can do the calculations in background tasks, speed is not important. And LU decomposition helps a lot, but that
    is no surprise.

  • From none) (albert@21:1/5 to Anton Ertl on Sat Dec 2 14:56:03 2023
    In article <2023Dec2.080651@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    minforth@gmx.net (minforth) writes:
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example.

Not in my experience. Matrix multiplication always multiplies one
element of one matrix with one element of the other matrix. Since you
still need both matrices, you do not want to use F*! for that. Matrix
multiplication sums the products of these multiplications; e.g., a
1000x1000 matrix multiply sums up 1000 products to produce one element
of the target matrix. F+! can be used for that.

But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
the naive variant that uses a dot product, the speedup exceeds a
factor of 25 (slide 78). Even when the auto-vectorization of gcc
kicks in (with -O3), the result is still >5 times slower than
OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

This is an excellent opportunity to introduce a single assembler
routine that provides a huge speedup: roughly, a vector-times-vector
multiplication with specified start addresses, specified strides, and
a length.



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Anton Ertl@21:1/5 to albert@cherry. on Sat Dec 2 16:44:33 2023
    albert@cherry.(none) (albert) writes:
    In article <2023Dec2.080651@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
But for these kinds of things, it's better to use specialized code,
such as OpenBLAS. E.g., if you look at slides 80 and 87 of
https://www.complang.tuwien.ac.at/anton/lvas/efficient.pdf, you see
that OpenBLAS is >13 times as fast for 1000x1000 matrix multiplication
(on a Tiger Lake CPU) as a straightforward scalar implementation of
matrix multiplication that uses the best loop nesting. Compared to
the naive variant that uses a dot product, the speedup exceeds a
factor of 25 (slide 78). Even when the auto-vectorization of gcc
kicks in (with -O3), the result is still >5 times slower than
OpenBLAS.

    "THP" on these slides means that transparent huge pages are enabled
    and kick in (there is no guarantee that they kick in if they are
    enabled).

This is an excellent opportunity to introduce a single assembler
routine that provides a huge speedup: roughly, a vector-times-vector
multiplication with specified start addresses, specified strides, and
a length.

    You mean something like:

    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
    next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
    ucount elements.
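
A plain Forth reference definition matching that glossary entry might
look like this (a sketch with a separate FP stack; strides are in
address units, and gforth's actual V* is a primitive):

: v* ( f-addr1 nstride1 f-addr2 nstride2 ucount -- ) ( F: -- r )
   0e
   0 ?do
      2over drop f@                          \ F: acc x
      2dup drop f@ f* f+                     \ F: acc + x*y
      2swap tuck + swap 2swap tuck + swap    \ advance both pointers
   loop
   2drop 2drop ;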

    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite
    a bit faster (Factor 13 slower than OpenBLAS) and can be implemented
    with

    'faxpy' ( ra f-x nstridex f-y nstridey ucount -- ) gforth-0.5 "faxpy"
    vy=ra*vx+vy

    FAXPY can be implemented in a way that selects a vectorized
    implementation if nstridex=nstridey=1 FLOATS. The result would be
    slower than OpenBLAS by a factor of 5 (all numbers for 1000x1000
    matrix multiplication).
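
A scalar reference definition matching that FAXPY glossary entry (a
sketch; the vectorized stride-1 case described above is not shown):

: faxpy ( f-x nstridex f-y nstridey ucount -- ) ( F: ra -- )
   0 ?do
      fdup 2over drop f@ f*                  \ F: ra ra*x[i]
      over dup f@ f+ f!                      \ y[i] := y[i] + ra*x[i]
      2swap tuck + swap 2swap tuck + swap    \ advance both pointers
   loop
   2drop 2drop fdrop ;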

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to Anton Ertl on Sun Dec 3 08:21:49 2023
    Anton Ertl wrote:
    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
    next at f_addr1+nstride1 and so on (similar for v2). Both vectors have ucount elements.

    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite
    a bit faster (Factor 13 slower than OpenBLAS) and can be implemented
    with

    It is not only about speed, but also about minimising calculation errors.

    For example, naive dot product summation in a single loop, which is unfortunately what gforth does, is prone to accumulating rounding errors.

    Nothing to blame here, but library functions are often "very smart".

  • From none) (albert@21:1/5 to Anton Ertl on Sun Dec 3 12:57:30 2023
    In article <2023Dec2.174433@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
    This is an excellent opportunity to introduce a single assembler
    routine that does a huge speed up.
    Approximately a vector times vector multiplication with
    specified start addresses, specified strides, and a length.

    You mean something like:

    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
ucount elements.


    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite

Losing that much while using all 8 registers of the 8087 stack
would be astonishing, if V* really is implemented in assembler.

If you do a more sophisticated version with at least 8 fp registers
available, you can easily prefetch 2 fp numbers in advance for
each stride.



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From mhx@21:1/5 to none on Sun Dec 3 13:26:56 2023
    none wrote:

    In article <2023Dec2.174433@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
    [..]
Losing that much while using all 8 registers of the 8087 stack
would be astonishing, if V* really is implemented in assembler.

That is because OpenBLAS uses AVX2 with all cores working
in parallel. Memory access patterns are accounted for, as is
every cycle possibly lost at the start and end of a loop.
It is of course possible to beat it with application-specific
tricks (the most obvious and effective is exploiting sparseness).

iForth's DAXPY is SSE2-based but uses only 1 core.
I have a lot to learn.

    CLK 4192 MHz ( 8 core machine )
    60x60 mm - normal algorithm 2.03 GFlops, 2.05 ticks/flop, 0.211 ms
    60x60 mm - blocking, factor of 20 1.02 GFlops, 4.09 ticks/flop, 0.422 ms
    60x60 mm - transposed B matrix 8.58 GFlops, 0.48 ticks/flop, 50.000 us
    60x60 mm - transposed B matrix #2 8.43 GFlops, 0.49 ticks/flop, 51.000 us
    60x60 mm - Robert's algorithm 9.36 GFlops, 0.44 ticks/flop, 46.000 us
    60x60 mm - T. Maeno's algorithm, subarray 20x20 1.06 GFlops, 3.91 ticks/flop, 0.403 ms
    60x60 mm - D. Warner's algorithm, subarray 20x20 1.02 GFlops, 4.07 ticks/flop, 0.419 ms
    60x60 mm - generic mat* 30.27 GFlops, 0.13 ticks/flop, 14.000 us
    60x60 mm - iForth DGEMM1 54.61 GFlops, 0.07 ticks/flop, 7.000 us
    60x60 mm - iForth SMMD* 54.89 GFlops, 0.07 ticks/flop, 7.000 us
    60x60 mm - iForth DAXPY based 7.76 GFlops, 0.53 ticks/flop, 55.000 us

    120x120 mm - normal algorithm 3.36 GFlops, 1.24 ticks/flop, 1.027 ms
    120x120 mm - blocking, factor of 20 0.99 GFlops, 4.19 ticks/flop, 3.461 ms
    120x120 mm - transposed B matrix 12.07 GFlops, 0.34 ticks/flop, 0.286 ms
    120x120 mm - transposed B matrix #2 11.97 GFlops, 0.35 ticks/flop, 0.288 ms
    120x120 mm - Robert's algorithm 13.01 GFlops, 0.32 ticks/flop, 0.265 ms
    120x120 mm - T. Maeno's algorithm, subarray 20x20 1.07 GFlops, 3.89 ticks/flop, 3.210 ms
    120x120 mm - D. Warner's algorithm, subarray 20x20 1.03 GFlops, 4.04 ticks/flop, 3.335 ms
    120x120 mm - generic mat* 111.25 GFlops, 0.03 ticks/flop, 31.000 us
    120x120 mm - iForth DGEMM1 120.47 GFlops, 0.03 ticks/flop, 28.000 us
    120x120 mm - iForth SMMD* 119.94 GFlops, 0.03 ticks/flop, 28.000 us
    120x120 mm - iForth DAXPY based 13.22 GFlops, 0.31 ticks/flop, 0.261 ms

    500x500 mm - normal algorithm 4.00 GFlops, 1.04 ticks/flop, 62.407 ms
    500x500 mm - blocking, factor of 20 1.04 GFlops, 4.02 ticks/flop, 0.240 s
    500x500 mm - transposed B matrix 16.75 GFlops, 0.25 ticks/flop, 14.919 ms
    500x500 mm - transposed B matrix #2 16.55 GFlops, 0.25 ticks/flop, 15.099 ms
    500x500 mm - Robert's algorithm 17.26 GFlops, 0.24 ticks/flop, 14.482 ms
    500x500 mm - T. Maeno's algorithm, subarray 20x20 1.08 GFlops, 3.87 ticks/flop, 0.231 s
    500x500 mm - D. Warner's algorithm, subarray 20x20 1.04 GFlops, 4.02 ticks/flop, 0.240 s
    500x500 mm - generic mat* 14.35 GFlops, 0.29 ticks/flop, 17.410 ms
    500x500 mm - iForth DGEMM1 67.18 GFlops, 0.06 ticks/flop, 3.721 ms
    500x500 mm - iForth SMMD* 67.45 GFlops, 0.06 ticks/flop, 3.706 ms
    500x500 mm - iForth DAXPY based 13.07 GFlops, 0.32 ticks/flop, 19.125 ms

    -marcel

  • From Anton Ertl@21:1/5 to minforth on Sun Dec 3 14:18:59 2023
    minforth@gmx.net (minforth) writes:
    Anton Ertl wrote:
'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
ucount elements.

    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite
    a bit faster (Factor 13 slower than OpenBLAS) and can be implemented
    with

    It is not only about speed, but also about minimising calculation errors.

For example, naive dot product summation in a single loop, which is
unfortunately what gforth does, is prone to accumulating rounding errors.

    Nothing to blame here, but library functions are often "very smart".

    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

    There are other libraries that are about accuracy, but not BLAS.
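
For reference, a minimal sketch of Kahan (compensated) summation in
Forth, assuming a packed array with a stride of 1 FLOATS (the
FVARIABLEs make it non-reentrant):

fvariable ksum   fvariable kcomp
: kahan-sum ( f-addr u -- ) ( F: -- r )
   0e ksum f!  0e kcomp f!
   0 ?do
      dup f@ kcomp f@ f-   \ F: y = x - comp
      ksum f@ fover f+     \ F: y t, where t = sum + y
      fdup ksum f@ f-      \ F: y t t-sum
      frot f-              \ F: t comp' = (t-sum) - y
      kcomp f!  ksum f!
      float+
   loop
   drop  ksum f@ ;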

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to Anton Ertl on Sun Dec 3 14:58:56 2023
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

    Kahan summation gives good results but can be very slow. As a good
    compromise, I prefer recursive summation of vector halves for dot products, until their size is small enough to fit into vector chunks ready for CPU-supported vector operations or intrinsics.

    Wikipedia has a small article on this called Pairwise Summation.
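
A recursive sketch of that idea (a hypothetical word; packed array
with a stride of 1 FLOATS, naive summation below a small cutoff):

: pairwise-sum ( f-addr u -- ) ( F: -- r )
   dup 8 <= if
      0e 0 ?do dup f@ f+ float+ loop drop
   else
      2dup 2/ recurse                      \ F: sum of first half
      dup 2/ rot over floats + rot rot -   \ address/count of 2nd half
      recurse f+
   then ;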

  • From Anton Ertl@21:1/5 to albert@cherry. on Sun Dec 3 13:54:03 2023
    albert@cherry.(none) (albert) writes:
    In article <2023Dec2.174433@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
This is an excellent opportunity to introduce a single assembler
routine that provides a huge speedup: roughly, a vector-times-vector
multiplication with specified start addresses, specified strides, and
a length.

    You mean something like:

    'v*' ( f-addr1 nstride1 f-addr2 nstride2 ucount -- r ) gforth-0.5 "v-star"
    dot-product: r=v1*v2. The first element of v1 is at f_addr1, the
next at f_addr1+nstride1 and so on (similar for v2). Both vectors have
ucount elements.


    However, note that the dot-product variant is slower than OpenBLAS by
    a factor of 25. The best scalar implementation from slide 80 is quite

Losing that much while using all 8 registers of the 8087 stack
would be astonishing, if V* really is implemented in assembler.

    It does not use the 8087 stack at all.

If you do a more sophisticated version with at least 8 fp registers
available, you can easily prefetch 2 fp numbers in advance for
each stride.

    That is irrelevant for the reasons given below, but it boils down to:
    The Tiger Lake on which I measured these speedups is a CPU with
    out-of-order execution (with 26 years of ancestry).

    The code in question is:

    0x000055ba08700990 <v_star+0>: pxor %xmm1,%xmm1
    0x000055ba08700994 <v_star+4>: test %r8,%r8
    0x000055ba08700997 <v_star+7>: je 0x55ba087009b8 <v_star+40>
    0x000055ba08700999 <v_star+9>: nopl 0x0(%rax)
    0x000055ba087009a0 <v_star+16>: movsd (%rdi),%xmm0
    0x000055ba087009a4 <v_star+20>: mulsd (%rdx),%xmm0
    0x000055ba087009a8 <v_star+24>: add %rsi,%rdi
    0x000055ba087009ab <v_star+27>: add %rcx,%rdx
    0x000055ba087009ae <v_star+30>: addsd %xmm0,%xmm1
    0x000055ba087009b2 <v_star+34>: sub $0x1,%r8
    0x000055ba087009b6 <v_star+38>: jne 0x55ba087009a0 <v_star+16>
    0x000055ba087009b8 <v_star+40>: movapd %xmm1,%xmm0
    0x000055ba087009bc <v_star+44>: ret

    with the inner loop from 0x55ba087009a0 <v_star+16> to
    0x000055ba087009b6 <v_star+38> (inclusive).

    The performance is determined by the dependence of the FP addition
    addsd on the result from the previous iteration. The latency of this
    FP addition is 4 cycles, and the whole matrix multiplication benchmark
    runs at 4.1 cycles per iteration of the inner loop (and the cost of
    the rest of the benchmark is spread over these cycles; that's the 0.1
    cycle).

    So what happens in the steady state is that all the other instructions
    are executed early (at around the same time as the addsd from 50
    iterations earlier; the Tiger Lake has a reorder buffer of 352
    instructions), so fetching two values into registers one iteration
earlier makes hardly any difference. Plus, the Tiger Lake contains
    hardware prefetchers that are very good at prefetching with constant
    stride, as in V*.

What could be done to make this faster is to add up, say, 4
intermediate sums in parallel, and finally compute the sum of these 4
intermediate sums (see the sketch below).
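
A sketch of that idea (a hypothetical V*4 for two packed vectors,
ucount a multiple of 4 for brevity; the four accumulators break the
FP-add dependency chain):

fvariable s0  fvariable s1  fvariable s2  fvariable s3
: v*4 ( f-addr1 f-addr2 ucount -- ) ( F: -- r )
   0e s0 f!  0e s1 f!  0e s2 f!  0e s3 f!
   2 rshift 0 ?do
      over f@ dup f@ f* s0 f@ f+ s0 f!  float+ swap float+ swap
      over f@ dup f@ f* s1 f@ f+ s1 f!  float+ swap float+ swap
      over f@ dup f@ f* s2 f@ f+ s2 f!  float+ swap float+ swap
      over f@ dup f@ f* s3 f@ f+ s3 f!  float+ swap float+ swap
   loop
   2drop  s0 f@ s1 f@ f+ s2 f@ f+ s3 f@ f+ ;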

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From none) (albert@21:1/5 to minforth on Sun Dec 3 16:31:13 2023
    In article <e93ff88202425b32916bae8123adf0b2@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

Kahan summation gives good results but can be very slow. As a good
compromise, I prefer recursive summation of vector halves for dot
products, until their size is small enough to fit into vector chunks
ready for CPU-supported vector operations or intrinsics.

    Wikipedia has a small article on this called Pairwise Summation.

Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.
Imagine a fly landing on the top of a church and a flea on top of
that. If you measure the height of the church precise to one mm,
the total height cannot be made more precise by reordering the
summands.
So I think it is mostly academic. The most precise calculation
I've done is 1/256 of an infrared wavelength over 60 m.
(That really required double precision floats. Chile, ESO telescopes.)
A more practical example is the thickness of steel pipelines on the
Brent oil rigs. You have to be content with 3 significant digits at
the very most.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From minforth@21:1/5 to none on Sun Dec 3 16:35:46 2023
    none wrote:

    In article <e93ff88202425b32916bae8123adf0b2@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

Kahan summation gives good results but can be very slow. As a good
compromise, I prefer recursive summation of vector halves for dot
products, until their size is small enough to fit into vector chunks
ready for CPU-supported vector operations or intrinsics.

    Wikipedia has a small article on this called Pairwise Summation.

Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.
Imagine a fly landing on the top of a church and a flea on top of
that. If you measure the height of the church precise to one mm,
the total height cannot be made more precise by reordering the
summands.
So I think it is mostly academic.

Well, we are not in the business of measuring academic belfry bugs ;-),
but signal vectors on the order of up to tens of thousands of samples.
There it is good engineering practice to keep an eye on error propagation.

You're right that under normal circumstances it doesn't matter. But
when you least expect it, it can ruin your day(s). Better be careful.

  • From Anton Ertl@21:1/5 to mhx on Sun Dec 3 17:58:07 2023
    mhx@iae.nl (mhx) writes:
    That is because OpenBLAS uses AVX2 with all cores working
    in parallel.

    I expect that it uses AVX-512 on the Tiger Lake which I measured. My measurements used only one core. Using more cores increases the CPU
    cycles needed (due to parallelization overhead), although it reduces
    the elapsed time.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to minforth on Sun Dec 3 18:02:08 2023
    minforth@gmx.net (minforth) writes:
    Anton Ertl wrote:
    The BLAS implementations seem to be only about speed. None that I am
    aware of uses, e.g., Kahan summation to reduce rounding errors.

Kahan summation gives good results but can be very slow. As a good
compromise, I prefer recursive summation of vector halves for dot
products, until their size is small enough to fit into vector chunks
ready for CPU-supported vector operations or intrinsics.

    For multiplying big matrices (and why would you care in case of small matrices?), the question is how to combine that with the memory access
    patterns that you want for efficiently using the memory subsystem for
    matrix multiplication, if it is possible at all. OpenBLAS certainly
    does not do that. The divide-and-conquer approach <https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Divide-and-conquer_algorithm>
    deals well with the memory subsystem, and may exhibit some of the
    properties you want, but at least in the implementation I did, I did
    not form intermediate matrices, but added the intermediate results to
    the appropriate elements in the target matrix C, so it does not have significantly better accuracy than the straightforward algorithm. If
    one stored intermediate results elsewhere for adding them pairwise,
    that would cost extra overhead. Maybe worth it, maybe not.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to albert@cherry. on Sun Dec 3 18:23:06 2023
    albert@cherry.(none) (albert) writes:
Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

produces 0 (false), even though with exact summation it would produce
true (-1). (The 1e is absorbed when added to 1e30: a double's 53-bit
significand carries only about 16 decimal digits.) Of course, you may
say that these numbers mean nothing to you, but you are not the only
one in the world.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From mhx@21:1/5 to Anton Ertl on Sun Dec 3 21:04:59 2023
    Anton Ertl wrote:

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...

    -marcel

  • From none) (albert@21:1/5 to Anton Ertl on Mon Dec 4 12:32:00 2023
    In article <2023Dec3.185807@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    mhx@iae.nl (mhx) writes:
    That is because OpenBLAS uses AVX2 with all cores working
    in parallel.

I expect that it uses AVX-512 on the Tiger Lake which I measured. My
measurements used only one core. Using more cores increases the CPU
    cycles needed (due to parallelization overhead), although it reduces
    the elapsed time.

I would be interested in a comparable time for these examples
run by OpenBLAS with one core.
I once tried to optimise on a transputer by starting a simple
loop at a cell boundary (a transputer has byte-based instructions).
The results were so puzzling that I kept away from trying.

    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From none) (albert@21:1/5 to Anton Ertl on Mon Dec 4 12:24:58 2023
    In article <2023Dec3.192306@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@cherry.(none) (albert) writes:
Summing numbers that mean something results in a sum whose error
is dominated by the maximum error of the summands.

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Try this with interval floats. A brief explanation:
a number 9.000 represents an interval between 8.9995 and 9.0005.
In this example the result is approximately
0 +/- 1E11 (with 19-digit-precision floats).

There are rules for propagating the intervals through multiplication,
addition, etc.


    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Krishna Myneni@21:1/5 to minforth on Mon Dec 4 06:55:21 2023
    On 12/1/23 06:32, minforth wrote:
    Krishna Myneni wrote:
It should make the code for loops which scale arrays more compact.
Typically, though, it is rarer to loop over a sequence of scalars which
multiply a single array element (a value at a fixed address) than to
loop over a sequence of scalars which accumulate into a single array
element, as in matrix multiplication.

Matrix multiplication (if not available as a primitive or from an
external library) is an example. In other numerical matrix algorithms,
pivoting is rather common, which involves scalar column or row
multiplication. Most occurrences in my code involve shifting and
scaling of vectors.

    The example of matrix multiplication was not a good fit for F+!. We
    usually accumulate the sum on the stack and then store it at the
    destination in the matrix.

    --
    Krishna

  • From Anton Ertl@21:1/5 to mhx on Tue Dec 5 08:37:01 2023
    mhx@iae.nl (mhx) writes:
    Anton Ertl wrote:

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...

    So? Yes, it seems that the typical answer to issues of numerical
    errors has been to

    1) Replace fixed point with floating point, so you don't have to do
    analysis for scaling.

2) Use wider FP types, so you may be able to do without numerical
analysis (or, if you would still need it, you can hope to miss the
cases where you need it). I think that iForth uses 80-bit FP
numbers. Why?

    3) Use examples like the above to convince themselves that numerical
    analysis is not needed.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to albert@cherry. on Tue Dec 5 08:54:49 2023
    albert@cherry.(none) (albert) writes:
    I would be interested to have a comparable time with the examples
    done by OpenBlas with one core.

It's not clear what you want, but for 1000x1000 matrix multiplication
OpenBLAS uses 0.16 cycles per iteration of the inner loop of the
straightforward implementation when using one core (or 160M cycles for
the whole matrix multiplication).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to Krishna Myneni on Tue Dec 5 08:58:31 2023
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    The example of matrix multiplication was not a good fit for F+!. We
    usually accumulate the sum on the stack and then store it at the
    destination in the matrix.

    Who is "we"?

    Looking at
    <http://theforth.net/package/matmul/current-view/matmul.4th>, the
    fastest version on all systems that does not use a primitive FAXPY
    is version 2, and that spends most of its time in:

    : faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    dup >r 3 and 0 ?do
    fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    r> 2 rshift 0 ?do
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    \ better performance on gforth-fast:
    \ fdup swap dup f@ f* float+ swap dup f@ f+ dup f! float+
    loop
    2drop fdrop ;

    As you can see, it uses F+!.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From mhx@21:1/5 to Anton Ertl on Tue Dec 5 11:28:20 2023
    Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    Anton Ertl wrote:

    1e30 1e f+ -1e30 f+ 1e 0e f~ .

    produces 0 (false), even though with exact summation it would produce
    true (-1). Of course, you may say that these numbers mean nothing to
    you, but you are not the only one in the world.

Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...

    So? Yes, it seems that the typical answer to issues of numerical
    errors has been to

    1) Replace fixed point with floating point, so you don't have to do
    analysis for scaling.

2) Use wider FP types, so you may be able to do without numerical
analysis (or, if you would still need it, you can hope to miss the
cases where you need it). I think that iForth uses 80-bit FP
numbers. Why?
    [..]

Because of (2), because some algorithms I care about are based
on doing selected steps in higher precision, and because the FPU
provides transcendental functions without needing libraries.

Nowadays I use double precision for speed (80-bit floats are about
2 to 3 times slower than 64-bit floats).

    3) Use examples like the above to convince themselves that numerical
    analysis is not needed.

    You misinterpret my posting. I find it illuminating when technical
    problems are visualized ( "2nm line-width means four Si atoms across" ).

    -marcel

  • From jan Coombs@21:1/5 to Anton Ertl on Tue Dec 5 12:08:01 2023
    On Tue, 05 Dec 2023 08:37:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    So? Yes, it seems that the typical answer to issues of numerical
    errors has been to

    1) Replace fixed point with floating point, so you don't have to do
    analysis for scaling.

    2) Use wider FP types, so you may be able to do without numerical
    analysis (or if you still would need it, you have the hope of missing
    the cases where you need it). I think that iForth uses 80-bit FP
    numbers. Why?

    3) Use examples like the above to convince themselves that numerical
    analysis is not needed.

    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    He also claims 50% processing power reduction for inherently compressed
    data, and less supervision of data due to all bits in data being valid,
    and none being lost by fixed-format constraints.

    Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

    Jan Coombs
    --

[1] The End of Error: Unum Computing, by John L. Gustafson
[sample chapters were available from publisher - ask privately]
https://www.taylorfrancis.com/books/mono/10.1201/9781315161532/end-error-john-gustafson

    [2] [was available from publisher - ask privately]

  • From Krishna Myneni@21:1/5 to Anton Ertl on Tue Dec 5 06:23:29 2023
    On 12/5/23 02:58, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    The example of matrix multiplication was not a good fit for F+!. We
    usually accumulate the sum on the stack and then store it at the
    destination in the matrix.

    Who is "we"?


    In the kForth double precision matrix multiply,

    https://github.com/mynenik/kForth-64/blob/master/forth-src/fsl/extras/mmul.4th

    The word DF_MUL_R1C2 obtains the sum of the products of a row from the
    first matrix with a column from the second matrix. The sum is
    accumulated on the stack.

    \ Multiply row of a1 with col of a2, element by element,
    \ and accumulate the sum.
    : df_mul_r1c2 ( row1 col2 -- ) ( F: -- rsum )
    df_r1c2>a1a2
    0e
    nc1 @ 0 DO
    2dup f@ f@ f* f+
    roffs2 @ +
    swap dfloat+ swap
    LOOP
    2drop ;

    The matrix multiplication word DF_MMUL subsequently stores the resulting
    rsum in the destination matrix, after the call to DF_MUL_R1C2.

    \ Multiply two double-precision matrices with data beginning at
    \ a1 and a2, and store at a3. Proper memory allocation is
    \ assumed, as are the dimensions for a2, i.e. nr2 = nc1 is
    \ assumed. This word assumes an integrated data/fp stack.
    : df_mmul ( a1 a2 a3 nr1 nc1 nc2 -- )
    set_mmul_params
    0 DO
    nc2 @ 0 DO
    J I df_mul_r1c2 dup f!
    dfloat+
    LOOP
    LOOP
    drop ;

Can the above be made faster with use of F+! within kForth? Possibly.

    --
    Krishna

  • From minforth@21:1/5 to jan Coombs on Tue Dec 5 12:52:36 2023
    jan Coombs wrote:

    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    He also claims 50% processing power reduction for inherently compressed
    data, and less supervision of data due to all bits in data being valid,
    and none being lost by fixed-format constraints.

Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

    Thanks for mentioning this. There is indeed a need for reduced, adaptable
    fp formats, especially in AI systems. See also the 'Motivation' section in https://github.com/stillwater-sc/universal

    There are already some experimental libraries using unum posits for various programming languages. Is there any Forth code that uses unums?

    But development will be slow as long as GPU hardware is cheap and readily available for faster time-to-market: https://www.windowscentral.com/microsoft/microsoft-to-spend-dollar32-billion-on-uks-ai-infrastructure-that-should-bring-more-than-20000-of-the-most-advanced-gpus-to-the-uk-by-2026

  • From none) (albert@21:1/5 to minforth on Tue Dec 5 14:07:05 2023
    In article <37830e4e5246f79b7d97247e4a973b1a@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    jan Coombs wrote:

    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    He also claims 50% processing power reduction for inherently compressed
    data, and less supervision of data due to all bits in data being valid,
    and none being lost by fixed-format constraints.

    Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

Thanks for mentioning this. There is indeed a need for reduced, adaptable
fp formats, especially in AI systems. See also the 'Motivation' section in
https://github.com/stillwater-sc/universal

There are already some experimental libraries using unum posits for various
programming languages. Is there any Forth code that uses unums?

But development will be slow as long as GPU hardware is cheap and readily
available for faster time-to-market:
https://www.windowscentral.com/microsoft/microsoft-to-spend-dollar32-billion-on-uks-ai-infrastructure-that-should-bring-more-than-20000-of-the-most-advanced-gpus-to-the-uk-by-2026

I doubt the necessity of fp formats in AI. 256 levels of uncertainty
must be plenty.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From minforth@21:1/5 to All on Tue Dec 5 13:32:55 2023
But development will be slow as long as GPU hardware is cheap and readily
available for faster time-to-market:
https://www.windowscentral.com/microsoft/microsoft-to-spend-dollar32-billion-on-uks-ai-infrastructure-that-should-bring-more-than-20000-of-the-most-advanced-gpus-to-the-uk-by-2026

I doubt the necessity of fp formats in AI. 256 levels of uncertainty
must be plenty.

    We probably won't live to see it, but embedded AI will be in every gadget
    of the distant future. A minimum of energy consumption will then be required. Unums promise an improvement here.

    Today's use of GPUs is only for big data centres. I am curious to know the
    peak power consumption of ChatGPT alone.

  • From Anton Ertl@21:1/5 to mhx on Tue Dec 5 15:14:13 2023
    mhx@iae.nl (mhx) writes:
Take the number of years since the big bang (14.5 billion years ago),
square it and multiply by the height of Church St. Spirit in meters for
good measure. A photon will travel 1e30 meters in that many years.
Now add 1 meter ...
    ...
    You misinterpret my posting. I find it illuminating when technical
    problems are visualized ( "2nm line-width means four Si atoms across" ).

I fail to visualize "the number of years since the big bang (14.5
billion years ago), square it and multiply by the height of Church
St. Spirit in meters". In particular, a squared timespan is pretty
unintuitive; I also don't know "Church St. Spirit" and its height. A
better way to visualize 10^30 is: the volume of Earth relative to the
volume of a 1.3mm sphere; or, alternatively, the ratio between the
weight of Earth and that of a grain of rice weighing 5.972mg.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to jan Coombs on Tue Dec 5 15:24:26 2023
    jan Coombs <jan4comp.lang.forth@murray-microft.co.uk> writes:
    The need for numerical analysis could be reduced in a processor that
    allows a data item to be of variable length, or to span multiple cells:

    In his book "The End of Error"[1] John Gustafson presents a core model[2]
    of his (Type 1) Unums. This data type allows both fields of a float to be
    of variable length, so that '*/' is redundant, being numerically the same
    as '* /'.

    The need for numerical analysis is such a problem that lots of people
    fall for snake-oil salesmen like Gustafson, but not enough that
anyone, not even gullible venture capitalists, invests significant
    money in it. By contrast, Mike Cowlishaw used similar arguments to
    convince IBM to implement his decimal FP nonsense in hardware (but for
    IBM, it may be a good way to convince gullible corporate managers to
    buy expensive IBM hardware, so it may be a win for them even though
    technically it is bullshit), and to convince IEEE to standardize it.

    Back to Gustafson: He apparently has seen that unums go nowhere, so in
    recent years he has switched to a new snake oil called posits. These
    are a variant of FP numbers, with the mantissa and exponent size
    depending on the exponent value. Which essentially would mean that
    you can throw all the numerical analysis up to now away and do it
    again. That's not going anywhere, either.

Might it be significantly simpler to implement variable-length data in
hardware on a zero-operand processor than a register-based one?

    No. Variable-length data is always a pain. E.g., see strings in Forth.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From minforth@21:1/5 to All on Wed Dec 6 08:00:36 2023
    Today, the predominant number format in algebraic computations
    within neural networks for deep learning (e.g. GEMMs for scaled
    matrix multiplication and addition) is bfloat16. Fixed (sub)tiles,
    e.g. 128x256, are directly supported by GPUs and can be computed
    in a single clock cycle. Although float32/64 can also be used,
    performance decreases more than quadratically with the element
    size. In return, quantisation and rounding errors are accepted,
    which are one of the causes of incorrect neural network outputs.
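
    To make the quantisation concrete, here is a toy bfloat16 truncation
    in Forth. It is only a sketch: real conversions round to nearest even
    rather than truncate, and a little-endian cell is assumed (as on
    x86); SF! and SF@ are the standard 32-bit float store/fetch.

    VARIABLE SBITS
    : >BF16 ( F: r -- r' )
      SBITS SF!                        \ store as IEEE single
      SBITS @ $FFFF0000 AND SBITS !    \ zero the 16 low mantissa bits
      SBITS SF@ ;                      \ reread: 8-bit exponent, 7-bit mantissa
    1.2345678e0 >BF16 F.   \ prints about 1.2344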

    So there is a real need for adaptive floats, and work is
    actually being done on their use. The term 'snake oil' is
    misleading (except perhaps in the old school Forth niche).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jan Coombs@21:1/5 to Anton Ertl on Wed Dec 6 10:28:23 2023
    On Tue, 05 Dec 2023 15:24:26 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    jan Coombs <jan4comp.lang.forth@murray-microft.co.uk> writes:
    [...]
    Might it be significantly simpler to implement variable-length data in
    hardware on a zero-operand processor than on a register-based one?

    No. Variable-length data is always a pain. E.g., see strings in Forth.

    Agreed, handling variable-length data, even in hardware, is much more
    complex than handling fixed-size integers. A processor doing this would
    need to have the current data set in cache, and preferably about twelve
    items per thread, in order to minimise fill and spill memory accesses.

    Strings could be handled as single stack items, and /mod used to split
    them, but where UTF-8 or other variable-length characters are used, a
    little extra hardware support would be needed to unpack, manipulate,
    and pack them. This may be useful, for example, to fetch an error
    message with a single (multi-cell) memory read and then forward it to
    the terminal.
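
    In software, that unpacking is a few shifts and masks per code point.
    A sketch of the xchar-style word XC@+ (standard $-prefixed hex
    literals; no validation of malformed sequences):

    : XC@+ ( a -- a' u )   \ fetch one UTF-8 code point, advance the address
      COUNT DUP $80 < IF EXIT THEN           \ ASCII fast path
      DUP $E0 < IF  $1F AND 1  ELSE          \ 110xxxxx : 1 more byte
      DUP $F0 < IF  $0F AND 2  ELSE          \ 1110xxxx : 2 more bytes
                    $07 AND 3  THEN THEN     \ 11110xxx : 3 more bytes
      0 ?DO  6 LSHIFT  OVER C@ $3F AND OR    \ shift in a continuation byte
             SWAP 1+ SWAP  LOOP ;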

    Jan Coombs
    --

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Krishna Myneni on Mon Dec 11 22:42:45 2023
    Krishna Myneni wrote:

    [...]

    Regarding code readability when no fp locals are available:

    Standard Forth only defines a reduced number of fp stack operations.
    I added

    FPICK (like PICK)
    FROLL (like ROLL)
    -FROLL (like ROLL reversed)
    F>R R>F (already discussed)

    Of course this only works if the FP stack is fully accessible, e.g.
    memory-mapped.
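
    With such a memory-mapped fp stack, FPICK itself is a one-liner. A
    sketch, where FP@ ( -- addr ) is a hypothetical system word returning
    the address of the fp top, with deeper items at higher addresses:

    : FPICK ( n -- ) ( F: i*r -- i*r rn )   \ 0 FPICK is FDUP
      FLOATS FP@ + F@ ;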

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to dxf on Mon Dec 11 19:49:10 2023
    On 12/11/23 19:07, dxf wrote:
    ...
    FSL has memory-mapped flocals. Can't be worse than reliance on FPICK and FROLL.

    ...

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the
    FSL implementation of flocals. Unfortunately, the standard fp stack
    operations in Forth 2012 prove insufficient; hence the consideration of
    words like FRISE.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to minforth on Mon Dec 11 19:44:48 2023
    On 12/11/23 16:42, minforth wrote:
    Krishna Myneni wrote:

    [...]

    Regarding code readability when no fp locals are available:

    Standard Forth only defines a reduced number of fp stack operations.
    I added

    FPICK (like PICK)
    FROLL (like ROLL)
    -FROLL (like ROLL reversed)
    F>R R>F (already discussed)

    Of course this only works if the FP stack is fully accessible, e.g.
    memory-mapped.

    Yes, I have added FPICK as an intrinsic word in kForth-64, and have
    source definitions of F>R and FR> (your R>F, which is actually a better
    name). But I think FRISE may reduce/eliminate the need for F>R etc. When
    the FP stack resides in memory and can be accessed using a pointer, it's
    easy to implement FRISE in source to assess its usefulness.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Krishna Myneni on Tue Dec 12 07:00:30 2023
    Krishna Myneni wrote:
    On 12/11/23 16:42, minforth wrote:
    [...]

    Yes, I have added FPICK as an intrinsic word in kForth-64, and have
    source definitions of F>R and FR> (your R>F, which is actually a better
    name). But I think FRISE may reduce/eliminate the need for F>R etc.
    When the FP stack resides in memory and can be accessed using a
    pointer, it's easy to implement FRISE in source to assess its
    usefulness.

    You have defined RISE as in
    2 RISE ( i*x a b c d -- i*x b a c d ) et cetera

    I don't really have an application where a position swap deep in the
    stack would fit, because Forth operations only ever use the top stack
    element(s).

    Then rather something like
    2 FLIP ( i*x a b c d -- d b c a )
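
    Taking the diagram at face value (exchange the fp top with the item
    n+1 deep), a sketch called FFLIP here, under the same memory-mapped
    assumption and with the hypothetical FP@ as above:

    : FFLIP ( n -- ) ( F: rk i*r r0 -- r0 i*r rk )
      1+ FLOATS FP@ +    \ address of the item n+1 deep
      DUP F@             \ copy it to the fp top
      FSWAP F! ;         \ and store the old top into its slot
    \ 2 FFLIP : a b c d -- d b c a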

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to Krishna Myneni on Tue Dec 12 08:49:57 2023
    Krishna Myneni wrote:

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the FSL implementation of flocals.

    I don't understand. This should be awkward, but ok?

    8 CONSTANT /flocals   \ fixed frame size: 8 fp slots

    : (frame) ( n -- ) FLOATS ALLOT ;   \ n>0 reserves, n<0 releases frame space

    : FRAME|
           0 >R
           BEGIN   BL WORD  COUNT  1 =        \ next word a single char,
                   SWAP C@  [CHAR] | =        \ and that char the closing | ?
                   AND 0=
           WHILE   POSTPONE F,  R> 1+ >R      \ compile F, : store one fp arg at HERE
           REPEAT
           /FLOCALS R> - DUP 0< ABORT" too many flocals"
           POSTPONE LITERAL  POSTPONE (frame) ; IMMEDIATE  \ allot unused slots

    : |FRAME ( -- ) [ /FLOCALS NEGATE ] LITERAL (FRAME) ;  \ release the frame

    : &h            HERE [ 1 FLOATS ] LITERAL - ;
    : &g            HERE [ 2 FLOATS ] LITERAL - ;
    : &f            HERE [ 3 FLOATS ] LITERAL - ;
    : &e            HERE [ 4 FLOATS ] LITERAL - ;
    : &d            HERE [ 5 FLOATS ] LITERAL - ;
    : &c            HERE [ 6 FLOATS ] LITERAL - ;
    : &b            HERE [ 7 FLOATS ] LITERAL - ;
    : &a            HERE [ 8 FLOATS ] LITERAL - ;  \ &a = first slot filled

    : a             &a F@ ;
    : b             &b F@ ;
    : c             &c F@ ;
    : d             &d F@ ;
    : e             &e F@ ;
    : f             &f F@ ;
    : g             &g F@ ;
    : h             &h F@ ;
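
    A hypothetical usage sketch -- the names written between FRAME| and |
    are documentation only; the accessors are always a, b, c, ..., with a
    holding the fp value that was on top at entry:

    : FSUMSQ ( F: x y -- x^2+y^2 )
      FRAME| y x |
      a FDUP F*  b FDUP F*  F+   \ or FSQUARE, as discussed upthread
      |FRAME ;
    1e0 2e0 FSUMSQ F.   \ prints 5.

    Each call allots its frame at HERE and releases it on exit, so
    recursive calls get fresh frames.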

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to minforth on Tue Dec 12 09:32:59 2023
    In article <743d2d729862d0b2fb9ff9ce314935dd@news.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    Krishna Myneni wrote:

    [...]

    Regarding code readability when no fp locals are available:

    Standard Forth only defines a reduced number of fp stack operations.
    I added

    FPICK (like PICK)
    FROLL (like ROLL)
    -FROLL (like ROLL reversed)
    F>R R>F (already discussed)

    Of course this only works if the FP stack is fully accessible, e.g.
    memory-mapped.

    If I wanted this, it shouldn't be too hard on the 8087's 8-register
    stack, which rotates. If the return stack is available as memory,
    storing FP stack items there is also doable.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to mhx on Tue Dec 12 12:54:59 2023
    In article <a23e7f246a3ab1c5d76d263804ebec69@news.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    Krishna Myneni wrote:

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the FSL
    implementation of flocals.

    I don't understand. This should be awkward, but ok?

    [FRAME| flocals code elided; quoted in full above]

    Re-entrant words come into play when the same code is concurrently
    executed in parallel threads (in Forth), or when the word is in a DLL
    or a resident library that can be accessed by several processes at the
    same time. I can see that you address that situation here.

    Not that I worry much about re-entrancy: I happily compile separate
    code for each parallel thread.

    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to mhx on Tue Dec 12 08:00:19 2023
    On 12/12/23 02:49, mhx wrote:
    Krishna Myneni wrote:

    The flocals implementation in the FSL is substantially worse. Unlike
    using fp stack operations, one can't write re-entrant words with the
    FSL implementation of flocals.

    I don't understand. This should be awkward, but ok?

    [FRAME| flocals code elided; quoted in full above]


    The above looks ok. It has been a while since I looked at this code.
    The implementation above allots space for new locals on each entry and
    frees it on exit -- I seem to be wrong about the FSL implementation
    killing re-entrancy.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to minforth on Tue Dec 12 19:30:10 2023
    On 12/12/23 01:00, minforth wrote:
    Krishna Myneni wrote:
    On 12/11/23 16:42, minforth wrote:
    [...]

    Yes, I have added FPICK as an intrinsic word in kForth-64, and have
    source definitions of F>R and FR> (your R>F, which is actually a
    better name). But I think FRISE may reduce/eliminate the need for F>R
    etc. When the FP stack resides in memory and can be accessed using a
    pointer, it's easy to implement FRISE in source to assess its usefulness.

    You have defined RISE as in
    2 RISE ( i*x a b c d -- i*x b a c d ) et cetera

    I don't really have an application where a position swap deep in the
    stack would fit, because Forth operations only ever use the top stack
    element(s).

    Then rather something like
    2 FLIP ( i*x a b c d -- d b c a )

    The depth 2 RISE/FRISE would provide the function I was originally
    asking for, but the general version is similar to FPICK. Admittedly,
    whether the general FRISE has application for other depths remains to be
    seen. Perhaps an on-fpstack sorting routine?
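
    For reference, the depth-n exchange is short when the fp stack is
    memory-mapped; a sketch with the hypothetical FP@ ( -- addr ) of the
    fp top, deeper items at higher addresses (n FRISE exchanges the items
    at depths n and n+1):

    : FRISE ( n -- )
      FLOATS FP@ +     \ address of the item at depth n
      DUP F@           \ copy item n to the top
      DUP FLOAT+ F@    \ copy item n+1 to the top
      DUP F!           \ item n+1 goes into slot n
      FLOAT+ F! ;      \ item n goes into slot n+1
    \ 2 FRISE : a b c d -- b a c d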

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Krishna Myneni on Wed Dec 13 08:43:35 2023
    Krishna Myneni wrote:
    The depth 2 RISE/FRISE would provide the function I was originally
    asking for, but the general version is similar to FPICK. Admittedly,
    whether the general FRISE has application for other depths remains to be seen.

    For similar reasons I used to have a word called PATCH as a counterpart
    to PICK: n PATCH overwrote the stack value at depth n, often handy for
    avoiding ROLLs. But in the end, such words are just crutches if you
    don't have locals.
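
    For the data stack, such a PATCH is a one-liner given a stack
    pointer. A sketch assuming SP@ returns the address of the top item
    and the stack grows toward lower addresses (true of e.g. Gforth, but
    not guaranteed by the standard):

    : PATCH ( x n -- )      \ overwrite the item at depth n (0 = top) with x
      2 + CELLS SP@ + ! ;   \ skip the offset cell, x, and n items
    10 20 30 99 1 PATCH .S  \ leaves 10 99 30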

    Perhaps an on-fpstack sorting routine?

    Yeah, brings order to chaos ;o)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jan Coombs@21:1/5 to minforth on Fri Dec 15 15:30:01 2023
    On Tue, 5 Dec 2023 12:52:36 +0000
    minforth@gmx.net (minforth) wrote:

    [about floats with improved performance]
    Thanks for mentioning this. There is indeed a need for reduced, adaptable
    fp formats, especially in AI systems. See also the 'Motivation' section in https://github.com/stillwater-sc/universal

    Thanks, have added that to my reference docs.

    There are already some experimental libraries using unum posits for various programming languages. Is there any Forth code that uses unums?

    A quick route would be to buy a processor with posits [1] built in and
    install Forth. I thought RISC-V ones were available, but looking now I
    only found a prototype [2], a product announcement [3], and available
    HW designs [4][5].

    Jan Coombs
    --

    [1] "Posits, a New Kind of Number, Improves the Math of AI: The first
    posit-based processor core gave a ten-thousandfold accuracy boost",
    https://spectrum.ieee.org/floating-point-numbers-posits-processor

    [2] "Researchers Build a RISC-V Chip That Calculates in Posits, Boosting
    Accuracy for ML Workloads",
    https://www.hackster.io/news/researchers-build-a-risc-v-chip-that-calculates-in-posits-boosting-accuracy-for-ml-workloads-086b985bf0c1

    [3] Marco Cococcioni, Federico Rossi, Emanuele Ruffaldi, and Sergio
    Saponara, "A Lightweight Posit Processing Unit for RISC-V Processors in
    Deep Neural Network Applications", IEEE Transactions on Emerging Topics
    in Computing,
    https://riscv.org/news/2021/10/a-lightweight-posit-processing-unit-for-risc-v-processors-in-deep-neural-network-applications-marco-cococcioni-federico-rossi-emanuele-ruffaldi-and-saponara-sergio-ieee-transactions-on-emerging/

    [4] "PERI: A Configurable Posit Enabled RISC-V Core",
    https://dl.acm.org/doi/fullHtml/10.1145/3446210

    [5] "PERCIVAL: Open-Source Posit RISC-V Core with Quire Capability",
    https://arxiv.org/abs/2111.15286

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)