• Inefficiency of FSL matrices

    From Krishna Myneni@21:1/5 to All on Wed Dec 13 18:38:06 2023
    The array and matrix definitions are given in the FSL utilities source,

    fsl-util.x
    or
    fsl_util.x.

    The arrays are quite nice for their flexibility in making arrays of any
    type of data, and are useful in many instances. However, the source
    definitions are slow on non-optimizing Forth systems.

    I believe the design of the arrays and matrices traces back to Julian
    Noble's "Scientific Forth". A 1-D array is named with a trailing left
    brace, "{", while a 2-D matrix has a trailing double left brace, "{{".
    This allows a convenient notation:

    a{ I } \ resolves to address of array element I
    m{{ J I }} \ resolves to address of matrix element at row J, col I

    The indices are 0-origin, so the first element of the array named a{ is
    stored at address

    a{ 0 }

    and the top left element of the matrix named m{{ is stored at address

    m{{ 0 0 }}

    Creation of fixed size arrays and matrices is performed with the parsing
    words

    ARRAY
    MATRIX

    and the size of each element is one of the arguments on the stack, so, typically, we can write

    36 FLOAT ARRAY a{ \ create array named "a{" with 36 elements of size
    FLOAT (a constant, usually 8 bytes)

    120 48 INTEGER MATRIX m{{ \ create matrix named "m{{" with 120 rows and
    48 columns, each element of size INTEGER (the cell size in bytes)

    There are provisions for creating dynamically allocated arrays as well.

    The operators "}" and "}}" resolve the address of the indexed element of
    the specified array or matrix, and this is where the inefficiency lies.
    I've been tempted to make "}" and "}}" machine code primitives to
    increase their speed in kForth. It seems unlikely, though, that "}" and
    "}}" would be generally adopted as common-use words, so I have not made
    them built-in words.

    To give you an idea of the efficiency hit from using these words versus
    custom arrays of fixed-size elements, I tested numerically intensive
    code using a custom matrix of double-precision floats, with
    index-and-fetch and index-and-store words:

    ]]F@
    ]]F!

    The code and definitions for the two words above are found in the
    program, pde1.4th, at the link provided below.

    Next, I created a version of pde1.4th which used FSL matrices and no
    extra optimizations, with the notation described earlier e.g. for a
    floating point matrix,

    m{{ J I }} F@ and m{{ J I }} F!

    Typical execution times with the custom matrix fetch and store words
    versus the standard FSL matrix word usage are:

    Custom matrices: 9656 ms
    FSL matrices: 20474 ms

    The above timings use kForth-64 to time the word SOLVE for the
    RECTangular boundary value problem, with a specified tolerance of 1E-16.

    kforth32 has similar timings (kforth32-fast performs much faster).

    Some optimization is clearly needed on the FSL array and matrix definitions.

    The FSL definitions of ARRAY and MATRIX in kForth may be found in its fsl-util.4th file at the link given below.

    --
    Krishna Myneni

    Links to code mentioned above:

    https://github.com/mynenik/kForth-64/blob/master/forth-src/pde1.4th

    https://github.com/mynenik/kForth-64/blob/master/forth-src/fsl/fsl-util.4th

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Thu Dec 14 12:44:32 2023
    Interesting, but the performance gain is not surprising
    if you use matrix-adapted primitives.

    Just for comparison:
    I use different internal representations for vectors and
    matrices. Vectors are simple fp-number fields. Matrices
    in the heap are fat data and have a header for dimension,
    numeric type and reference count. Matrices can only be
    referenced via matrix-values (or matrix-stack elements).

    The compiler automatically generates the correct and
    therefore performant type-dependent code for
    <indices> } -> matrix element
    <indices> }^ -> address of the element
    <indices> }! -> write matrix element

    This is of course not FSL syntax, but more readable IMO.
    There are also slices, but that is not the topic here.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to minforth on Thu Dec 14 09:51:44 2023
    On 12/14/23 06:44, minforth wrote:
    Interesting, but the performance gain is not surprising
    if you use matrix-adapted primitives.

    Just for comparison:
    I use different internal representations for vectors and
    matrices. Vectors are simple fp-number fields. Matrices
    in the heap are fat data and have a header for dimension,
    numeric type and reference count. Matrices can only be
    referenced via matrix-values (or matrix-stack elements).

    How do you implement vectors of different types other than fp-numbers?
    Do you use a full-blown matrix then?

    The compiler automatically generates the correct and
    therefore performant type-dependent code for

    In the FSL, this can be done for statically allotted arrays and matrices,
    when the "}" and "}}" words can determine the element size (+number of
    columns for a matrix) at compile-time, by making them IMMEDIATE words.

    I don't see how the element-size-specific reference can be done for
    dynamic arrays and matrices at compile-time, when the element size and
    number of columns are not known.

    Another drawback would be having to make "}" and "}}" state-dependent
    words (on single-xt systems).

    <indices> } -> matrix element
    <indices> }^ -> address of the element
    <indices> }!  -> write matrix element

    This is of course not FSL syntax, but more readable IMO.
    There are also slices, but that is not the topic here.

    I prefer the symmetry of the FSL syntax. If you used }@ it would be
    more symmetric, but you might not like the added visual noise. Again,
    it would not be possible for }@ to compile an element-size-specific
    reference on dynamically declared matrices.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Thu Dec 14 10:27:19 2023
    On 12/14/23 09:51, Krishna Myneni wrote:
    On 12/14/23 06:44, minforth wrote:
    ...
    The compiler automatically generates the correct and
    therefore performant type-dependent code for

    In the FSL, this can be done for statically allotted arrays and matrices,
    when the "}" and "}}" words can determine the element size (+number of columns for a matrix) at compile-time, by making them IMMEDIATE words.

    I don't see how the element-size-specific reference can be done for
    dynamic arrays and matrices at compile-time, when the element size and
    number of columns are not known.


    Correction: For FSL dynamic arrays and matrices the element size is
    known at compile time. For FSL dynamic matrices the number of columns is
    not known at compile time, so it would always have to be retrieved at
    run-time.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Fri Dec 15 08:50:38 2023
    The details of the implementation would go too far here. Essentially,
    my matrices are objects with methods for reading and writing cells
    (and slices). There is a matrix stack with pointers to a pre-allocated
    pool of empty matrix objects in the heap at the start of the programme.
    Other frequently used global matrix values are also predefined.
    This allows a lot of early-binding during compilation.

    In Forth jargon, the matrices do not have single cell execution tokens,
    but structs that carry type information (and other states). It is
    therefore not a standard Forth. Available primitive data types are
    integers, float24, float32, float64 (and compound types with companion bits).

    As visual syntax example an older PLU factorisation (today primitive):

    : MPLU ( m: m -- m: l p u ) \ PLU factorization
    ?square mrows meye mswap mrows dup mzeros mswap ( p l u ) mrows
    FOR n n mcpivot v0~ IF -408 throw THEN
    dup n = IF drop ELSE mrot dup n mrswap mrot dup n mrswap mrot n mrswap THEN
    m[ n n ] 1. m'[ n n ]! mrows n 1+
    ?DO m[ i n ]^ dup v@ fover f/ fdup i n mrmulsub 0. v! m'[ i n ]!
    LOOP fdrop
    NEXT ;

    (M[ means the top, M'[ the second matrix on the matrix stack, and
    as you can see, floats have been assumed here)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to minforth on Fri Dec 15 10:47:21 2023
    minforth wrote:

    The details of the implementation would go too far here. Essentially,
    my matrices are objects with methods for reading and writing cells
    (and slices). There is a matrix stack with pointers to a pre-allocated
    pool of empty matrix objects in the heap at the start of the programme.
    Other frequently used global matrix values are also predefined.
    This allows a lot of early-binding during compilation.

    This is also the approach I have chosen for iForth. However, there I have
    no need for a matrix stack (the parameter stack is perfectly well
    suited to handle the matrix/array struct pointers).
    I also have a pool of "local" matrix objects. It is possible to write
    words for e.g. matrix inversion that work independently of type (byte,
    word, cell, sfloat, dfloat, float, xfloat, ddouble, complex, arbitrary
    precision ...)

    I kept the FSL notation A{{ }} B{ } but there are new words to
    access the struct fields: A{{ DIMS , B{ DADDR C{{ DSIZE etc.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to mhx on Fri Dec 15 11:19:04 2023
    mhx wrote:

    minforth wrote:

    The details of the implementation would go too far here. Essentially,
    my matrices are objects with methods for reading and writing cells
    (and slices). There is a matrix stack with pointers to a pre-allocated
    pool of empty matrix objects in the heap at the start of the programme.
    Other frequently used global matrix values are also predefined.
    This allows a lot of early-binding during compilation.

    This is also the approach I have chosen for iForth. However, there I have
    no need for a matrix stack (the parameter stack is perfectly well
    suited to handle the matrix/array struct pointers).

    Yes, that works. The mstack is also there for historical reasons,
    because matrix pointers used to be fat pointers. With the introduction
    of copy-on-write, this was no longer suitable. In addition, the normal
    integer and fp data stacks are full enough with other things, so the
    mstack is also useful for clearer code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to minforth on Fri Dec 15 11:28:46 2023
    minforth wrote:
    [..]
    In addition, the normal integer and fp data stacks
    are full enough with other things, so the mstack is also useful for clearer code.

    It probably comes as no surprise that I find locals to be a perfect solution for that.


    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Fri Dec 15 08:55:33 2023
    On 12/13/23 18:38, Krishna Myneni wrote:
    ...
    The operators "}" and "}}" resolve the address of the indexed element of
    the specified array, and this is where the inefficiency lies. I've been
    tempted to make "}" and "}}" machine code primitives to increase their
    speed in kForth. It seems unlikely, though, that the words
    "}" and "}}" would be generally adopted as common use words, so I have
    not made them built-in words.


    I think the best approach forward for me, in kForth, is to include
    efficient primitives in the Forth system,

    FSL_ARRAY_EL_REF ( a_arr nidx -- )
    FSL_MATRIX_EL_REF ( a_mat nidx midx -- )

    Then, in fsl-util.4th, "}" and "}}" may be declared as synonyms for
    these primitives, or as something else if the user prefers a different
    implementation of arrays and matrices.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Dec 17 15:00:16 2023
    On 12/15/23 08:55, Krishna Myneni wrote:
    On 12/13/23 18:38, Krishna Myneni wrote:
    ...
    The operators "}" and "}}" resolve the address of the indexed element
    of the specified array, and this is where the inefficiency lies. I've
    been tempted to make "}" and "}}" machine code primitives for these
    words to increase their speed in kForth. It seems unlikely though that
    the words "}" and "}}" would be generally adopted as common use words,
    so I have not made them built-in words.


    I think the best approach forward for me, in kForth, is to include
    efficient primitives in Forth system,

    FSL_ARRAY_EL_REF    ( a_arr nidx -- )
    FSL_MATRIX_EL_REF   ( a_mat nidx midx -- )

    ...

    I'm implementing "}}" as an intrinsic word in kForth, for use with FSL
    matrices. To test new implementations and benchmark them against
    different versions, including the original source code version from the
    kForth fsl-util.4th file, I wrote a benchmark program which also
    performs elementary tests on "}}". The existing source code version of
    "}}" for FSL-style, row-ordered matrices is

    : }} ( addr i j -- addr[i][j] )
    >R >R
    DUP [ 2 CELLS ] LITERAL - 2@ \ &a[0][0] size m
    R> * R> + * + ;

    In addition to the automated tests on the implementation of "}}", the
    benchmark program generates a random 256x256 floating point matrix and
    can perform a specified number of fetches from random row and column
    locations -- a typical number would be 1,000,000 random accesses from
    the matrix.

    --- example use ---
    include bench-fsl-matrix

    FSL-UTIL V1.3c 11 Sep 2021 EFC, KM

    TESTING }}

    Filling 256x256 floating point matrix with random values ...
    Benchmark: "ms@ 1000000 random-matrix-access ms@ swap - ." ok
    ok

    ms@ 1000000 random-matrix-access ms@ swap - .
    454 ok
    --- end of example ---

    The testing and benchmark program is given below.

    --- start of program ---
    \ bench-fsl-matrix.4th
    \
    \ Benchmark the implementation of }}
    \
    \ Krishna Myneni, 2023-12-17

    include ans-words
    include random
    include ttester
    include fsl/fsl-util

    \ Basic tests on }}
    4 constant NROWS
    16 constant NCOLS
    NROWS NCOLS float matrix b{{
    cr
    TESTING }}
    t{ b{{ b{{ 0 0 }} = -> true }t
    t{ b{{ float+ b{{ 0 1 }} = -> true }t
    t{ b{{ NCOLS floats + b{{ 1 0 }} = -> true }t
    t{ b{{ NROWS NCOLS * 1- floats + b{{ NROWS 1- NCOLS 1- }} = -> true }t

    \ Benchmarking matrix is 256x256
    256 constant NROWS
    NROWS constant NCOLS
    NROWS NCOLS float matrix m{{

    ms@ seed ! \
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Dec 17 21:36:18 2023
    On 12/17/23 15:00, Krishna Myneni wrote:
    ...

    : random-matrix-access ( n -- )
        0 ?DO  m{{ random-row-col }} drop LOOP ;

    ...
    This doesn't measure what I thought it was measuring. The benchmarking
    of RANDOM-MATRIX-ACCESS measures the efficiency of RANDOM-ROW-COL rather
    than measuring the efficiency of "}}". Using fixed row and column
    numbers as indices gives a more accurate benchmark of "}}" which
    computes the address of the specified element.

    -- KM

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Dec 17 21:58:37 2023
    On 12/15/23 08:55, Krishna Myneni wrote:
    On 12/13/23 18:38, Krishna Myneni wrote:
    ...
    The operators "}" and "}}" resolve the address of the indexed element
    of the specified array, and this is where the inefficiency lies. I've
    been tempted to make "}" and "}}" machine code primitives for these
    words to increase their speed in kForth. ...

    I think the best approach forward for me, in kForth, is to include
    efficient primitives in Forth system,

    ...
    Experimenting with an assembly language implementation of "}}", I find
    that the following code in kforth64's vm64.s gives me a 2x speedup over
    the Forth source code for double-precision fp matrices. This factor of
    2 should not depend on the size of the matrix element; e.g., it should
    hold even if each matrix element were a quad-precision floating point
    number (16 bytes) or a larger data structure -- I still have to test
    this speedup under such conditions.

    --
    KM


    L_fsl_mat_addr:
    LDSP
    INC_DSP
    mov (%rbx), %rcx # rcx = j (column index)
    INC_DSP
    STSP
    mov (%rbx), %rdx # rdx = i (row index)
    mov WSIZE(%rbx), %rax # address of first element
    sub $2*WSIZE, %rax # rax = a - 2 cells
    mov %rax, %rdi
    mov (%rax), %rax # rax = ncols
    imulq %rdx # rax = i*ncols
    add %rax, %rcx # rcx = i*ncols + j
    mov %rdi, %rax
    add $WSIZE, %rax
    mov (%rax), %rax # rax = size
    imulq %rcx # rax = size*(i*ncols + j)
    add %rax, WSIZE(%rbx) # TOS = a + rax
    INC2_DTSP
    xor %rax, %rax
    NEXT

    vs

    : }} ( addr i j -- addr[i][j] )
    >R >R
    DUP [ 2 CELLS ] LITERAL - 2@ \ &a[0][0] size m
    R> * R> + * + ;

    In the experimental kforth64, the code L_fsl_mat_addr maps to the
    built-in word, "}}".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Mon Dec 18 09:35:25 2023
    Looks good for existing FSL code. It might even be faster if you
    introduced type-specific and thus shorter address computations
    like }}F }}SF }}I and }F }SF }I.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to minforth on Mon Dec 18 07:05:01 2023
    On 12/18/23 03:35, minforth wrote:
    Looks good for existing FSL code. It might even be faster if you
    introduced type-specific and thus shorter address computations
    like }}F }}SF }}I and }F }SF }I.

    Yes, I expect I might gain an additional 2x speedup for type-specific
    versions. But, I want }} to do the size-specific compilation,
    eventually. To do that properly, I want to go to a dual-xt system and
    avoid explicit STATE-dependence (which has other benefits we've
    discussed in the past). That's a bigger project for later.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Krishna Myneni on Mon Dec 18 13:32:20 2023
    Krishna Myneni wrote:
    To do that properly, I want to go to a dual-xt system and
    avoid explicit STATE-dependence (which has other benefits we've
    discussed in the past). That's a bigger project for later.

    Forge ahead. Having worked with both versions (state-smart and
    dual-xt), I have to say that the investment in converting a
    well-running, debugged Forth system to dual-xt only pays off
    if you "POSTPONE like crazy" or are in the habit of writing DSLs.

    Otherwise, state-smartness can be an itch, but is not bad per se.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to All on Mon Dec 18 07:37:26 2023
    On 12/18/23 07:05, Krishna Myneni wrote:
    ...

    I have pushed these changes in kForth-64 into the main branch at github:

    1. make FSL matrix addressing word }} an intrinsic word for higher
    efficiency.

    2. update the fsl-util.4th to only load the Forth source version of }}
    if it is undefined.

    The test code fsl-tester.4th (in forth-src/fsl/) may be used to verify
    that FSL code runs correctly with the faster version of }}.

    These changes will also appear in kForth-32 and in kForth-Win32, after a
    short period of testing.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to minforth on Mon Dec 18 17:05:48 2023
    On 12/18/23 07:32, minforth wrote:
    Krishna Myneni wrote:
    To do that properly, I want to go to a dual-xt system and avoid
    explicit STATE-dependence (which has other benefits we've discussed in
    the past). That's a bigger project for later.

    Forge ahead. Having worked with both versions (state-smart and
    dual-xt), I have to say that the investment in converting a
    well-running, debugged Forth system to dual-xt only pays off
    if you "POSTPONE like crazy" or are in the habit of writing DSLs.

    Otherwise, state-smartness can be an itch, but is not bad per se.

    It's a tradeoff between having a utilitarian Forth system to compute
    various tasks efficiently, versus having a Forth system which provides
    tools for exploratory programming such as DSLs and introspection
    capabilities.

    For larger software projects, being able to write introspection tools
    with the underlying language has demonstrable utility (see the past
    discussion on making word dependency trees). Dual-xt systems with a
    couple of support words are one way to allow for this.

    Balancing these two goals without breaking an actively used Forth system requires a more cautious approach. And, of course, time is limited.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Mon Dec 18 18:48:44 2023
    On 12/18/23 07:37, Krishna Myneni wrote:
    On 12/18/23 07:05, Krishna Myneni wrote:
    ...

    I have pushed these changes in kForth-64 into the main branch at github:

    1. make FSL matrix addressing word }} an intrinsic word for higher efficiency.

    2. update the fsl-util.4th to only load the Forth source version of }}
    if it is undefined.

    The test code fsl-tester.4th (in forth-src/fsl/) may be used to verify
    that FSL code runs correctly with the faster version of }}.

    These changes will also appear in kForth-32 and in kForth-Win32, after a short period of testing.


    I also introduced the word "*+" in the last commit in kForth-64.

    *+ ( a b c -- n ) \ n = a*b + c

    Note that *+ is not the same as the sequence "* +". It has the
    equivalent effect of either of the following:

    -ROT * +
    >R * R> +

    *+ is similar to the Forth standard word, "*/". Such ternary operators
    reduce stack juggling and make code more readable. As an example, the
    custom floating point array words from pde1.4th can now be recoded from

    : ]]F@ ( a row col -- r ) >r GRIDSIZE * r> + floats + f@ ;
    : ]]F! ( r a row col -- ) >r GRIDSIZE * r> + floats + f! ;

    to

    : ]]F@ ( a row col -- r ) GRIDSIZE swap *+ floats + f@ ;
    : ]]F! ( r a row col -- ) GRIDSIZE swap *+ floats + f! ;

    It simplifies the code and increases the performance of SOLVE by about 10%.

    I expect the floating point version, F*+, to be equally useful, if not
    more so, for improving the readability of fp code and increasing
    efficiency.

    F*+ ( F: r1 r2 r3 -- r ) \ r = r1*r2 + r3

    F*+ provides an intrinsic scalar linear transformation. I have not yet
    added this word.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to mhx on Tue Dec 19 08:43:19 2023
    mhx wrote:
    minforth wrote:
    [..]
    In addition, the normal integer and fp data stacks
    are full enough with other things, so the mstack is also useful for clearer code.

    It probably comes as no surprise that I find locals to be a perfect solution for that.

    Yes and no, depending on whether you are working with global matrices
    à la FSL or with a stack of matrix pointers, where you can DUP and DROP
    matrices. Copy-on-write is a big plus here; otherwise you have to
    manage (temporary) matrix copies and perhaps free the data yourself.

    With copy-on-write in the background, matrix locals are easy.
    OTOH I don't see much advantage in using matrix locals for dealing with
    global FSL matrices.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to minforth on Tue Dec 19 10:34:05 2023
    minforth wrote:

    mhx wrote:
    OTOH I don't see much advantage in using matrix locals for dealing with global FSL matrices.

    I don't use the FSL for serious work but I stuck with its basic interface words. Maybe I'll change that when it comes to a 2nd version of iForth
    or iSPICE.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to mhx on Tue Dec 19 11:19:50 2023
    mhx wrote:
    I don't use the FSL for serious work but I stuck with its basic interface words. Maybe I'll change that when it comes to a 2nd version of iForth
    or iSPICE.

    "Don't fix it if it ain't broke" ;-)

    I'd rather invest in more readability, i.e. easy transformation from
    the usual Matlab notation to Forth. A simple example:
    Matlab: (v'Av), where A is a matrix and v' is the transpose of vector v
    Forth postfix:
    .. v' A m* v m* 0> IF ...

    A suffix recogniser comes in handy when coding linear algebra.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mhx@21:1/5 to minforth on Tue Dec 19 13:30:40 2023
    minforth wrote:

    I'd rather invest in more readability i.e. easy transformation from usual Matlab notation to Forth. A simple example:
    Matlab: (v'Av) where A is a matrix and v' is the transposed vector of v
    Forth postfix:
    .. v' A m* v m* 0> IF ...

    I have to say

    FORTH> 2 2 FLOAT MATRIX a{{
    FORTH> 1e 2e 3e 4e a{{ #=> ok
    FORTH> a{{ }}print
    1.000000e+0000 2.000000e+0000
    3.000000e+0000 4.000000e+0000 ok
    FORTH> a{{ 1 =rowget a{{ a{{ 1 =colget =mat* swap =mat* }}print
    3.000000e+0001 4.000000e+0001
    6.600000e+0001 8.800000e+0001 ok
    FORTH> a{{ }}print
    1.000000e+0000 2.000000e+0000
    3.000000e+0000 4.000000e+0000 ok

    But I also have XOPG:

    : test1 LET a = b*c-3.17e-5/TANH(w)+ABS(x): CR LET. a: ;
    LET w = 1.e-3: LET x = -2.5: CR CR test1 -- 2.4682999894333340377777106878373968247191492

    0-VALUE GVAL a 0-VALUE GVAL b 0-VALUE GVAL c
    0-VALUE GVAL disc ( Used for discriminant )
    : quadraticroot ( F: a b c -- y1 y2 )
    TO c TO b TO a Pickup coefficients.
    LET disc=SQRT(b*b-4*a*c): Set discriminant.
    LET ((-b+disc)/(2*a),(-b-disc)/(2*a)): Put values on f-stack.
    ;
    ( x*x-3*x+2=0 ) LET quadraticroot (1,-3, 2) :
    CR .PRINT CR .PRINT -- prints 1 2
    ( goldenratio ) CR LET. MAX(quadraticroot (1,-1,-1)): -- 1.6180339887498948482045868343656381177203

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to mhx on Tue Dec 19 14:28:01 2023
    mhx wrote:

    minforth wrote:

    I'd rather invest in more readability i.e. easy transformation from usual
    Matlab notation to Forth. A simple example:
    Matlab: (v'Av) where A is a matrix and v' is the transposed vector of v
    Forth postfix:
    .. v' A m* v m* 0> IF ...

    I have to say

    FORTH> 2 2 FLOAT MATRIX a{{
    FORTH> 1e 2e 3e 4e a{{ #=> ok
    FORTH> a{{ }}print
    1.000000e+0000 2.000000e+0000
    3.000000e+0000 4.000000e+0000 ok
    FORTH> a{{ 1 =rowget a{{ a{{ 1 =colget =mat* swap =mat* }}print
    3.000000e+0001 4.000000e+0001
    6.600000e+0001 8.800000e+0001 ok
    FORTH> a{{ }}print
    1.000000e+0000 2.000000e+0000
    3.000000e+0000 4.000000e+0000 ok

    But I also have XOPG:

    : test1 LET a = b*c-3.17e-5/TANH(w)+ABS(x): CR LET. a: ;
    LET w = 1.e-3: LET x = -2.5: CR CR test1 -- 2.4682999894333340377777106878373968247191492

    0-VALUE GVAL a 0-VALUE GVAL b 0-VALUE GVAL c
    0-VALUE GVAL disc ( Used for discriminant )
    : quadraticroot ( F: a b c -- y1 y2 )
    TO c TO b TO a Pickup coefficients.
    LET disc=SQRT(b*b-4*a*c): Set discriminant.
    LET ((-b+disc)/(2*a),(-b-disc)/(2*a)): Put values on f-stack.
    ;
    ( x*x-3*x+2=0 ) LET quadraticroot (1,-3, 2) :
    CR .PRINT CR .PRINT -- prints 1 2
    ( goldenratio ) CR LET. MAX(quadraticroot (1,-1,-1)): -- 1.6180339887498948482045868343656381177203

    Great! Some of it looks like a more evolved ftran à la Julian Noble.

    The matrix notation is a bit special, but when it works for you, okay.
    I'd start with your example perhaps this way:
    #[ 1 2 ; 3 4 ] :=> A ok  \ parse matrix elements, create A and fill it
    A #.                     \ read A and print
    [ 1. 2.
    3. 4. ] ok
    A' #.                    \ read A, transpose (but don't change A) and print
    [ 1. 3.
    2. 4. ] ok
    et cetera

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Krishna Myneni on Tue Dec 19 16:43:41 2023
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    But, I want }} to do the size-specific compilation,
    eventually. To do that properly, I want to go to a dual-xt system and
    avoid explicit STATE-dependence (which has other benefits we've
    discussed in the past).

    Avoiding STATE is a good idea. However, what you have in mind seems
    to be an optimization, not something like S". Dual-xt words are good
    for stuff like S"; you can use them for optimization, but the
    intelligent COMPILE, is better for that.

    Gforth has had INTERPRET/COMPILE: (for defining a dual-xt word) for a
    quarter-century and has 9 uses of that in its image, none of them for
    optimizing (not even before we had SET-OPTIMIZER).

    Gforth has had SET-OPTIMIZER (for specifying what the intelligent
    COMPILE, does for a word) for less than a decade, and by now we have
    the following numbers of uses of SET-OPTIMIZER and related words in
    the image:

    23 SET-OPTIMIZER
    10 OPT:
    13 OPTIMIZES
    26 FOLDS

    where FOLDS can be and is used to specify the optimization of multiple
    words, in particular:

    ' fold1-0 folds drop
    ' fold1-1 folds invert abs negate >pow2
    ' fold1-1 folds 1+ 1- 2* 2/ cells cell/ cell+ cell-
    ' fold1-1 folds floats sfloats dfloats float+
    ' fold1-1 folds float/ sfloat/ dfloat/
    ' fold1-1 folds c>s w>s l>s w>< l>< x><
    ' fold1-1 folds wcwidth
    ' fold1-1 folds 0> 0= 0<
    ' fold1-2 folds dup s>d
    ' fold2-0 folds 2drop
    ' fold2-1 folds * and or xor
    ' fold2-1 folds min max umin umax
    ' fold2-1 folds nip
    ' fold2-1 folds rshift lshift arshift rol ror
    ' fold2-1 folds = > >= < <= u> u>= u< u<=
    ' fold2-1 folds d0> d0< d0=
    ' fold2-1 folds /s mods
    ' fold2-2 folds m* um* swap d2* /modf /mods u/mod bounds
    ' fold2-3 folds over tuck
    ' fold3-1 folds within select mux */f */s u*/
    ' fold3-2 folds um/mod fm/mod sm/rem du/mod */modf */mods u*/mod
    ' fold3-3 folds rot -rot
    ' fold4-1 folds d= d> d>= d< d<= du> du>= du< du<=
    ' fold4-2 folds d+ d- 2nip
    ' fold4-4 folds 2swap
    ' opt+- folds + -

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Anton Ertl on Tue Dec 19 17:53:16 2023
    On 12/19/23 10:43, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    But, I want }} to do the size-specific compilation,
    eventually. To do that properly, I want to go to a dual-xt system and
    avoid explicit STATE-dependence (which has other benefits we've
    discussed in the past).

    Avoiding STATE is a good idea. However, what you have in mind seems
    to be an optimization, not something like S". Dual-xt words are good
    for stuff like S"; you can use them for optimization, but the
    intelligent COMPILE, is better for that.


    We had a discussion about this earlier, and I did not like the design of SET-OPTIMIZER changing the behavior of COMPILE, . I don't see the
    drawback of changing the xt for compilation state as a method of
    optimization. Why add extra complexity with SET-OPTIMIZER when you don't
    have to?

    --
    Krishna

  • From minforth@21:1/5 to All on Wed Dec 20 08:05:36 2023
    mhx wrote:
    0-VALUE GVAL a 0-VALUE GVAL b 0-VALUE GVAL c
    0-VALUE GVAL disc ( Used for discriminant )
    : quadraticroot ( F: a b c -- y1 y2 )
    TO c TO b TO a Pickup coefficients.
    LET disc=SQRT(b*b-4*a*c): Set discriminant.
    LET ((-b+disc)/(2*a),(-b-disc)/(2*a)): Put values on f-stack.
    ;

    If I had your formula parser that would be here perhaps
    : QUADRATICROOT { f: a b c | d == y1 y2 }
    2.0 *to a
    let d = sqrt(fma(b,b,-2*a*c))
    let y1 = (-b+d)/a
    let y2 = (-b-d)/a ;

  • From minforth@21:1/5 to minforth on Wed Dec 20 10:22:23 2023
    minforth wrote:

    mhx wrote:
    0-VALUE GVAL a 0-VALUE GVAL b 0-VALUE GVAL c
    0-VALUE GVAL disc ( Used for discriminant )
    : quadraticroot ( F: a b c -- y1 y2 )
    TO c TO b TO a Pickup coefficients.
    LET disc=SQRT(b*b-4*a*c): Set discriminant.
    LET ((-b+disc)/(2*a),(-b-disc)/(2*a)): Put values on f-stack.
    ;

    If I had your formula parser that would be here perhaps
    : QUADRATICROOT { f: a b c | d == y1 y2 }
    2.0 *to a
    let d = sqrt(fma(b,b,-2*a*c))
    let y1 = (-b+d)/a
    let y2 = (-b-d)/a ;

    Or with Noble's ftran and ANS locals (extended for fp-numbers)
    : QUADROOT {: F: a b c | d -- y1 y2 :}
    ft" sqrt(b*b-4*a*c)" to d
    ft" (-b+d)/(2*a)" \ y1
    ft" (-b-d)/(2*a)" ; \ y1 y2
    IOW it is only a matter of personal preference for readability.
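
    For comparison, the same computation in plain postfix with fp locals
    might look as follows; the F: prefixes inside {: ... :} are a common
    extension (e.g. in Gforth), not ANS CORE, so this is a sketch rather
    than portable code:

    ```forth
    : quadroot ( F: a b c -- y1 y2 )
      {: f: a f: b f: c :}
      b b f* 4e a f* c f* f- fsqrt     \ F: d = sqrt(b*b-4*a*c)
      fdup  b fnegate f+ 2e a f* f/    \ F: d y1, y1 = (-b+d)/(2*a)
      fswap b fnegate fswap f-
      2e a f* f/ ;                     \ F: y1 y2, y2 = (-b-d)/(2*a)
    ```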

    Concerning the topic of this thread, the interesting question is
    whether the formula parser (infix to postfix translator) can be
    extended to work with mixed matrix and fp scalar types.

    Exemplary cases to be considered: matrix multiplication is not
    commutative, and symbols could be overloaded e.g. A*b means
    different things depending on which element is scalar or a matrix.

  • From mhx@21:1/5 to minforth on Wed Dec 20 15:21:20 2023
    minforth wrote:

    mhx wrote:
    0-VALUE GVAL a 0-VALUE GVAL b 0-VALUE GVAL c
    0-VALUE GVAL disc ( Used for discriminant )
    : quadraticroot ( F: a b c -- y1 y2 )
    TO c TO b TO a Pickup coefficients.
    LET disc=SQRT(b*b-4*a*c): Set discriminant.
    LET ((-b+disc)/(2*a),(-b-disc)/(2*a)): Put values on f-stack.
    ;

    If I had your formula parser that would be here perhaps
    : QUADRATICROOT { f: a b c | d == y1 y2 }
    2.0 *to a
    let d = sqrt(fma(b,b,-2*a*c))
    let y1 = (-b+d)/a
    let y2 = (-b-d)/a ;

    Yes, of course :--) However, my XOPG works with any type of
    variable, from bytes up to arbitrary-precision floats (here
    the output for 64-bit FP is shown). Although elegant (for
    some), adding arbitrary precision floats to local variables
    is too much, even for a bricklayer like me. Hence the GVAL
    global values.

    When you have arbitrary precision, it is dangerous
    to write something like "2.1 TO a", because 2.1e with the
    extended double precision of the interpreter may not round
    to exactly the 2.1 needed by, say, a 2048-bit arbitrary
    precision number. This can create very subtle bugs.

    -marcel

  • From Krishna Myneni@21:1/5 to PMF on Wed Dec 20 14:00:16 2023
    On 12/20/23 11:58, PMF wrote:
    Krishna Myneni wrote:

    On 12/19/23 10:43, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    But, I want }} to do the size-specific compilation,
    eventually. To do that properly, I want to go to a dual-xt system and
    avoid explicit STATE-dependence (which has other benefits we've
    discussed in the past).

    Avoiding STATE is a good idea.  However, what you have in mind seems
    to be an optimization, not something like S".  Dual-xt words are good
    for stuff like S"; you can use them for optimization, but the
    intelligent COMPILE, is better for that.


    We had a discussion about this earlier, and I did not like the design
    of SET-OPTIMIZER changing the behavior of COMPILE, . I don't see the
    drawback of changing the xt for compilation state as a method of
    optimization. Why add extra complexity with SET-OPTIMIZER when you
    don't have to?

    --
    Krishna

    LXF has from the beginning used the compiling XT also for optimization
    (code generation). COMPILE, just compiles a call to the XT.
    This works perfectly well and is easy to implement.
    LXF has about 400 words that generate code while compiling.


    This is my preference -- COMPILE, should simply compile the code needed
    to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    --
    Krishna

  • From minforth@21:1/5 to Krishna Myneni on Fri Dec 22 09:00:46 2023
    Krishna Myneni wrote:
    This is my preference -- COMPILE, should simply compile the code needed
    to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    Counterexample:
    Code inlining is a simple optimisation technique. Why not use it?

  • From Anton Ertl@21:1/5 to Krishna Myneni on Fri Dec 22 14:14:32 2023
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    On 12/19/23 10:43, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    But, I want }} to do the size-specific compilation,
    eventually. To do that properly, I want to go to a dual-xt system and
    avoid explicit STATE-dependence (which has other benefits we've
    discussed in the past).

    Avoiding STATE is a good idea. However, what you have in mind seems
    to be an optimization, not something like S". Dual-xt words are good
    for stuff like S"; you can use them for optimization, but the
    intelligent COMPILE, is better for that.


    We had a discussion about this earlier, and I did not like the design of SET-OPTIMIZER changing the behavior of COMPILE, .

    If it changes the behaviour of COMPILE, (rather than the
    implementation of that behaviour), that's a mistake in the use of SET-OPTIMIZER: Whatever you do, it must not change the behaviour.

    I don't see the
    drawback of changing the xt for compilation state as a method of optimization. Why add extra complexity with SET-OPTIMIZER when you don't
    have to?

    * It's a separation of concerns: SET-OPTIMIZER is for optimization and
    must not change the behaviour (i.e., if you replace the call to
    SET-OPTIMIZER with DROP, the program should still work), whereas
    SET->COMP (and related words such as INTERPRET/COMPILE:) changes the
    compilation semantics, i.e., the behaviour.

    * If you implement [COMPILE], you need to know if a word has
    non-default compilation semantics. If you have the separation of
    concerns above, that is easy: If it does not have the default
    NAME>COMPILE method (DEFAULT-NAME>COMP), it has non-default
    compilation semantics. If you are using this mechanism for a
    purpose that does not change the compilation semantics, you have to
    add a mechanism that tells the compiler about the difference, and
    the user has to provide this information in some way, too (e.g., by
    having INTERPRET/COMPILE: if the resulting word has non-default
    compilation semantics and INTERPRET/OPTIMIZE: if it has not).

    * There is a difference in performance if an xt is COMPILE,ed; with
    the intelligent COMPILE, the result is as good as going through the
    text interpreter; with INTERPRET/COMPILE:, you get a generic call to
    the xt, while the text interpreter produces better code.

    * The COMPILE, methods get the xt that is COMPILE,d (they have the
    same stack effect as COMPILE,), which helps in using the same
    implementation for several xts. E.g., FOLD1-1 is the optimizer of
    29 words (all with the stack effect ( x -- x )). INTERPRET/COMPILE:
    lacks this flexibility. I guess you could have a way that provides
    the xt or nt of the word for which you are performing the
    compilation semantics; not sure how well that would work.

    Finally, my vision for the (far-away) future is that words such as S"
    and TO go away, and with them the need for words such as
    INTERPRET/COMPILE: or (worse) STATE-smart words.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to Krishna Myneni on Fri Dec 22 16:59:49 2023
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    This is my preference -- COMPILE, should simply compile the code needed
    to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    So IYO COMPILE, should be defined as

    : compile, ( xt -- )
    postpone literal postpone execute ;

    ?

    And of course the POSTPONE EXECUTE must do something other than
    "COMPILE," the xt or EXECUTE, or this results in an endless recursion.

    Interestingly, development Gforth has this as "GENERAL-COMPILE," and
    this is the implementation of "COMPILE," if no more specialized
    version is installed. E.g., the more specialized implementation of
    COMPILE, for a constant is:

    : constant, ( xt -- )
    >body @ postpone literal ;

    This does not substitute a different xt for the given xt, it compiles
    the value of the constant as a literal. Why do you think we should
    compile with GENERAL-COMPILE, instead?
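
    The practical difference can be sketched for 5 CONSTANT C; the two
    colon definitions below imitate by hand what the two code generators
    would produce (the names are illustrative only):

    ```forth
    5 constant c
    : via-general  [ ' c ] literal execute ;  \ what GENERAL-COMPILE, emits
    : via-constant [ 5 ] literal ;            \ what CONSTANT, emits
    \ Both push 5 at run time; the second version contains no EXECUTE.
    ```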

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Krishna Myneni@21:1/5 to minforth on Fri Dec 22 18:43:10 2023
    On 12/22/23 03:00, minforth wrote:
    Krishna Myneni wrote:
    This is my preference -- COMPILE, should simply compile the code
    needed to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    Counterexample:
    Code inlining is a simple optimisation technique. Why not use it?

    I do use code inlining. I don't understand what you mean by this being a counter-example. For example, you may need to give interpretation
    semantics for a word that performs a sequence of words, but the
    compilation semantics does inline compiling of the sequence.

    Or am I misunderstanding you?

    --
    Krishna

  • From none) (albert@21:1/5 to krishna.myneni@ccreweb.org on Sat Dec 23 10:15:33 2023
    In article <um5aeu$1p3uf$1@dont-email.me>,
    Krishna Myneni <krishna.myneni@ccreweb.org> wrote:
    On 12/22/23 03:00, minforth wrote:
    Krishna Myneni wrote:
    This is my preference -- COMPILE, should simply compile the code
    needed to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    Counterexample:
    Code inlining is a simple optimisation technique. Why not use it?

    I do use code inlining. I don't understand what you mean by this being a counter-example. For example, you may need to give interpretation
    semantics for a word that performs a sequence of words, but the
    compilation semantics does inline compiling of the sequence.

    Or am I misunderstanding you?

    Look at my :I
    EXAMPLE:
    :I R> 1+ >R ;
    the code is just inlined without engaging the return stack


    :I compiles things and in compilation state it copies until (;)
    (an alias for exit), otherwise it executes a normal definition.

    : :I
       CREATE IMMEDIATE ] LATEST HIDDEN !CSP DOES> \ Just like :
       STATE @ IF
         BEGIN $@ DUP '(;) <> WHILE , REPEAT 2DROP
       ELSE
         >R \ Just like :
       THEN ;

    [In this simple ciforth there is no distinction between , and
    COMPILE, . ]

    Now it is beyond me how this has anything to do with "optimisation",
    in the sense of replacing code sequences with faster-executing code
    sequences.



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From minforth@21:1/5 to Krishna Myneni on Sat Dec 23 12:20:26 2023
    Krishna Myneni wrote:
    I do use code inlining. I don't understand what you mean by this being a counter-example. For example, you may need to give interpretation
    semantics for a word that performs a sequence of words, but the
    compilation semantics does inline compiling of the sequence.

    I am a little surprised that this is a topic of discussion. See also: https://ethz.ch/content/dam/ethz/special-interest/infk/ast-dam/documents/Theodoridis-ASPLOS22-Inlining-Paper.pdf

    Well, I consider compiling a single xt to be equivalent to compiling a
    function call. Including the function body (apart from reducing function
    call overhead)
    a) eliminates the xt, so it is no longer there for decompilation or introspection, and
    b) creates a wider area for further optimisations.
    For example, peephole optimisation can now extend across the boundaries
    of the host code and inlined code, et cetera.

  • From Krishna Myneni@21:1/5 to Anton Ertl on Sat Dec 23 09:01:50 2023
    On 12/22/23 10:59, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    This is my preference -- COMPILE, should simply compile the code needed
    to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    So IYO COMPILE, should be defined as

    : compile, ( xt -- )
    postpone literal postpone execute ;

    ?

    The definition of COMPILE, should be equivalent to that, yes. The actual implementation may be system dependent. This appears to me to be
    consistent with the standard,

    6.2.0945
    COMPILE,
    “compile-comma”
    CORE EXT
    Interpretation: Interpretation semantics for this word are undefined.
    Execution: ( xt -- )
    Append the execution semantics of the definition represented by xt to
    the execution semantics of the current definition.

    And, of course, the execution semantics of xt may be performed by EXECUTE.

    And of course the POSTPONE EXECUTE must do something other than
    "COMPILE," the xt or EXECUTE, or this results in an endless recursion.

    That is the programmer's concern.

    Interestingly, development Gforth has this as "GENERAL-COMPILE," and
    this is the implementation of "COMPILE," if no more specialized
    version is installed. E.g., the more specialized implementation of
    COMPILE, for a constant is:

    : constant, ( xt -- )
    >body @ postpone literal ;


    Why do you need a specialized version of COMPILE, for a constant? The
    following should work for a constant named C :

    In interpreter state:

    ' C EXECUTE

    In compilation state:

    : compile-C ['] C compile, ; immediate

    : test compile-C . ;

    --

    5 constant C
    ok
    ' C execute
    ok
    .s

    5
    ok
    : compile-C ['] C compile, ; immediate
    ok
    : test compile-C . ;
    ok
    test
    5 ok

    Of course, the point of our discussion is that you can make optimized
    code for C in compilation state.

    This does not substitute a different xt for the given xt, it compiles
    the value of the constant as a literal. Why do you think we should
    compile with GENERAL-COMPILE, instead?


    In a dual-xt system, you are free to specify whatever compilation
    sequence you wish -- isn't that the point?

    --
    Krishna

  • From Krishna Myneni@21:1/5 to minforth on Sat Dec 23 09:05:03 2023
    On 12/23/23 06:20, minforth wrote:
    Krishna Myneni wrote:
    I do use code inlining. I don't understand what you mean by this being
    a counter-example. For example, you may need to give interpretation
    semantics for a word that performs a sequence of words, but the
    compilation semantics does inline compiling of the sequence.

    I am a little surprised that this is a topic of discussion. See also: https://ethz.ch/content/dam/ethz/special-interest/infk/ast-dam/documents/Theodoridis-ASPLOS22-Inlining-Paper.pdf

    Well, I consider compiling a single xt to be equivalent to compiling a function call. Including the function body (apart from reducing function
    call overhead)
    a) eliminates the xt, so it is no longer there for decompilation or introspection, and
    b) creates a wider area for further optimisations.
    For example, peephole optimisation can now extend across the boundaries
    of the host code and inlined code, et cetera.

    Aren't we saying the same thing? Inlining avoids compiling a function
    call by including the function body in the function into which it is
    inlined.

    I didn't understand why you claimed inlining is a counterexample to
    using dual semantics for optimization.

    --
    KM

  • From minforth@21:1/5 to All on Sat Dec 23 17:17:25 2023
    Quoting you:
    Krishna Myneni wrote:
    This is my preference -- COMPILE, should simply compile the code needed
    to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    Generally speaking, inlining replaces the original xt with different xts
    (the inlined code) or machine code. But okay, just a misunderstanding.

    Merry Xmas !! :-)

  • From Krishna Myneni@21:1/5 to minforth on Sat Dec 23 14:52:39 2023
    On 12/23/23 11:17, minforth wrote:
    Quoting you:
    Krishna Myneni wrote:
    This is my preference -- COMPILE, should simply compile the code
    needed to execute the xt given to it and not do something clever by
    substituting a different xt for it.

    Generally speaking, inlining replaces the original xt with different xts (the inlined code) or machine code. But okay, just a misunderstanding.

    Merry Xmas !!  :-)

    Likewise!

    --
    Krishna

  • From Krishna Myneni@21:1/5 to Anton Ertl on Thu Dec 28 22:27:07 2023
    On 12/22/23 08:14, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    On 12/19/23 10:43, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    But, I want }} to do the size-specific compilation,
    eventually. To do that properly, I want to go to a dual-xt system and
    avoid explicit STATE-dependence (which has other benefits we've
    discussed in the past).

    Avoiding STATE is a good idea. However, what you have in mind seems
    to be an optimization, not something like S". Dual-xt words are good
    for stuff like S"; you can use them for optimization, but the
    intelligent COMPILE, is better for that.


    We had a discussion about this earlier, and I did not like the design of
    SET-OPTIMIZER changing the behavior of COMPILE, .

    If it changes the behaviour of COMPILE, (rather than the
    implementation of that behaviour), that's a mistake in the use of SET-OPTIMIZER: Whatever you do, it must not change the behaviour.

    I don't see the
    drawback of changing the xt for compilation state as a method of
    optimization. Why add extra complexity with SET-OPTIMIZER when you don't
    have to?

    * It's a separation of concerns: SET-OPTIMIZER is for optimization and
    must not change the behaviour (i.e., if you replace the call to
    SET-OPTIMIZER with DROP, the program should still work), whereas
    SET->COMP (and related words such as INTERPRET/COMPILE:) changes the
    compilation semantics, i.e., the behaviour.

    * If you implement [COMPILE], you need to know if a word has
    non-default compilation semantics. If you have the separation of
    concerns above, that is easy: If it does not have the default
    NAME>COMPILE method (DEFAULT-NAME>COMP), it has non-default
    compilation semantics. If you are using this mechanism for a
    purpose that does not change the compilation semantics, you have to
    add a mechanism that tells the compiler about the difference, and
    the user has to provide this information in some way, too (e.g., by
    having INTERPRET/COMPILE: if the resulting word has non-default
    compilation semantics and INTERPRET/OPTIMIZE: if it has not).

    * There is a difference in performance if an xt is COMPILE,ed; with
    the intelligent COMPILE, the result is as good as going through the
    text interpreter; with INTERPRET/COMPILE:, you get a generic call to
    the xt, while the text interpreter produces better code.

    * The COMPILE, methods get the xt that is COMPILE,d (they have the
    same stack effect as COMPILE,), which helps in using the same
    implementation for several xts. E.g., FOLD1-1 is the optimizer of
    29 words (all with the stack effect ( x -- x )). INTERPRET/COMPILE:
    lacks this flexibility. I guess you could have a way that provides
    the xt or nt of the word for which you are performing the
    compilation semantics; not sure how well that would work.

    Finally, my vision for the (far-away) future is that words such as S"
    and TO go away, and with them the need for words such as
    INTERPRET/COMPILE: or (worse) STATE-smart words.


    I have not had time to respond to this, but will do so in more detail in
    a separate thread. This is an important topic, and it will take me a
    little while to digest the points you make above, as well as look up our previous discussion about SET-OPTIMIZER.

    My impression is that, in Gforth, you implement xt for a word as follows:

    { xt-interp, xt-compile [,xt-opt] }

    and SET-OPTIMIZER allows you to specify xt-opt. Perhaps some pseudo-code
    for INTERPRET will be helpful in understanding how compilation occurs in Gforth.

    --
    Krishna

  • From Anton Ertl@21:1/5 to Krishna Myneni on Fri Dec 29 08:21:51 2023
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    My impression is that, in Gforth, you implement xt for a word as follows:

    { xt-interp, xt-compile [,xt-opt] }

    and SET-OPTIMIZER allows you to specify xt-opt.

    Actually, Gforth has 8 header methods (method selectors) and 9 setter
    words for specifying the method implementation for a method selector
    in the current word:

    execute ( ... xt -- ... ) set-execute ( code-address -- )
    execute ( ... xt -- ... ) set-does> ( xt -- )
    compile, ( xt -- ) set-optimizer ( xt -- )
    name>interpret ( nt -- xt ) set->int ( xt -- )
    name>compile ( nt -- xt1 xt2 ) set->comp ( xt -- )
    (to) ( val xt -- ) set-to ( xt -- )
    defer@ ( xt1 -- xt2 ) set-defer@ ( xt -- )
    name>string ( nt -- c-addr u ) set->string ( xt -- )
    name>link ( nt1 -- nt2|0 ) set->link ( xt -- )

    In particular, you change the compilation semantics by specifying the NAME>COMPILE method implementation, which is somewhat different from
    providing xt-compsem (your xt-compile). Based on that, we also
    provide the convenience words:

    set-compsem ( xt -- )
    compsem: ( -- ) \ dedicated to Stephen Pelc
    intsem: ( -- ) \ special dedication to Stephen Pelc
    interpret/compile: ( xt-int xt-comp "name" -- )

    Usage examples, all equivalent:

    :noname ." compiling" ;
    : foo ." interpreting" ; set-compsem

    : foo ." interpreting" ; compsem: ." compiling" ;
    : foo ." compiling" ; intsem: ." interpreting" ;

    : foo-int ." interpreting" ;
    : foo-comp ." compiling" ;
    ' foo-int ' foo-comp interpret/compile: foo
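
    With any of the three equivalent definitions above, FOO would then
    behave as in this sketched session:

    ```forth
    foo           \ interpretation semantics: prints "interpreting"
    : bar foo ;   \ compilation semantics run here: prints "compiling"
    ```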

    Note that several people (including me) have recommended to define,
    for every dual-semantics word like FOO, also FOO-INT and FOO-COMP.

    Usage:
    9 interpret/compile:
    1 set-compsem (in the definition of COMPSEM:)
    1 compsem:
    0 intsem:
    23 set-optimizer (and there are other, derived words)

    Read all about it in:

    http://www.euroforth.org/ef19/papers/paysan.pdf

    Since then, we have moved nt and xt to point to the body.

    You can see the header methods (except the code field) with

    .hm ( nt -- )

    E.g., for seeing the code generator of "!", you say

    ``! .hm

    Let's compare the header methods of "!", ":", and "TO" and the
    constant K-F1:

    ``! .hm ``: .hm ``to .hm ``k-f1 .hm
    opt: peephole-compile, :, :, constant,
    to: no-to no-to no-to no-to
    extra: $0 $0 $0 $0
    int: noop noop a>int noop
    comp: default-name>comp default-name>comp i/c>comp default-name>comp
    >string: named>string named>string named>string named>string
    link: named>link named>link named>link named>link

    The "extra:" field is used for SET-DOES>.

    Perhaps some pseudo-code
    for INTERPRET will be helpful in understanding how compilation occurs in Gforth.

    For dictionary words, what happens is, in essence:

    parse-name find-name dup 0= #-13 and throw ( nt )
    state @ if
    name>compile
    else
    name>interpret
    then
    execute

    You won't find it in this form in the current text interpreter,
    because the text interpreter is now written to cover all kinds of
    recognizers.

    What may also be interesting to you is what happens then: For words
    with default interpretation and compilation semantics (most words),
    the NAME>COMPILE implementation is (simplified)

    : default-name>comp ( nt -- xt1 xt2 )
    name>interpret ['] compile, ;

    For an immediate word, the NAME>COMPILE implementation is:

    : imm>comp ( nt -- xt1 xt2 )
    name>interpret ['] execute ;
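
    So for an immediate word the text interpreter ends up EXECUTEing it
    during compilation, matching the classical behaviour; a minimal sketch:

    ```forth
    : hi ." hello" ; immediate
    : greet hi ;   \ prints "hello" while GREET is being compiled
    ```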

    Not just optimization, but all code generation happens in the
    implementations of COMPILE,. E.g., ":,", "CONSTANT," and
    "PEEPHOLE-COMPILE," are (simplified):

    : :, ( xt -- ) >body ['] call peephole-compile, , ;

    : constant, ( xt -- ) >body @ postpone literal ;

    : peephole-compile, ( xt -- )
    \ compile xt, appending its code to the current dynamic superinstruction
    lits, here swap , compile-prim1 ;

    LITS, ensures that any literals on the literal stack are compiled
    before the primitive (this is part of the constant folding
    implementation), and COMPILE-PRIM1 ( addr -- ) is in the C part of
    Gforth; in gforth-fast, it performs stack caching, combines primitives
    into superinstructions, and performs native-code generation if these optimizations are enabled (they are by default), but at the very
    least, turns the code-field address (ITC) into a code address (DTC).

    Note that in the gforth and gforth-fast engines, "," alone does not
    work for compiling a word, because these engines use hybrid
    direct/indirect threaded code, which requires primitive-centric code
    for colon definitions: In a colon definition, every word is compiled
    to a primitive, possibly followed by an immediate argument; e.g., a
    colon definition is compiled to the primitive CALL followed by the
    address of the called word. See:

    @InProceedings{ertl02,
    author = {M. Anton Ertl},
    title = {Threaded Code Variations and Optimizations (Extended
    Version)},
    booktitle = {Forth-Tagung 2002},
    year = {2002},
    address = {Garmisch-Partenkirchen},
    url = {http://www.complang.tuwien.ac.at/papers/ertl02.ps.gz},
    abstract = {Forth has been traditionally implemented as indirect
    threaded code, where the code for non-primitives is
    the code-field address of the word. To get the
    maximum benefit from combining sequences of
    primitives into superinstructions, the code produced
    for a non-primitive should be a primitive followed
    by a parameter (e.g., \code{lit} \emph{addr} for
    variables). This paper takes a look at the steps
    from a traditional threaded-code implementation to
    superinstructions, and at the size and speed effects
    of the various steps.\comment{It also compares these
    variants of Gforth to various other Forth
    implementations on contemporary machines.} The use
    of superinstructions gives speedups of up to a
    factor of 2 on large benchmarks on processors with
    branch target buffers, but requires more space for
    the primitives and the optimization tables, and also
    a little more space for the threaded code.}
    }

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Dec 31 10:27:20 2023
    On 12/18/23 18:48, Krishna Myneni wrote:
    On 12/18/23 07:37, Krishna Myneni wrote:
    ...
    I also introduced the word "*+" in the last commit in kForth-64.

    *+ ( a b c -- n )  \ n = a*b + c

    Note that *+ is not the same as the sequence "* +", ...

    I expect the floating point version, F*+, to be equally, if not more
    useful for improving readability of fp code and increasing efficiency.

    F*+ ( F: r1 r2 r3 -- r )  \ r = r1*r2 + r3

    F*+ provides an intrinsic scalar linear transformation. I have not yet
    added this word.


    I need to correct my terminology in the prior statement. It is incorrect
    to refer to the word F*+ (and its integer counterpart, *+) as a scalar
    "linear transformation." They allow efficient calculation of linear
    functions, of the form f(x) = a*x + b.

    A "linear transformation" is something else. A function effecting a
    linear transformation must have the property f(x=0) = 0, which is most certainly not true in the general case for F*+ or *+.

    --
    Krishna

  • From minforth@21:1/5 to Krishna Myneni on Tue Jan 2 13:08:48 2024
    Krishna Myneni wrote:
    I need to correct my terminology in the prior statement. It is incorrect
    to refer to the word F*+ (and its integer counterpart, *+) as a scalar "linear transformation." They allow efficient calculation of linear functions, of the form f(x) = a*x + b.

    A "linear transformation" is something else. A function effecting a
    linear transformation must have the property f(x=0) = 0, which is most certainly not true in the general case for F*+ or *+.

    Mathematically correct. FMA or F*+ is used row-wise or column-wise within
    many matrix operations e.g. Gaussian elimination.

    Unsurprisingly many modern CPUs even support vectorized FMA.
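    As an illustration, the row update used in elimination,
    y[i] := alpha*x[i] + y[i], can be written with F*+ over flat arrays
    of FLOATs. The name FAXPY and the flat-array layout are assumptions
    for this sketch, not FSL code:

    : faxpy ( x y n -- ) ( F: alpha -- )
      0 ?do
        fdup                    \ F: alpha alpha
        over i floats + f@      \ F: alpha alpha x[i]
        dup  i floats + f@      \ F: alpha alpha x[i] y[i]
        f*+                     \ F: alpha alpha*x[i]+y[i]
        dup i floats + f!
      loop  2drop fdrop ;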

  • From Krishna Myneni@21:1/5 to Anton Ertl on Tue Jan 2 20:11:37 2024
    On 12/29/23 02:21, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    My impression is that, in Gforth, you implement xt for a word as follows:

    { xt-interp, xt-compile [,xt-opt] }

    and SET-OPTIMIZER allows you to specify xt-opt.

    Actually, Gforth has 8 header methods (method selectors) and 9 setter
    words for specifying the method implementation for a method selector
    in the current word:

    execute ( ... xt -- ... ) set-execute ( code-address -- )
    execute ( ... xt -- ... ) set-does> ( xt -- )
    compile, ( xt -- ) set-optimizer ( xt -- )
    name>interpret ( nt -- xt ) set->int ( xt -- )
    name>compile ( nt -- xt1 xt2 ) set->comp ( xt -- )
    (to) ( val xt -- ) set-to ( xt -- )
    defer@ ( xt1 -- xt2 ) set-defer@ ( xt -- )
    name>string ( nt -- c-addr u ) set->string ( xt -- )
    name>link ( nt1 -- nt2|0 ) set->link ( xt -- )

    In particular, you change the compilation semantics by specifying the NAME>COMPILE method implementation, which is somewhat different from providing xt-compsem (your xt-compile). Based on that, we also
    provide the convenience words:

    set-compsem ( xt -- )
    compsem: ( -- ) \ dedicated to Stephen Pelc
    intsem: ( -- ) \ special dedication to Stephen Pelc
    interpret/compile: ( xt-int xt-comp "name" -- )

    Usage examples, all equivalent:

    :noname ." compiling" ;
    : foo ." interpreting" ; set-compsem

    : foo ." interpreting" ; compsem: ." compiling" ;
    : foo ." compiling" ; intsem: ." interpreting" ;

    : foo-int ." interpreting" ;
    : foo-comp ." compiling" ;
    ' foo-int ' foo-comp interpret/compile: foo

    Note that several people (including me) have recommended to define,
    for every dual-semantics word like FOO, also FOO-INT and FOO-COMP.

    Usage:
    9 interpret/compile:
    1 set-compsem (in the definition of COMPSEM:)
    1 compsem:
    0 intsem:
    23 set-optimizer (and there are other, derived words)

    Read all about it in:

    http://www.euroforth.org/ef19/papers/paysan.pdf

    Since then, we have moved nt and xt to point to the body.

    You can see the header methods (except the code field) with

    .hm ( nt -- )

    E.g., for seeing the code generator of "!", you say

    ``! .hm

    Let's compare the header methods of "!", ":", and "TO" and the
    constant K-F1:

    ``! .hm ``: .hm ``to .hm ``k-f1 .hm
    opt: peephole-compile, :, :, constant,
    to: no-to no-to no-to no-to
    extra: $0 $0 $0 $0
    int: noop noop a>int noop
    comp: default-name>comp default-name>comp i/c>comp default-name>comp
    string: named>string named>string named>string named>string
    link: named>link named>link named>link named>link

    The "extra:" field is used for SET-DOES>.

    Perhaps some pseudo-code
    for INTERPRET will be helpful in understanding how compilation occurs in
    Gforth.

    For dictionary words, what happens is, in essence:

    parse-name find-name dup 0= #-13 and throw ( nt )
    state @ if
    name>compile
    else
    name>interpret
    then
    execute

    You won't find it in this form in the current text interpreter,
    because the text interpreter is now written to cover all kinds of recognizers.

    What may also be interesting to you is what happens then: For words
    with default interpretation and compilation semantics (most words),
    the NAME>COMPILE implementation is (simplified)

    : default-name>comp ( nt -- xt1 xt2 )
    name>interpret ['] compile, ;

    For an immediate word, the NAME>COMPILE implementation is:

    : imm>comp ( nt -- xt1 xt2 )
    name>interpret ['] execute ;
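    The pair ( xt1 xt2 ) returned by NAME>COMPILE is then consumed with
    EXECUTE, so performing a word's compilation semantics boils down to
    (a sketch; COMPILE-WORD is an illustrative name, not a Gforth word):

    : compile-word ( nt -- ) name>compile execute ;

    For a default word this runs COMPILE, on the word's xt; for an
    immediate word it runs the xt directly at compile time.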

    Not just optimization, but all code generation happens in the
    implementations of COMPILE,. E.g., ":,", "CONSTANT," and
    "PEEPHOLE-COMPILE," are (simplified):

    : :, ( xt -- ) >body ['] call peephole-compile, , ;

    : constant, ( xt -- ) >body @ postpone literal ;

    : peephole-compile, ( xt -- )
    \ compile xt, appending its code to the current dynamic superinstruction
    lits, here swap , compile-prim1 ;

    LITS, ensures that any literals on the literal stack are compiled
    before the primitive (this is part of the constant folding
    implementation), and COMPILE-PRIM1 ( addr -- ) is in the C part of
    Gforth; in gforth-fast, it performs stack caching, combines primitives
    into superinstructions, and performs native-code generation if these optimizations are enabled (they are by default), but at the very
    least, turns the code-field address (ITC) into a code address (DTC).

    Note that in the gforth and gforth-fast engines, "," alone does not
    work for compiling a word, because these engines use hybrid
    direct/indirect threaded code, which requires primitive-centric code
    for colon definitions: In a colon definition, every word is compiled
    to a primitive, possibly followed by an immediate argument; e.g., a
    colon definition is compiled to the primitive CALL followed by the
    address of the called word. See:

    @InProceedings{ertl02,
    author = {M. Anton Ertl},
    title = {Threaded Code Variations and Optimizations (Extended
    Version)},
    booktitle = {Forth-Tagung 2002},
    year = {2002},
    address = {Garmisch-Partenkirchen},
    url = {http://www.complang.tuwien.ac.at/papers/ertl02.ps.gz},
    abstract = {Forth has been traditionally implemented as indirect
    threaded code, where the code for non-primitives is
    the code-field address of the word. To get the
    maximum benefit from combining sequences of
    primitives into superinstructions, the code produced
    for a non-primitive should be a primitive followed
    by a parameter (e.g., \code{lit} \emph{addr} for
    variables). This paper takes a look at the steps
    from a traditional threaded-code implementation to
    superinstructions, and at the size and speed effects
    of the various steps.\comment{It also compares these
    variants of Gforth to various other Forth
    implementations on contemporary machines.} The use
    of superinstructions gives speedups of up to a
    factor of 2 on large benchmarks on processors with
    branch target buffers, but requires more space for
    the primitives and the optimization tables, and also
    a little more space for the threaded code.}
    }

    - anton

    Thank you for the details. There is a good deal to absorb here.

    --
    Krishna

  • From minforth@21:1/5 to Anton Ertl on Wed Jan 3 13:42:26 2024
    Anton Ertl wrote:
    Not just optimization, but all code generation happens in the
    implementations of COMPILE,. E.g., ":,", "CONSTANT," and
    "PEEPHOLE-COMPILE," are (simplified):

    : :, ( xt -- ) >body ['] call peephole-compile, , ;

    : constant, ( xt -- ) >body @ postpone literal ;

    : peephole-compile, ( xt -- )
    \ compile xt, appending its code to the current dynamic superinstruction
    lits, here swap , compile-prim1 ;

    LITS, ensures that any literals on the literal stack are compiled
    before the primitive (this is part of the constant folding
    implementation), and COMPILE-PRIM1 ( addr -- ) is in the C part of
    Gforth; in gforth-fast, it performs stack caching, combines primitives
    into superinstructions, and performs native-code generation if these optimizations are enabled (they are by default), but at the very
    least, turns the code-field address (ITC) into a code address (DTC).

    For constant folding, superinstructions and peephole optimisation,
    my compiler has a FIFO token queue for delayed compilation.
    Optimisations are based on simple pattern recognition of the queue content. When active, this is fully automatic and does not require complicated contortions and a bag of special words.

  • From Anton Ertl@21:1/5 to minforth on Thu Jan 4 11:33:04 2024
    minforth@gmx.net (minforth) writes:
    Anton Ertl wrote:
    Not just optimization, but all code generation happens in the
    implementations of COMPILE,. E.g., ":,", "CONSTANT," and
    "PEEPHOLE-COMPILE," are (simplified):

    : :, ( xt -- ) >body ['] call peephole-compile, , ;

    : constant, ( xt -- ) >body @ postpone literal ;

    : peephole-compile, ( xt -- )
    \ compile xt, appending its code to the current dynamic superinstruction
    lits, here swap , compile-prim1 ;

    LITS, ensures that any literals on the literal stack are compiled
    before the primitive (this is part of the constant folding
    implementation), and COMPILE-PRIM1 ( addr -- ) is in the C part of
    Gforth; in gforth-fast, it performs stack caching, combines primitives
    into superinstructions, and performs native-code generation if these
    optimizations are enabled (they are by default), but at the very
    least, turns the code-field address (ITC) into a code address (DTC).

    For constant folding, superinstructions and peephole optimisation,
    my compiler has a FIFO token queue for delayed compilation.
    Optimisations are based on simple pattern recognition of the queue content.
    When active, this is fully automatic and does not require complicated
    contortions and a bag of special words.

    No words for these optimizations in this compiler? So they are
    written in a different language than Forth.

    Gforth also has that, and COMPILE-PRIM1 is the interface to it. It
    combines primitives into superinstructions, it optimizes stack
    cacheing and it optimizes many instruction-pointer updates away. It
    does not use a FIFO, but compiles a superblock at a time. It is quite complicated (contorted?), but it does produce good results. You can
    see the results by disabling the optimizations. Here are results on a
    Ryzen 5800X.

    sieve bubble matrix fib fft
    0.037 0.035 0.013 0.031 0.021 gforth-fast
    0.058 0.061 0.034 0.054 0.020 gforth-fast --ss-states=0 --ss-number=0 --opt-ip-updates=0

    Gforth implements constant folding and some related optimizations,
    such as optimizing "2 PICK" to "THIRD" on the Forth level, and that,
    of course means that there is "a bag of special words". Matthias Koch
    invented this approach. It uses a literal stack, and "LITERAL" pushes
    its argument to that stack rather than compiling a literal.
    "COMPILE," implementations of many words look at the literal stack and
    perform constant folding if all arguments are constant, or sometimes
    other optimizations (e.g., of 2 PICK) if some of the arguments are
    constant.
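    For a binary primitive such as +, the folding branch might look like
    this (a sketch using the names mentioned above, not Gforth's actual
    source):

    :noname ( xt -- )
      lits# 2 u>= if
        drop lits> lits> + >lits exit then
      peephole-compile, ;
    optimizes +

    With two known values on the literal stack, the addition is performed
    at compile time and the result is pushed back as a literal; otherwise
    the primitive is compiled as usual.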

    This is actually a simple and straightforward optimization, and has
    the nice property that most of the optimization code is very local to
    the word that is optimized. E.g., the 2 PICK optimization looks as
    follows:

    :noname ( xt -- )
    lits# 1 u>= if
    lits> case
    0 of postpone dup drop exit endof
    1 of postpone over drop exit endof
    2 of postpone third drop exit endof
    3 of postpone fourth drop exit endof
    dup >lits
    endcase
    then
    peephole-compile, ;
    optimizes pick

    And note that DUP also has a constant-folding optimization, but that's
    local to DUP. So if you actually compile

    : foo 3 0 pick ;

    you get the same code as with

    : foo 3 3 ;

    because first PICK is optimized to "COMPILE," DUP, and then DUP is
    optimized. I doubt that this particular optimization sequence will
    ever happen, but there have been other cases where I have been
    positively surprised how far this simple optimization got us.

    Anyway, back to the "special words". I guess you mean "LITS,". Yes,
    there are some places where "LITS," has to be used; these are 4 places
    in Gforth, three of them corresponding to code generation boundaries
    (BEGIN, the start of a closure, and BASIC-BLOCK-END), and the use in "PEEPHOLE-COMPILE," just before calling the lower-level compiler
    COMPILE-PRIM1 shown above. If you look at it from a higher level
    (e.g., at the level of COMPILE,), it's "fully automatic".

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023

  • From Anton Ertl@21:1/5 to dxf on Fri Jan 5 11:27:15 2024
    dxf <dxforth@gmail.com> writes:
    On 4/01/2024 10:33 pm, Anton Ertl wrote:
    :noname ( xt -- )
    lits# 1 u>= if
    lits> case
    0 of postpone dup drop exit endof
    1 of postpone over drop exit endof
    2 of postpone third drop exit endof
    3 of postpone fourth drop exit endof
    dup >lits
    endcase
    then
    peephole-compile, ;
    optimizes pick

    What this demonstrates is how badly ANS CASE needs optimization.
    PICK not so much.

    What makes you think so?

    Sure, this could be rewritten as:

    create npicks ' dup , ' over , ' third , ' fourth ,

    :noname ( xt -- )
    lits# 1 u>= if
    lits> dup 4 u< if
    cells npicks + @ compile, exit then
    >lits then
    peephole-compile, ;
    optimizes pick

    and the result would be slightly faster when it is used. But PICK is
    compiled not that often, so why optimize it? Especially given that,
    if the optimization were really useful, one could just write the latter
    code.

    Another question in this context is which version is easier to write
    correctly and easier to read and understand.

    - anton

  • From albert@21:1/5 to dxforth@gmail.com on Sat Jan 6 14:14:33 2024
    In article <unaf0p$b7ql$1@dont-email.me>, dxf <dxforth@gmail.com> wrote:
    On 5/01/2024 10:27 pm, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    On 4/01/2024 10:33 pm, Anton Ertl wrote:
    :noname ( xt -- )
    lits# 1 u>= if
    lits> case
    0 of postpone dup drop exit endof
    1 of postpone over drop exit endof
    2 of postpone third drop exit endof
    3 of postpone fourth drop exit endof
    dup >lits
    endcase
    then
    peephole-compile, ;
    optimizes pick

    What this demonstrates is how badly ANS CASE needs optimization.
    PICK not so much.

    What makes you think so?

    Is it not obvious to all that ENDOF jumps in the above are compiled but
    never used? I'm curious. What is it about ANS-FORTH that users and
    implementers alike have become so uncritical?

    Sure, this could be rewritten as:

    create npicks ' dup , ' over , ' third , ' fourth ,

    :noname ( xt -- )
    lits# 1 u>= if
    lits> dup 4 u< if
    cells npicks + @ compile, exit then
    >lits then
    peephole-compile, ;
    optimizes pick

    and the result would be slightly faster when it is used. But PICK is
    compiled not that often, so why optimize it? Especially given that,
    if the optimization were really useful, one could just write the latter
    code.

    Don't make a computer do what intelligence will do anyway.

    Another question in this context is which version is easier to write
    correctly and easier to read and understand.

    When the cases are 0 1 2 3 etc execution tables are easy to write and
    overall code is minimal. But such is not often the case.

    Optimiser for each word? IMHO this is peephole gone too far.
    I prefer an optimisation table.

    Adding
    "
    0 PICK | DUP |
    1 PICK | OVER |
    2 PICK | ROT DUP >R ROT ROT R> |
    "
    is easy enough.
    Is it worth it?

    From my optimiser:

    : (MATCH-TABLE) |
    \ `` MATCH-TABLE'' points here :
    'P EXECUTE | P | \ Execute optimisation
    'P + 'P + | 'P 'P + + | \ Associativity optimisation
    'P 'P D+ 'P 'P D+ | 'P 'P 'P 'P D+ D+ | \ Associativity optimisation
    'P + 'P - | 'P 'P - + |
    'P - 'P + | 'P 'P - - |
    'P - 'P - | 'P 'P + - |
    'P @ 'P ! | NOOP |
    'P ! 'P @ | DUP 'P ! |
    'P M* DROP 'P M* DROP | 'P 'P M* DROP M* DROP | \ Invalid if last drop removed!
    'P OR 'P OR | 'P 'P OR OR |
    'P AND 'P AND | 'P 'P AND AND |
    'P XOR 'P XOR | 'P 'P XOR XOR |
    [ 0 ]L + | NOOP | \ Shortcut evaluations
    [ 0 ]L - | NOOP |
    [ 0 ]L M* DROP | DROP 0 |
    [ 0 ]L OR | NOOP |
    [ 0 ]L AND | DROP 0 |
    [ 0 ]L XOR | NOOP |
    [ 1 ]L M* DROP | NOOP |
    [ 1 ]L / | NOOP |
    [ -1 ]L M* DROP | NEGATE |
    [ -1 ]L / | NEGATE |
    [ -1 ]L OR | DROP -1 |
    [ -1 ]L AND | NOOP |
    [ -1 ]L XOR | INVERT |
    'P LSHIFT 'P LSHIFT | 'P 'P + LSHIFT | \ Distributivity optimisation
    'P RSHIFT 'P RSHIFT | 'P 'P + RSHIFT |
    [ 0 ]L 0BRANCH [ 'P , ] | NOP1 NOP1 BRANCH [ 'P , ] | \ Branch optimisation
    'P 0BRANCH [ 'P , ] | NOOP | \ Non-zero, zero is matched by previous
    BRANCH [ 0 , ] | NOOP |
    0BRANCH [ 0 , ] | DROP |
    < 0= | 1+ > |
    0= | 1- < |
    R R> | NOOP |
    R | NOOP |
    ;

    This is on top of generalised folding and before inspecting the
    machine code.
    The machine code matcher relies heavily on my ciasdis assembler

    <! !Q LEA, XO| !!T 0 {L,} ~!!T
    !Q LEA, XO| !!T 0 {L,} ~!!T !>
    <A QX: !TALLY LEA, XO| 0 L, !TALLY A>
    { bufv L@ $FFFFFF AND bufc OR!U
    bufv 3 + L@ bufv 10 + L@ + bufc 3 + L! }
    optimisation lealea-pattern

    This generates an optimisation object, with a match pattern,
    a replace pattern and a replacing xt.
    This is definitively ugly. In view of special registers treatment in
    the 86 I am at the verge of giving up.
    Making a RISCV optimiser is probably a much more fruitful endeavour.

    (The optimised version of the original byte sieve benchmark comes
    within 10 to 20 percent of the VFX and SwiftForth versions, but this is unfair.)

    Groetjes Albert



    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Anton Ertl@21:1/5 to dxf on Sat Jan 6 16:35:36 2024
    dxf <dxforth@gmail.com> writes:
    On 5/01/2024 10:27 pm, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    On 4/01/2024 10:33 pm, Anton Ertl wrote:
    :noname ( xt -- )
    lits# 1 u>= if
    lits> case
    0 of postpone dup drop exit endof
    1 of postpone over drop exit endof
    2 of postpone third drop exit endof
    3 of postpone fourth drop exit endof
    dup >lits
    endcase
    then
    peephole-compile, ;
    optimizes pick

    What this demonstrates is how badly ANS CASE needs optimization.
    PICK not so much.

    What makes you think so?

    Is it not obvious to all that ENDOF jumps in the above are compiled but
    never used?

    No. It depends on the Forth system. This optimization would be
    relatively easy to implement (see below), but I never found it
    worthwhile.

    How to implement it? When performing the compilation semantics of
    EXIT, AHEAD or AGAIN, just set a flag DEAD-CODE, and clear that flag
    when performing the compilation semantics of THEN or BEGIN, or when
    starting a new definition. Let the code-generation words ("COMPILE,",
    LITERAL etc.) skip the code generation action when DEAD-CODE is set.
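    The scheme described above can be sketched as follows (DEAD-CODE and
    ?COMPILE, are illustrative names here, not Gforth's implementation):

    variable dead-code
    : ?compile, ( xt -- )  dead-code @ if drop else compile, then ;
    \ compilation semantics of EXIT, AHEAD, AGAIN end with:  true dead-code !
    \ THEN and BEGIN (and starting a definition) perform:   false dead-code !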

    Gforth maintains DEAD-CODE for the automatic scoping of local
    variables, but does not skip code generation.

    This optimization would also eliminate the final exit in the pattern

    begin ... again ;

    - anton

  • From Krishna Myneni@21:1/5 to Krishna Myneni on Wed Jan 17 08:57:56 2024
    On 12/13/23 18:38, Krishna Myneni wrote:
    The array and matrix definitions given in the FSL utilities source,

    fsl-util.x
    or
    fsl_util.x.

    The arrays are quite nice for their flexibility in making arrays out of
    any type of data, and are useful in many instances. However, the source definitions are slow on non-optimizing Forth systems.

    I believe the design of the arrays and matrices traces back to Julian Noble's "Scientific Forth." A 1-D array is named with a trailing left
    brace "{" while 2-D matrices have a trailing double left brace, "{{".
    This allows a convenient notation

    a{ I }     \ resolves to address of array element I
    m{{ J I }} \ resolves to address of matrix element at row J, col I

    ...

    For writing code which uses FSL arrays of FLOATS, we can improve the
    efficiency significantly by defining ]F@ and ]F! . Recently, I had to
    modify my Numerov integrator module in this manner, not primarily for efficiency, but to be able to pass the address of an arbitrary element
    index in an FSL array, which is treated as the first element of a
    subarray (this can't be done with the usual FSL array indexing method).
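    Combined index-and-access words along those lines might be sketched as
    follows, assuming the array name leaves the address of element 0 and
    FLOAT-sized elements (the real FSL } fetches the element size from the
    array header, and the actual kForth definitions may differ):

    : ]F@ ( a n -- ) ( F: -- r )  floats + f@ ;  \ fetch element n
    : ]F! ( a n -- ) ( F: r -- )  floats + f! ;  \ store to element n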

    See

    https://github.com/mynenik/kForth-64/blob/master/forth-src/fsl/extras/numerov.4th

    --
    KM
