• Sieve of Atkin

    From Marcel Hendrix@21:1/5 to All on Fri May 20 05:53:24 2022
    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jan Coombs@21:1/5 to Marcel Hendrix on Fri May 20 16:48:42 2022
    On Fri, 20 May 2022 05:53:24 -0700 (PDT)
    Marcel Hendrix <mhx@iae.nl> wrote:

    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.

    The explanation is good, thanks. This seems to work, how's your python?

    This is the optimized implementation proposed by Zsolt KOVACS:
    from: https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Jan Coombs on Fri May 20 18:41:28 2022
    On Friday, May 20, 2022 at 5:48:44 PM UTC+2, Jan Coombs wrote:
    On Fri, 20 May 2022 05:53:24 -0700 (PDT)
    [..]
    This is the optimized implementation proposed by Zsolt KOVACS:
    from: https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

    Optimized? Maybe according to Python standards...

    (*
    * LANGUAGE : ANS Forth with extensions
    * PROJECT : Forth Environments
    * DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
    * CATEGORY : Example
    * AUTHOR : Marcel Hendrix
    * LAST CHANGE : 02:27:42, May 21, 2022, Marcel Hendrix
    *)


    NEEDS -miscutil

    REVISION -sieveofatkin "--- Sieve of Atkin Version 0.01 ---"

    PRIVATES

    0 VALUE P
    0 VALUE sz
    #16384 VALUE limit
    CREATE sieve PRIVATE limit 1+ ALLOT

    : SieveOfAtkin ( -- )
    sieve limit 1+ ERASE
    HERE TO P 2 , 3 , 2 TO sz
    limit SQRT 2+
    1 DO limit SQRT 2+
    1 DO J SQR 4 * I SQR + >S
    S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S
    J SQR 3 * I SQR + >S
    S limit <= S #12 MOD 7 = AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S
    J SQR 3 * I SQR - >S
    J I > S limit <= AND S #12 MOD #11 = AND IF sieve S + DUP C@ NOT SWAP C! ENDIF -S
    LOOP
    LOOP
    limit SQRT 1+ 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP
    limit 1+ 5 DO sieve I + C@ IF I , 1 +TO sz ENDIF LOOP
    P sz ;

    : PSHOW ( addr u -- ) CR 0 ?DO @+ . LOOP DROP ;

    :ABOUT CR ." Try: SieveOfAtkin PSHOW " ;

    NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

    DEPRIVE

    (* End of Source *)

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From dxforth@21:1/5 to Marcel Hendrix on Sat May 21 12:10:15 2022
    On 21/05/2022 11:41, Marcel Hendrix wrote:
    On Friday, May 20, 2022 at 5:48:44 PM UTC+2, Jan Coombs wrote:
    On Fri, 20 May 2022 05:53:24 -0700 (PDT)
    [..]
    This is the optimized implementation proposed by Zsolt KOVACS:
    from:
    https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python

    Optimized? Maybe according to Python standards...

    (*
    * LANGUAGE : ANS Forth with extensions
    *)

    "ANS Forth" ? Perhaps according to your standards :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to dxforth on Sat May 21 01:00:25 2022
    On Saturday, May 21, 2022 at 4:10:20 AM UTC+2, dxforth wrote:
    On 21/05/2022 11:41, Marcel Hendrix wrote:
    [..]
    (*
    * LANGUAGE : ANS Forth with extensions
    *)

    "ANS Forth" ? Perhaps according to your standards :)

    What is there about "with extensions" that you don't understand?

    (*
    * LANGUAGE : ANS Forth with extensions
    * PROJECT : Forth Environments
    * DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
    * CATEGORY : Example
    * AUTHOR : Marcel Hendrix
    * LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
    *)


    NEEDS -miscutil

    REVISION -sieveofatkin "--- Sieve of Atkin Version 0.02 ---"

    PRIVATES

    #1000 =: #times
    #16384 VALUE limit
    limit SQRT 2+ =: SQRT(limit+2) PRIVATE
    limit SQRT 1+ =: SQRT(limit+1) PRIVATE
    CREATE sieve PRIVATE limit 1+ ALLOT

    : toggle ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
    : f1: ( I J -- ) SQR 4 * SWAP SQR + ; PRIVATE
    : f2: ( I J -- ) SQR 3 * SWAP SQR + ; PRIVATE
    : f3: ( I J -- ) SQR 3 * SWAP SQR - ; PRIVATE

    : SieveOfAtkin ( -- u )
    sieve limit 1+ ERASE
    SQRT(limit+2)
    1 DO SQRT(limit+2)
    1 DO I J f1: >S S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S toggle ENDIF -S
    I J f2: >S S limit <= S #12 MOD 7 = AND IF S toggle ENDIF -S
    I J f3: >S J I > S limit <= AND S #12 MOD #11 = AND IF S toggle ENDIF -S
    LOOP
    LOOP
    SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP
    2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ( sz) ;

    : PRIMES CR #times DEC. ." iterations." TIMER-RESET
    0 #times 0 DO DROP SieveOfAtkin LOOP
    MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

    :ABOUT CR ." Try: SieveOfAtkin . "
    CR ." PRIMES " ;

    NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

    DEPRIVE

    (* End of Source *)


    FORTH> in
    Creating --- Sieve of Atkin Version 0.02 ---

    Try: SieveOfAtkin .
    PRIMES ok
    FORTH> SieveOfAtkin . 1900 ok
    FORTH> PRIMES
    1000 iterations.
    1900 primes found, 94 ms elapsed. ok

    FORTH> in sieve
    Creating --- SIEVE benchmark Version 1.00 ---
    Redefining `#times`
    Redefining `PRIMES`

    Sieve (100 iterations)
    Compile time Run time Primes
    Par.C 38.19 8.732 1899
    Inmos ICC 83.98 9.578 1899
    3L C 45.25 5.955 1899
    tForth (WS) 1.00 7.244 1899
    tForth (main) 1.00 9.253 1899
    iForth 386/33 0.3 5.767 1899
    iForth 486/50 0.1 1.595 1899
    iForth 486/66 0.1 1.559 1899
    iForth P5/200 0.1 0.195 1899
    iForth PIV/3G 0.1 0.076 1899

    Enter PRIMES to run the benchmark.
    ok
    FORTH> PRIMES
    1000 iterations.
    1899 primes found, 12 ms elapsed. ok
    FORTH>

    Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm. The standard algorithm needs no division, so it will work for limit > 2^64, while
    Atkin must be extended for that.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to All on Sat May 21 01:59:54 2022
    On Saturday, May 21, 2022 at 10:00:26 AM UTC+2, Marcel Hendrix wrote:
    [..]

    Probably easier to understand. Speed unchanged.

    (*
    * LANGUAGE : ANS Forth with extensions
    * PROJECT : Forth Environments
    * DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
    * CATEGORY : Example
    * AUTHOR : Marcel Hendrix
    * LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
    *)


    NEEDS -miscutil

    REVISION -sieveofatkin "--- Sieve of Atkin Version 0.03 ---"

    PRIVATES

    #1000 =: #times
    #16384 VALUE limit PRIVATE
    limit SQRT 2+ =: SQRT(limit+2) PRIVATE
    limit SQRT 1+ =: SQRT(limit+1) PRIVATE
    CREATE sieve PRIVATE limit 1+ ALLOT

    : init ( -- ) sieve limit 1+ ERASE ; PRIVATE
    : flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
    : fi1 ( I J -- ) SQR 4 * SWAP SQR + >S ; PRIVATE
    : fi2 ( I J -- ) SQR 3 * SWAP SQR + >S ; PRIVATE
    : fi3 ( I J -- ) SQR 3 * SWAP SQR - >S ; PRIVATE
    : test1 ( -- ) S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S flip ENDIF -S ; PRIVATE
    : test2 ( -- ) S limit <= S #12 MOD 7 = AND IF S flip ENDIF -S ; PRIVATE
    : +test3 ( ? -- ) S limit <= AND S #12 MOD #11 = AND IF S flip ENDIF -S ; PRIVATE
    : no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE
    : cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

    : SieveOfAtkin ( -- u )
    init
    SQRT(limit+2)
    1 DO SQRT(limit+2)
    1 DO I J fi1 test1
    I J fi2 test2
    I J fi3 J I > +test3
    LOOP
    LOOP
    no-p^2
    cnt ( sz) ;

    : PRIMES CR #times DEC. ." iterations." TIMER-RESET
    0 #times 0 DO DROP SieveOfAtkin LOOP
    MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

    :ABOUT CR ." Try: SieveOfAtkin . "
    CR ." PRIMES " ;

    NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

    DEPRIVE

    (* End of Source *)

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jali Heinonen@21:1/5 to All on Sat May 21 09:28:52 2022
    perjantai 20. toukokuuta 2022 klo 15.53.26 UTC+3 Marcel Hendrix kirjoitti:
    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.

    -marcel

    Is it really faster than sieve implementation that is wheel factorized to the max?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Jali Heinonen on Sat May 21 14:02:24 2022
    On Saturday, May 21, 2022 at 6:28:54 PM UTC+2, Jali Heinonen wrote:
    perjantai 20. toukokuuta 2022 klo 15.53.26 UTC+3 Marcel Hendrix kirjoitti:
    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.

    -marcel
    Is it really faster than sieve implementation that is wheel factorized to the max?

    Well, so much for CS.

    I adapted both Atkin and the standard Sieve so that I could run them with
    array sizes between 1k and 10GByte, with the following results:

    Standard Sieve:
    n runtime (ms) primes
    1,000 0 302
    10,000 0 2,261
    100,000 0 17,983
    1,000,000 3 148,932
    10,000,000 29 1,270,606
    100,000,000 962 11,078,936
    1,000,000,000 12695 98,222,286
    10,000,000,000 149099 882,206,715

    Atkin Sieve
    n runtime (ms) primes
    1,000 0 168
    10,000 0 1,229
    100,000 0 9,592
    1,000,000 4 78,498
    10,000,000 60 664,579
    100,000,000 1728 5,761,455
    1,000,000,000 20900 50,847,534
    10,000,000,000 301417 455,052,511

    Atkin is about 4 times slower than the Standard Sieve (50% of the
    results in twice the time). It might slowly increase for even larger
    sieves, but with 20 GBytes there is not much evidence for that yet.
    Maybe somebody with a TB of RAM has more luck.

    The jump in runtime between 0.1GB and 1GB might have something to do
    with the working-set default of a Windows program (I see no evidence
    of disk-activity but ALLOCATE takes noticeably longer).

    Here is the final SieveOfAtkin. I fixed a bug that had to do with Python's logic operators apparently taking shortcuts without being told.

    (*
    * LANGUAGE : ANS Forth with extensions
    * PROJECT : Forth Environments
    * DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
    * CATEGORY : Example
    * AUTHOR : Marcel Hendrix
    * LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
    *)


    NEEDS -miscutil

    REVISION -sieveofatkin "--- Sieve of Atkin Version 0.04 ---"

    PRIVATES

    DOC
    (*
    n runtime (ms) primes
    1,000 0 168
    10,000 0 1,229
    100,000 0 9,592
    1,000,000 4 78,498
    10,000,000 60 664,579
    100,000,000 1728 5,761,455
    1,000,000,000 20900 50,847,534
    10,000,000,000 301417 455,052,511
    *)
    ENDDOC

    DEPTH 0= [IF] CR .( Stack Empty) ABORT [THEN]
    ( #16384 ) =: limit PRIVATE
    #1 =: #times PRIVATE
    limit SQRT 2+ =: SQRT(limit+2) PRIVATE
    limit SQRT 1+ =: SQRT(limit+1) PRIVATE
    limit 1+ ALLOCATE ?ALLOCATE =: sieve PRIVATE
    limit 1+ ALLOCATE ?ALLOCATE =: mods PRIVATE

    : set# ( -- ) limit 2+ 0 DO I #12 MOD mods I + C! LOOP ; PRIVATE set# : init ( -- ) sieve limit 1+ ERASE ( set# ) ; PRIVATE
    : flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
    : fi1 ( I J -- u ) SQR 4 * SWAP SQR + ; PRIVATE
    : fi2 ( I J -- u ) SQR 3 * SWAP SQR + ; PRIVATE
    : fi3 ( I J -- u ) SQR 3 * SWAP SQR - ; PRIVATE
    : #MOD ( s -- u ) mods + C@ ; PRIVATE
    : test1 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD DUP 1 = SWAP 5 = OR IF S flip ENDIF -S ENDIF ; PRIVATE
    : test2 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD 7 = IF S flip ENDIF -S ENDIF ; PRIVATE
    : +test3 ( s ? -- ) OVER limit > OR IF DROP ELSE >S S #MOD #11 = IF S flip ENDIF -S ENDIF ; PRIVATE
    : no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE
    : cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

    : SieveOfAtkin ( -- u )
    init
    SQRT(limit+2)
    1 DO SQRT(limit+2)
    1 DO I J fi1 test1
    I J fi2 test2
    I J fi3 J I <= +test3
    LOOP
    LOOP
    no-p^2
    cnt ( sz) ;

    : PRIMES CR #times DEC. ." iterations." TIMER-RESET
    0 #times 0 DO DROP SieveOfAtkin LOOP
    MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

    :ABOUT CR ." Try: SieveOfAtkin . "
    CR ." PRIMES " ;

    NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

    DEPRIVE

    (* End of Source *)

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to mhx@iae.nl on Sun May 22 10:57:03 2022
    In article <4efed8c0-d736-42dd-9627-0913aece17edn@googlegroups.com>,
    Marcel Hendrix <mhx@iae.nl> wrote:
    On Saturday, May 21, 2022 at 4:10:20 AM UTC+2, dxforth wrote:
    On 21/05/2022 11:41, Marcel Hendrix wrote:
    [..]
    (*
    * LANGUAGE : ANS Forth with extensions
    *)

    "ANS Forth" ? Perhaps according to your standards :)

    What is there about "with extensions" that you don't understand?

    (*
    * LANGUAGE : ANS Forth with extensions
    * PROJECT : Forth Environments
    * DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
    * CATEGORY : Example
    * AUTHOR : Marcel Hendrix
    * LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
    *)


    NEEDS -miscutil

    REVISION -sieveofatkin "--- Sieve of Atkin Version 0.02 ---"

    PRIVATES

    #1000 =: #times
    #16384 VALUE limit
    limit SQRT 2+ =: SQRT(limit+2) PRIVATE
    limit SQRT 1+ =: SQRT(limit+1) PRIVATE
    CREATE sieve PRIVATE limit 1+ ALLOT

    : toggle ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
    : f1: ( I J -- ) SQR 4 * SWAP SQR + ; PRIVATE
    : f2: ( I J -- ) SQR 3 * SWAP SQR + ; PRIVATE
    : f3: ( I J -- ) SQR 3 * SWAP SQR - ; PRIVATE

    : SieveOfAtkin ( -- u )
    sieve limit 1+ ERASE
    SQRT(limit+2)
    1 DO SQRT(limit+2)
    1 DO I J f1: >S S limit <= S #12 MOD DUP 1 = SWAP 5 = OR AND IF S
    toggle ENDIF -S
    I J f2: >S S limit <= S #12 MOD 7 = AND IF S toggle ENDIF -S
    I J f3: >S J I > S limit <= AND S #12 MOD #11 = AND IF S
    toggle ENDIF -S
    LOOP
    LOOP
    SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0!
    J SQR +LOOP ENDIF LOOP
    2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ( sz) ;

    : PRIMES CR #times DEC. ." iterations." TIMER-RESET
    0 #times 0 DO DROP SieveOfAtkin LOOP
    MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

    :ABOUT CR ." Try: SieveOfAtkin . "
    CR ." PRIMES " ;

    NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

    DEPRIVE

    (* End of Source *)


    FORTH> in
    Creating --- Sieve of Atkin Version 0.02 ---

    Try: SieveOfAtkin .
    PRIMES ok
    FORTH> SieveOfAtkin . 1900 ok
    FORTH> PRIMES
    1000 iterations.
    1900 primes found, 94 ms elapsed. ok

    FORTH> in sieve
    Creating --- SIEVE benchmark Version 1.00 ---
    Redefining `#times`
    Redefining `PRIMES`

    Sieve (100 iterations)
    Compile time Run time Primes
    Par.C 38.19 8.732 1899
    Inmos ICC 83.98 9.578 1899
    3L C 45.25 5.955 1899
    tForth (WS) 1.00 7.244 1899
    tForth (main) 1.00 9.253 1899
    iForth 386/33 0.3 5.767 1899
    iForth 486/50 0.1 1.595 1899
    iForth 486/66 0.1 1.559 1899
    iForth P5/200 0.1 0.195 1899
    iForth PIV/3G 0.1 0.076 1899

    Enter PRIMES to run the benchmark.
    ok
    FORTH> PRIMES
    1000 iterations.
    1899 primes found, 12 ms elapsed. ok
    FORTH>

    Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.
    The standard algorithm needs no division, so it will work for limit >
    2^64, while
    Atkin must be extended for that.

    This the deception for those who inspect the O(n) behaviour.
    Maybe Atkin uses less operations for n to infinity, there is a
    constant C in front, that can count instructions needed for
    a single operation.
    That may mean -- as you have demonstrated -- that over a practical
    range the O(n) doesn't tell the whole story.


    -marcel

    Groetjes Albert
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jali Heinonen@21:1/5 to All on Sun May 22 06:10:19 2022
    sunnuntai 22. toukokuuta 2022 klo 0.02.26 UTC+3 Marcel Hendrix kirjoitti:
    On Saturday, May 21, 2022 at 6:28:54 PM UTC+2, Jali Heinonen wrote:
    perjantai 20. toukokuuta 2022 klo 15.53.26 UTC+3 Marcel Hendrix kirjoitti:
    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.

    -marcel
    Is it really faster than sieve implementation that is wheel factorized to the max?
    Well, so much for CS.

    I adapted both Atkin and the standard Sieve so that I could run them with array sizes between 1k and 10GByte, with the following results:

    Standard Sieve:
    n runtime (ms) primes
    1,000 0 302
    10,000 0 2,261
    100,000 0 17,983
    1,000,000 3 148,932
    10,000,000 29 1,270,606
    100,000,000 962 11,078,936
    1,000,000,000 12695 98,222,286
    10,000,000,000 149099 882,206,715

    Atkin Sieve
    n runtime (ms) primes
    1,000 0 168
    10,000 0 1,229
    100,000 0 9,592
    1,000,000 4 78,498
    10,000,000 60 664,579
    100,000,000 1728 5,761,455
    1,000,000,000 20900 50,847,534
    10,000,000,000 301417 455,052,511

    Atkin is about 4 times slower than the Standard Sieve (50% of the
    results in twice the time). It might slowly increase for even larger
    sieves, but with 20 GBytes there is not much evidence for that yet.
    Maybe somebody with a TB of RAM has more luck.

    The jump in runtime between 0.1GB and 1GB might have something to do
    with the working-set default of a Windows program (I see no evidence
    of disk-activity but ALLOCATE takes noticeably longer).

    Here is the final SieveOfAtkin. I fixed a bug that had to do with Python's logic operators apparently taking shortcuts without being told.
    (*
    * LANGUAGE : ANS Forth with extensions
    * PROJECT : Forth Environments
    * DESCRIPTION : Sieve of Atkin, fastest Sieve O(n)
    * CATEGORY : Example
    * AUTHOR : Marcel Hendrix
    * LAST CHANGE : Saturday, May 21, 2022, 9:54 AM, mhx;
    *)


    NEEDS -miscutil
    REVISION -sieveofatkin "--- Sieve of Atkin Version 0.04 ---"

    PRIVATES

    DOC
    (*
    n runtime (ms) primes
    1,000 0 168
    10,000 0 1,229
    100,000 0 9,592
    1,000,000 4 78,498
    10,000,000 60 664,579
    100,000,000 1728 5,761,455
    1,000,000,000 20900 50,847,534
    10,000,000,000 301417 455,052,511
    *)
    ENDDOC

    DEPTH 0= [IF] CR .( Stack Empty) ABORT [THEN]
    ( #16384 ) =: limit PRIVATE
    #1 =: #times PRIVATE
    limit SQRT 2+ =: SQRT(limit+2) PRIVATE
    limit SQRT 1+ =: SQRT(limit+1) PRIVATE
    limit 1+ ALLOCATE ?ALLOCATE =: sieve PRIVATE
    limit 1+ ALLOCATE ?ALLOCATE =: mods PRIVATE

    : set# ( -- ) limit 2+ 0 DO I #12 MOD mods I + C! LOOP ; PRIVATE set#
    : init ( -- ) sieve limit 1+ ERASE ( set# ) ; PRIVATE
    : flip ( ix -- ) sieve + DUP C@ NOT SWAP C! ; PRIVATE
    : fi1 ( I J -- u ) SQR 4 * SWAP SQR + ; PRIVATE
    : fi2 ( I J -- u ) SQR 3 * SWAP SQR + ; PRIVATE
    : fi3 ( I J -- u ) SQR 3 * SWAP SQR - ; PRIVATE
    : #MOD ( s -- u ) mods + C@ ; PRIVATE
    : test1 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD DUP 1 = SWAP 5 = OR IF S flip ENDIF -S ENDIF ; PRIVATE
    : test2 ( s -- ) DUP limit > IF DROP ELSE >S S #MOD 7 = IF S flip ENDIF -S ENDIF ; PRIVATE
    : +test3 ( s ? -- ) OVER limit > OR IF DROP ELSE >S S #MOD #11 = IF S flip ENDIF -S ENDIF ; PRIVATE
    : no-p^2 ( -- ) SQRT(limit+1) 5 DO sieve I + C@ IF limit 2+ I SQR DO sieve I + C0! J SQR +LOOP ENDIF LOOP ; PRIVATE
    : cnt ( -- u ) 2 ( 2 and 3 ) limit 1+ 5 DO sieve I + C@ 1 AND + LOOP ; PRIVATE

    : SieveOfAtkin ( -- u )
    init
    SQRT(limit+2)
    1 DO SQRT(limit+2)
    1 DO I J fi1 test1
    I J fi2 test2
    I J fi3 J I <= +test3
    LOOP
    LOOP
    no-p^2
    cnt ( sz) ;

    : PRIMES CR #times DEC. ." iterations." TIMER-RESET
    0 #times 0 DO DROP SieveOfAtkin LOOP
    MS? SWAP CR . ." primes found, " DEC. ." ms elapsed." ;

    :ABOUT CR ." Try: SieveOfAtkin . "
    CR ." PRIMES " ;

    NESTING @ 1 = [IF] .ABOUT -sieveofatkin [THEN]

    DEPRIVE

    (* End of Source *)

    -marcel

    I adapted some code using wheel factorization for testing primes from the Inferno operating system sources for the 8th programming language.
    Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0

    I have not done any timings as I currently have only RPI 4B board available for testing. Seems fast though and I would be surprised if it can't match the Sieve of Atkins.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Jali Heinonen on Sun May 22 07:53:44 2022
    On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:
    [..]
    I adapted some code using wheel factorization for testing primes from the Inferno
    operating system sources for the 8th programming language.
    Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0

    9007199254740992 constant bigx
    ...

    I have not done any timings as I currently have only RPI 4B board available for testing.
    Seems fast though and I would be surprised if it can't match the Sieve of Atkins.

    Well, let us know when you get to it. Generating between 10M or 100M primes should provide some relevant timings, 10G would be better though.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to none albert on Sun May 22 08:17:59 2022
    On Sunday, May 22, 2022 at 10:57:09 AM UTC+2, none albert wrote:
    [..]
    Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.
    The standard algorithm needs no division, so it will work for limit >
    2^64, while
    Atkin must be extended for that.
    This the deception for those who inspect the O(n) behaviour.
    Maybe Atkin uses less operations for n to infinity, there is a
    constant C in front, that can count instructions needed for
    a single operation.
    That may mean -- as you have demonstrated -- that over a practical
    range the O(n) doesn't tell the whole story.

    It was a bit of a surprise to me that it was so slow, given that the reference to this method came from a prime thread on the GMP mailing list.

    It is not easy to see which instructions are slow, given that the operation count is indeed minimal. The division operation is replaced by a lookup table, (building it leads to an imperceptible delay when loading the source file)
    as I assume that a fast MOD was possible. Multiplication is nowadays
    1 or 2 cycles, so I guess it must there is an ugly memory access problem,
    a stalled pipeline, or serious data dependencies.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jali Heinonen@21:1/5 to All on Sun May 22 10:11:09 2022
    sunnuntai 22. toukokuuta 2022 klo 17.53.46 UTC+3 Marcel Hendrix kirjoitti:
    On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:
    [..]
    I adapted some code using wheel factorization for testing primes from the Inferno
    operating system sources for the 8th programming language.
    Available here: https://www.dropbox.com/s/96c58nykffo0wuu/wheel.8th?dl=0
    9007199254740992 constant bigx
    ...
    I have not done any timings as I currently have only RPI 4B board available for testing.
    Seems fast though and I would be surprised if it can't match the Sieve of Atkins.
    Well, let us know when you get to it. Generating between 10M or 100M primes should provide some relevant timings, 10G would be better though.

    -marcel

    I tested with generating first million primes on my RPI 4B and redirecting output into text file. Resulting time is over eight times faster than the time mentioned in your source listing and that time includes program startup and writing primes into file:

    root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt

    real 0m57.158s
    user 0m45.213s
    sys 0m11.268s

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Sun May 22 17:05:55 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    Multiplication is nowadays
    1 or 2 cycles

    More like 4.

    so I guess it must there is an ugly memory access problem,

    Numbers from a Skylake:

    [~/forth:130297] perf stat -e instructions -e cycles -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

    Try: SieveOfAtkin .
    PRIMES
    1 iterations.
    50847534 primes found, 19671 ms elapsed.

    [~/forth:130300] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "1000000000 include $HOME/forth/sieveofatkin.4th primes bye"

    Try: SieveOfAtkin .
    PRIMES
    1 iterations.
    50847534 primes found, 19662 ms elapsed.

    Performance counter stats for 'iforth 1000000000 include /home/anton/forth/sieveofatkin.4th primes bye':

    91673986439 instructions # 0.82 insn per cycle
    111163950818 cycles
    23659587244 L1-dcache-loads
    1907344675 L1-dcache-load-misses # 8.06% of all L1-dcache accesses
    1340041914 LLC-load-misses

    27.797359314 seconds time elapsed

    27.607640000 seconds user
    0.184024000 seconds sys

    So 91 instructions, 111 cycles and 1.34 main memory accesses per
    number under consideration; yes, it could be all these cache misses
    (actually, with this many cache misses, it could be quite a bit worse;
    there seems to be some memory-level parallelism going on). Have you
    considered doing it piecewise, in cache-sized pieces? That made
    Erathostenes quite a bit faster.

    I also notice that you apparently don't measure the time for erasing
    the memory.

    Let's see about scalability:

    [~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"

    Try: SieveOfAtkin .
    PRIMES
    1 iterations.
    5761455 primes found, 1900 ms elapsed.

    Performance counter stats for 'iforth 100000000 include /home/anton/forth/sieveofatkin.4th primes bye':

    9516883875 instructions # 0.85 insn per cycle
    11136639151 cycles
    2477917529 L1-dcache-loads
    197185830 L1-dcache-load-misses # 7.96% of all L1-dcache accesses
    119105185 LLC-load-misses

    2.791729605 seconds time elapsed

    2.754480000 seconds user
    0.032028000 seconds sys

    10 times smaller, 10 times less time, just as O(n) predicts.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: http://www.forth200x.org/forth200x.html
    EuroForth 2021: https://euro.theforth.net/2021

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Anton Ertl on Sun May 22 11:58:45 2022
    On Sunday, May 22, 2022 at 7:43:01 PM UTC+2, Anton Ertl wrote:
    [..]
    So 91 instructions, 111 cycles and 1.34 main memory accesses per
    number under consideration; yes, it could be all these cache misses (actually, with this many cache misses, it could be quite a bit worse;
    there seems to be some memory-level parallelism going on). Have you considered doing it piecewise, in cache-sized pieces? That made
    Erathostenes quite a bit faster.

    That would be real work, I would need to sieve in batches. Maybe
    I don't understand what you mean.

    I also notice that you apparently don't measure the time for erasing
    the memory.

    That is included, but I don't count the time to build the MOD lookup.

    Let's see about scalability:

    [~/forth:130301] perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-load-misses iforth "100000000 include $HOME/forth/sieveofatkin.4th primes bye"

    Try: SieveOfAtkin .
    PRIMES
    1 iterations.
    5761455 primes found, 1900 ms elapsed.
    [..]
    10 times smaller, 10 times less time, just as O(n) predicts.

    On my AMD 5800X the interesting gap is between 10,000,000
    and 100,000,000, but thanks anyway.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Sun May 22 18:21:24 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    Here is the final SieveOfAtkin.

    I have made it portable to Gforth, lxf, sf, vfx64, and posted the
    result on
    <https://www.complang.tuwien.ac.at/forth/programs/sieveofatkin.fs>.
    Basically all things that I changed, except for the timing code, are gratuitious incompatibilities.

    Results on a Ryzen 5800X (LLC-dcache-load-miss not supported) for n=1e8:

    cyc inst load L1 miss
    gforth-fast 33,784,978,330 50,096,698,991 14,716,155,398 157,613,972
    iforth 8,949,528,121 9,643,635,252 1,459,076,700 163,122,475
    lxf 17,879,785,896 15,849,776,266 4,578,122,141 157,093,923
    sf 22,653,189,337 20,410,579,910 6,135,891,074 158,402,472
    vfx64 8,896,377,480 9,533,502,029 1,212,120,001 155,895,135

    Cache misses don't seem to limit the performance, otherwise the cycle differences between the Forth systems would be much smaller.

    Command lines:
    LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses gforth-fast -e "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"
    LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses iforth "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"
    LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses lxf "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"
    LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses sf "include /nfs/nfstmp/anton/SwiftForth-3.11.0/lib/options/fpmath.f 100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes
    bye"
    LC_NUMERIC=en_US.utf8 perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses vfx64 "100000000 include /nfs/unsafe/httpd/ftp/pub/forth/programs/sieveofatkin.fs primes bye"

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: http://www.forth200x.org/forth200x.html
    EuroForth 2021: https://euro.theforth.net/2021

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Jali Heinonen on Sun May 22 11:39:11 2022
    On Sunday, May 22, 2022 at 7:11:10 PM UTC+2, Jali Heinonen wrote:
    sunnuntai 22. toukokuuta 2022 klo 17.53.46 UTC+3 Marcel Hendrix kirjoitti:
    On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:
    [..]
    I tested with generating first million primes on my RPI 4B and redirecting output
    into text file. Resulting time is over eight times faster than the time mentioned in
    your source listing and that time includes program startup and writing primes into file:

    root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt

    real 0m57.158s
    user 0m45.213s
    sys 0m11.268s

    That is 57 *seconds*, with n=15,485,863, and finds 1,000,000 primes?
    Note that Forth's Atkin (above in the thread) does that in ~120 *milliseconds*, on an AMD Ryzen 7 5800X. According to Geekbench V, single-core 64bit, the expected difference with an RPI 4B is about 8x. Something must be wrong.

    It would be better to compare to the Standard Sieve?

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jali Heinonen@21:1/5 to All on Sun May 22 13:07:39 2022
    perjantai 20. toukokuuta 2022 klo 18.48.44 UTC+3 Jan Coombs kirjoitti:
    On Fri, 20 May 2022 05:53:24 -0700 (PDT)
    Marcel Hendrix <m...@iae.nl> wrote:

    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.
    The explanation is good, thanks. This seems to work, how's your python?

    This is the optimized implementation proposed by Zsolt KOVACS:
    from: https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python
    Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:

    [2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

    There should 50 primes, not 13 primes?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Jali Heinonen on Sun May 22 12:46:42 2022
    On Sunday, May 22, 2022 at 9:19:57 PM UTC+2, Jali Heinonen wrote:
    [..]
    Can your Forth version really find all the 1,000,000 prime numbers between number
    range of 0 - 15,485,863, output all the prime numbers as formatted line of text and
    redirect output into text file in about 120 milliseconds? Interesting as running empty
    program on my RPI 4B takes about 177 milliseconds.

    I don't know about "and redirect output into text file", I would write-file to stdout if that
    is an allowed scenario? Bare numbers are independently duplicated by Anton
    in this thread for several Forth systems, including iForth.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jali Heinonen@21:1/5 to All on Sun May 22 12:19:56 2022
    sunnuntai 22. toukokuuta 2022 klo 21.39.12 UTC+3 Marcel Hendrix kirjoitti:
    On Sunday, May 22, 2022 at 7:11:10 PM UTC+2, Jali Heinonen wrote:
    sunnuntai 22. toukokuuta 2022 klo 17.53.46 UTC+3 Marcel Hendrix kirjoitti:
    On Sunday, May 22, 2022 at 3:10:20 PM UTC+2, Jali Heinonen wrote:
    [..]
    I tested with generating first million primes on my RPI 4B and redirecting output
    into text file. Resulting time is over eight times faster than the time mentioned in
    your source listing and that time includes program startup and writing primes into file:

    root@DietPi:~# time /opt/8th/bin/rpi64/8th wheel.8th 0 15485863 >temp.txt

    real 0m57.158s
    user 0m45.213s
    sys 0m11.268s
    That is 57 *seconds*, with n=15,485,863, and finds 1,000,000 primes?
    Note that Forth's Atkin (above in the thread) does that in ~120 *milliseconds*,
    on an AMD Ryzen 7 5800X. According to Geekbench V, single-core 64bit, the expected difference with an RPI 4B is about 8x. Something must be wrong.

    It would be better to compare to the Standard Sieve?

    -marcel

    Can your Forth version really find all the 1,000,000 prime numbers between number range of 0 - 15,485,863, output all the prime numbers as formatted line of text and redirect output into text file in about 120 milliseconds? Interesting as running empty
    program on my RPI 4B takes about 177 milliseconds.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Sun May 22 19:28:41 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    I adapted both Atkin and the standard Sieve so that I could run them with >array sizes between 1k and 10GByte, with the following results:

    I have now tried to run it with 100GB on a machine with 128GB RAM that
    is not running anything significant. However, the out-of-memory
    killer reaped it after a while, despite "free" showing 125GB free; ok,
    100GB for the sieve and 100GB for MODS does not fit. Does MODS buy
    anything?

    I prepared a version without MODS <http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,
    and it is actually faster:

    On the 5800X with n=1e8, using vfx64:

    MODS no MODS
    8,896,377,480 4,943,212,037 cycles
    9,533,502,029 8,432,685,132 instructions
    1,212,120,001 998,289,650 L1-dcache-loads
    155,895,135 43,060,075 L1-dcache-load-misses

    Ok, let's see about the 100G. Our 5800X currently has only 96GB, so I
    used a Rocket Lake machine (Xeon W-1370P), where perf unfortunately
    does not report loads or misses. I used vfx64, because the version of
    iforth installed apparently does not like to allocate 10GB.

    cycles instructions n primes
    4,958,261,806 8,431,214,174 1e8 5761455
    56,409,350,162 84,181,549,986 1e9 50847534
    597,442,687,159 891,687,473,921 1e10 455052511
    6,068,869,349,719 8,916,633,565,847 1e11 4118054813

    The number of instructions is superlinear between n=1e9 and 1e10, but
    linear for 1e8->1e9 and 1e10->1e11. Hmm. Cycles rise slightly
    superlinearly, possibly because caches help less and less.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: http://www.forth200x.org/forth200x.html
    EuroForth 2021: https://euro.theforth.net/2021

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Jali Heinonen on Sun May 22 13:32:51 2022
    On Sunday, May 22, 2022 at 10:07:41 PM UTC+2, Jali Heinonen wrote:
    [..]
    https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python
    Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:

    [2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

    There should 50 primes, not 13 primes?

    From primes.txt:
    2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83
    89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167
    173 179 181 191 193 197 199 211 223 227 229

    I guess that will up the runtime a few notches :--)

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Sun May 22 20:39:45 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    On Sunday, May 22, 2022 at 7:43:01 PM UTC+2, Anton Ertl wrote:
    [..]
    So 91 instructions, 111 cycles and 1.34 main memory accesses per
    number under consideration; yes, it could be all these cache misses
    (actually, with this many cache misses, it could be quite a bit worse;
    there seems to be some memory-level parallelism going on). Have you
    considered doing it piecewise, in cache-sized pieces? That made
    Erathostenes quite a bit faster.

    That would be real work,

    Yes, it's more work for Erathostenes, don't know how amenable Atkin is
    for that.

    On my AMD 5800X the interesting gap is between 10,000,000
    and 100,000,000, but thanks anyway.

    For time results you will see cache effects, and complexity theory
    usually does not model that. The instruction counts are not affected
    by cache misses, and I would have expected the theoretically predicted
    linear behaviour there, but as reported in another posting, I saw an
    anomaly.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: http://www.forth200x.org/forth200x.html
    EuroForth 2021: https://euro.theforth.net/2021

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Jali Heinonen on Sun May 22 13:24:18 2022
    On Sunday, May 22, 2022 at 9:19:57 PM UTC+2, Jali Heinonen wrote:
    [..]
    Can your Forth version really find all the 1,000,000 prime numbers between number
    range of 0 - 15,485,863, output all the prime numbers as formatted line of text and
    redirect output into text file in about 120 milliseconds? Interesting as running empty
    program on my RPI 4B takes about 177 milliseconds.

    Outputting a million primes is indeed slower than the original ~120 ms (about 25 times).
    Other Forths can probably do this much faster as iForth is made to flush the output file.

    FORTH> in sieveofatkin
    Creating --- Sieve of Atkin Version 0.04 ---

    Try: SieveOfAtkin .
    PRIMES ok
    FORTH> PRIMES
    1 iterations.
    1000000 primes found, 3204 ms elapsed. ok
    FORTH>

    Here's how:
    : write ( -- )
    S" primes.txt" R/W BIN CREATE-FILE ?FILE out-FD !
    FILE-I//O SET-IODEVICE
    limit 1+ 2 DO sieve I + C@ IF I . ENDIF LOOP
    out-FD @ CLOSE-FILE -1 out-FD ! ?FILE
    STD-I//O SET-IODEVICE ;

    .. where "write" is made the last word of SieveOfAtkin

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Marcel Hendrix on Mon May 23 06:36:24 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    On Sunday, May 22, 2022 at 10:57:09 AM UTC+2, none albert wrote:
    [..]
    Out of the box, Atkin is 94 / 12 ~ 8x slower than the standard Sieve algorithm.
    The standard algorithm needs no division, so it will work for limit >
    2^64, while
    Atkin must be extended for that.
    This the deception for those who inspect the O(n) behaviour.
    Maybe Atkin uses less operations for n to infinity, there is a
    constant C in front, that can count instructions needed for
    a single operation.
    That may mean -- as you have demonstrated -- that over a practical
    range the O(n) doesn't tell the whole story.

    It was a bit of a surprise to me that it was so slow, given that the reference >to this method came from a prime thread on the GMP mailing list.

    It is not easy to see which instructions are slow, given that the operation >count is indeed minimal.

    I looked up <https://en.wikipedia.org/wiki/Sieve_of_Atkin> and it
    discusses performance relative to Erathostenes quite a bit. I won't
    reproduce that here, but I note that Erathostenes is O(N log log N)
    compared to O(N) for Atkin. log log N grows very slowly:

    1e3 fln fln f. 1.93264473391607 ok
    1e6 fln fln f. 2.62579191447601 ok
    1e9 fln fln f. 3.03125702258418 ok
    1e12 fln fln f. 3.31893909503596 ok
    1e15 fln fln f. 3.54208264635017 ok

    That can be easily compensated for practically relevant N by a better
    constant factor for Erathostenes, and the page discusses various
    reasons for such a better constant factors for various implementation variations.

    As for "a prime thread on the GMP mailing list", even
    usually-competent people are not immune to fallacies, and giving more
    weight to mathematically proven results than they actually deserve,
    and interpreting things into mathematical results that they actually
    don't state are common fallacies.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: http://www.forth200x.org/forth200x.html
    EuroForth 2021: https://euro.theforth.net/2021

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Anton Ertl on Mon May 23 01:11:22 2022
    On Monday, May 23, 2022 at 9:02:11 AM UTC+2, Anton Ertl wrote:
    I won't
    reproduce that here, but I note that Erathostenes is O(N log log N)
    compared to O(N) for Atkin. log log N grows very slowly:

    1e3 fln fln f. 1.93264473391607 ok
    1e6 fln fln f. 2.62579191447601 ok
    1e9 fln fln f. 3.03125702258418 ok
    1e12 fln fln f. 3.31893909503596 ok
    1e15 fln fln f. 3.54208264635017 ok

    I learned a lot more from your level-headed experiment than from the
    Wikipedia page...

    Next, and probably last, step will be to limit memory use.
    The MOD hack should obviously go (as you've shown) and I can
    make sieve a bit array. That will indicate if is advantageous to compress
    more aggressively.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jali Heinonen@21:1/5 to All on Mon May 23 02:32:13 2022
    sunnuntai 22. toukokuuta 2022 klo 23.24.19 UTC+3 Marcel Hendrix kirjoitti:
    On Sunday, May 22, 2022 at 9:19:57 PM UTC+2, Jali Heinonen wrote:
    [..]
    Can your Forth version really find all the 1,000,000 prime numbers between number
    range of 0 - 15,485,863, output all the prime numbers as formatted line of text and
    redirect output into text file in about 120 milliseconds? Interesting as running empty
    program on my RPI 4B takes about 177 milliseconds.
    Outputting a million primes is indeed slower than the original ~120 ms (about 25 times).
    Other Forths can probably do this much faster as iForth is made to flush the output file.

    FORTH> in sieveofatkin
    Creating --- Sieve of Atkin Version 0.04 ---
    Try: SieveOfAtkin .
    PRIMES ok
    FORTH> PRIMES
    1 iterations.
    1000000 primes found, 3204 ms elapsed. ok
    FORTH>

    Here's how:
    : write ( -- )
    S" primes.txt" R/W BIN CREATE-FILE ?FILE out-FD !
    FILE-I//O SET-IODEVICE
    limit 1+ 2 DO sieve I + C@ IF I . ENDIF LOOP
    out-FD @ CLOSE-FILE -1 out-FD ! ?FILE
    STD-I//O SET-IODEVICE ;

    .. where "write" is made the last word of SieveOfAtkin

    -marcel

    Thanks, does your SieveOfAtkin use one huge buffer for primes? Version I posted uses currently just 1000 element array and needs a shit loads of iterations to find and output millions of primes. It also needs to zero out the primes array on every
    iteration. I should probably increase the buffer size to eliminate the overhead. Up side is that it uses currently very little memory.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to mhx@iae.nl on Mon May 23 12:08:46 2022
    In article <74043ddb-6fd4-4389-8af1-778e71f1dd1bn@googlegroups.com>,
    Marcel Hendrix <mhx@iae.nl> wrote:
    <SNIP>
    Here is the final SieveOfAtkin. I fixed a bug that had to do with Python's >logic operators apparently taking shortcuts without being told.

    If you mean this: https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not Short circuit operation in Python is a feature, not a bug.

    This is actually widespread e.g. this is c-idiom:
    if ( pc || *pc ) { ... }
    Short circuit evaluation prevents accessing address NULL if pc
    has not been filled in.

    <SNIP>

    -marcel

    Groetjes Albert
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jan Coombs@21:1/5 to Jali Heinonen on Mon May 23 11:54:59 2022
    On Sun, 22 May 2022 13:07:39 -0700 (PDT)
    Jali Heinonen <jali.heinonen@gmail.com> wrote:

    perjantai 20. toukokuuta 2022 klo 18.48.44 UTC+3 Jan Coombs kirjoitti:
    On Fri, 20 May 2022 05:53:24 -0700 (PDT)
    Marcel Hendrix <m...@iae.nl> wrote:

    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.
    The explanation is good, thanks. This seems to work, how's your python?

    This is the optimized implementation proposed by Zsolt KOVACS:
    from: https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python
    Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:

    [2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

    There should 50 primes, not 13 primes?

    jan@mypc:$ python SieveOfAtkin.py 229
    [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61,
    67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137,
    139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199,
    211, 223, 227]

    Email me for source, it was beyond my ability/patience to change it.

    Jan Coombs
    --

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From none) (albert@21:1/5 to Anton Ertl on Mon May 23 12:23:59 2022
    In article <2022May22.212841@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Marcel Hendrix <mhx@iae.nl> writes:
    I adapted both Atkin and the standard Sieve so that I could run them with >>array sizes between 1k and 10GByte, with the following results:

    I have now tried to run it with 100GB on a machine with 128GB RAM that
    is not running anything significant. However, the out-of-memory
    killer reaped it after a while, despite "free" showing 125GB free; ok,
    100GB for the sieve and 100GB for MODS does not fit. Does MODS buy
    anything?

    I prepared a version without MODS ><http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,
    and it is actually faster:

    On the 5800X with n=1e8, using vfx64:

    MODS no MODS
    8,896,377,480 4,943,212,037 cycles
    9,533,502,029 8,432,685,132 instructions
    1,212,120,001 998,289,650 L1-dcache-loads
    155,895,135 43,060,075 L1-dcache-load-misses

    Ok, let's see about the 100G. Our 5800X currently has only 96GB, so I
    used a Rocket Lake machine (Xeon W-1370P), where perf unfortunately
    does not report loads or misses. I used vfx64, because the version of
    iforth installed apparently does not like to allocate 10GB.

    cycles instructions n primes
    4,958,261,806 8,431,214,174 1e8 5761455
    56,409,350,162 84,181,549,986 1e9 50847534
    597,442,687,159 891,687,473,921 1e10 455052511
    6,068,869,349,719 8,916,633,565,847 1e11 4118054813

    The number of instructions is superlinear between n=1e9 and 1e10, but
    linear for 1e8->1e9 and 1e10->1e11. Hmm. Cycles rise slightly >superlinearly, possibly because caches help less and less.

    I've seen code in this thread that made the sieve of Atkin a
    test of the speed of printing code.
    Long ago the original byte benchmark turned into measuring the
    speed of `` 1899 . '' instead of the speed of sieving.

    The difference between quicksort N.logN and bubble sort N^2
    should wave a red flag. That is a big difference.

    However logN missing or adding here or there -- O(n) versus O(N.logn) -- doesn't mask if the algorithm needs a MOD versus pure addition,
    or the effects or accessing main memory.

    The wikipedia Atking page needs a caveat of this sort,
    who are going to add it?


    - anton


    Groetjes Albert
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to albert@cherry. on Mon May 23 11:44:39 2022
    albert@cherry.(none) (albert) writes:
    In article <2022May22.212841@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Marcel Hendrix <mhx@iae.nl> writes:
    I adapted both Atkin and the standard Sieve so that I could run them with >>>array sizes between 1k and 10GByte, with the following results:

    I have now tried to run it with 100GB on a machine with 128GB RAM that
    is not running anything significant. However, the out-of-memory
    killer reaped it after a while, despite "free" showing 125GB free; ok, >>100GB for the sieve and 100GB for MODS does not fit. Does MODS buy >>anything?

    I prepared a version without MODS >><http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>, >>and it is actually faster:

    On the 5800X with n=1e8, using vfx64:

    MODS no MODS
    8,896,377,480 4,943,212,037 cycles
    9,533,502,029 8,432,685,132 instructions
    1,212,120,001 998,289,650 L1-dcache-loads
    155,895,135 43,060,075 L1-dcache-load-misses

    Ok, let's see about the 100G. Our 5800X currently has only 96GB, so I
    used a Rocket Lake machine (Xeon W-1370P), where perf unfortunately
    does not report loads or misses. I used vfx64, because the version of >>iforth installed apparently does not like to allocate 10GB.

    cycles instructions n primes
    4,958,261,806 8,431,214,174 1e8 5761455
    56,409,350,162 84,181,549,986 1e9 50847534
    597,442,687,159 891,687,473,921 1e10 455052511
    6,068,869,349,719 8,916,633,565,847 1e11 4118054813

    The number of instructions is superlinear between n=1e9 and 1e10, but >>linear for 1e8->1e9 and 1e10->1e11. Hmm. Cycles rise slightly >>superlinearly, possibly because caches help less and less.

    I've seen code in this thread that made the sieve of Atkin a
    test of the speed of printing code.

    So?

    Long ago the original byte benchmark turned into measuring the
    speed of `` 1899 . '' instead of the speed of sieving.

    Let's test it on a Ryzen 5800X:

    perf stat -e cycles gforth-fast ../gforth/siev.fs -e ": foo 1000 0 do 1899 . loop ; ..."

    cycles ...
    31203365 bye
    300973318 main bye
    35385551 foo bye

    MAIN runs the sieve (made slightly faster than the Byte version) 1000
    times, without printing the number of primes.

    FOO prints 1899 1000 times.

    As you can see, the printing is a lot faster.

    However logN missing or adding here or there

    ld N is 20 for N=1e6, so it's a lot less than N, but not negligible.

    -- O(n) versus O(N.logn) --

    It's about O(N) vs. O(N log log N), though.

    The wikipedia Atking page needs a caveat of this sort,
    who are going to add it?

    It states repeatedly that Erathostenes is faster and discusses the
    reasons. What makes you think that it needs an additional caveat?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: http://www.forth200x.org/forth200x.html
    EuroForth 2021: https://euro.theforth.net/2021

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Mon May 23 11:05:18 2022
    On Monday, May 23, 2022 at 10:11:23 AM UTC+2, Marcel Hendrix wrote:
    On Monday, May 23, 2022 at 9:02:11 AM UTC+2, Anton Ertl wrote:
    [..]
    Next, and probably last, step will be to limit memory use.
    The MOD hack should obviously go (as you've shown) and I can
    make sieve a bit array. That will indicate if is advantageous to compress more aggressively.

    Using a bit array uncovered a likely reason for the discontinuity in the
    speed results.

    The bit results start off slower than the ones using bytes. The Intel bit manipulation instructions are not very fast -- extracting the bits
    'by hand' might be faster -- but iForth has the words, so why not.

    Contrary to the byte results, from n = 1e3 to 1e8 the runtime increases linearly with n, and for n =1e8 the result with bits is 1237/738 = 1.7 times faster than with bytes. Definitely a worthwhile improvement.

    Note that the AMD 5800X has 512KB L2 cache per core and 32MEG
    L3 in total. It could be that for n > 0.1Gbytes or 12.5Mbits it runs out
    of L3 cache and loses 100 - 300% performance that way. The faster
    but much smaller L1 and L2 caches won't help here.

    n runtime (ms) primes runtime (ms) (bits iso bytes)
    1,000 0 168 0
    10,000 0 1,229 0
    100,000 0 9,592 0
    1,000,000 4 78,498 6
    10,000,000 44 664,579 72
    100,000,000 1237 5,761,455 738
    1,000,000,000 16044 50,847,534 16902
    10,000,000,000 206919 455,052,511 215561

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jali Heinonen@21:1/5 to All on Mon May 23 12:48:34 2022
    maanantai 23. toukokuuta 2022 klo 13.55.03 UTC+3 Jan Coombs kirjoitti:
    On Sun, 22 May 2022 13:07:39 -0700 (PDT)
    Jali Heinonen <jali.h...@gmail.com> wrote:

    perjantai 20. toukokuuta 2022 klo 18.48.44 UTC+3 Jan Coombs kirjoitti:
    On Fri, 20 May 2022 05:53:24 -0700 (PDT)
    Marcel Hendrix <m...@iae.nl> wrote:

    Apparently the fastest possible Sieve (in the limit). https://www.baeldung.com/cs/prime-number-algorithms .

    Unfortunately, the pseudo-code (?) is somewhat ambiguous.
    The explanation is good, thanks. This seems to work, how's your python?

    This is the optimized implementation proposed by Zsolt KOVACS:
    from: https://stackoverflow.com/questions/21783160/sieve-of-atkin-implementation-in-python
    Does this linked Python code really work? With upper limit of 229 it seems to find the following primes:

    [2, 3, 5, 7, 13, 19, 29, 53, 67, 85, 103, 173, 199]

    There should 50 primes, not 13 primes?
    jan@mypc:$ python SieveOfAtkin.py 229
    [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61,
    67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137,
    139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199,
    211, 223, 227]

    Email me for source, it was beyond my ability/patience to change it.

    Jan Coombs
    --

    Thanks, I found a working Python implementation. Seems to be faster than my wheel sieve conversion for the 8th but luckily I can easily beat it's performance by parallelizing my code. It's easy as wheel sieve allows finding primes between specific number
    range, so I can for an example find primes between 10000000000 and 10000011050. It takes just a few milliseconds to find that first prime is 10000000019, last prime is 10000011037 and there are 455 primes. I think Sieve of Atkin can't do that? This makes
    parallelizing easy as each thread can be given different number range to handle and no locking is necessary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to Anton Ertl on Wed May 25 05:15:48 2022
    On Sunday, May 22, 2022 at 10:38:37 PM UTC+2, Anton Ertl wrote:
    I prepared a version without MODS <http://www.complang.tuwien.ac.at/forth/programs/sieveofatkin-nomods.fs>,
    and it is actually faster:
    BTW, clean code - getting it to run under 4tH was a breeze! Thanks for publishing this!

    Hans Bezemer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Sun Jun 5 05:25:07 2022
    On Monday, May 23, 2022 at 8:05:20 PM UTC+2, Marcel Hendrix wrote:
    [..]
    n runtime (ms) primes runtime (ms) (bits iso bytes)
    1,000 0 168 0
    10,000 0 1,229 0
    100,000 0 9,592 0
    1,000,000 4 78,498 6
    10,000,000 44 664,579 72
    100,000,000 1237 5,761,455 738
    1,000,000,000 16044 50,847,534 16902
    10,000,000,000 206919 455,052,511 215561

    Slight update: I had a look in the BIOS and saw that the memory timing
    was not according to the XMP profile. Adjusting that increased the
    run-speed of the Sieve of Atkin by 121%.

    n runtime (ms) primes runtime (ms) (bits) (bits+mem timing)
    1,000 0 168 0 0
    10,000 0 1,229 0 0
    100,000 0 9,592 0 0
    1,000,000 4 78,498 6 6
    10,000,000 44 664,579 72 65
    100,000,000 1237 5,761,455 738 729
    1,000,000,000 16044 50,847,534 16902 13920 ( -121%)
    10,000,000,000 206919 455,052,511 215561 177643 ( -121%)

    BTW, I learned that the segmented sieve of Atkin exists but I have no idea
    if it is faster or not. The Sieve guys love to throw numbers around and bicker about them.

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)