• Shared memory

    From Marcel Hendrix@21:1/5 to All on Thu Dec 29 15:29:31 2022
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existent file name is given, the
    system call used defaults to an anonymous memory buffer, which is exactly what is needed.

    First experiments are successful: I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it turns out that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?
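
For anyone wanting to reproduce the experiment outside iForth, here is a minimal sketch in Python (names and sizes are invented). On Windows, multiprocessing.shared_memory sits on top of the same CreateFileMapping/MapViewOfFile calls, backed by the page file; as far as I know, only names in the Global\ kernel namespace need elevated rights, so a session-local name may sidestep the Administrator requirement.

```python
# A hedged sketch (Python, not iForth; the name "shm_demo" is invented):
# named shared memory between two unrelated processes. On Windows this
# wraps CreateFileMapping/MapViewOfFile over the page file; on Linux,
# POSIX shared memory under /dev/shm.
from multiprocessing import shared_memory

def create_and_write(name, text):
    shm = shared_memory.SharedMemory(name=name, create=True, size=4096)
    shm.buf[:len(text)] = text          # publish the text
    return shm                          # keep alive while others attach

def attach_and_read(name, n):
    shm = shared_memory.SharedMemory(name=name)   # attach, don't create
    data = bytes(shm.buf[:n])
    shm.close()
    return data

writer = create_and_write("shm_demo", b"hello from process A")
print(attach_and_read("shm_demo", 20))  # normally done in a second process
writer.close()
writer.unlink()                         # free the name (no-op on Windows)
```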

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to Marcel Hendrix on Thu Dec 29 16:48:13 2022
    On Friday, December 30, 2022 at 12:29:33 AM UTC+1, Marcel Hendrix wrote:
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?

    -marcel
    Maybe play with umask() before opening up shm?
    Like: myMask = umask(0); /* open shm */ umask(myMask);
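
A caveat: umask() is a POSIX notion and will not change the Windows privilege check, but for the record, the pattern sketched above looks like this in Python (the file name is invented):

```python
# Hans's pattern: clear the umask while creating the shared object so
# the requested permission bits survive, then restore the old mask.
import os, tempfile

path = os.path.join(tempfile.gettempdir(), "shm_demo_file")
old_mask = os.umask(0)                  # myMask = umask(0);
try:
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)   # /* open shm */
    os.close(fd)
finally:
    os.umask(old_mask)                  # umask(myMask);

print(oct(os.stat(path).st_mode & 0o777))   # 0o666 on POSIX
os.remove(path)
```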

    Hans Bezemer

  • From Anton Ertl@21:1/5 to Marcel Hendrix on Fri Dec 30 09:42:43 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code.

    Note that, if you want to communicate between the processes by writing
    to shared memory in one process and reading in the other, modern CPUs
    tend to have quite nonintuitive behaviour, and require the programmer
    to jump through some hoops for reliable operation. IA-32 and AMD64
    are somewhat better in that respect than, e.g., ARM, but even they
    have non-intuitive behaviour.

    My suggestion is to encapsulate the workarounds for this behaviour in
    libraries for shared-memory communication (whether between processes
    or between threads of the same process). Bernd Paysan has quite a bit
    of practical experience with threads and shared memory, and has added
    some libraries of this kind to Gforth.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.

    And, have you tried it? Does it work as non-administrator? If it
    does, what's the difference from what you have tried?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Marcel Hendrix@21:1/5 to Anton Ertl on Fri Dec 30 05:24:19 2022
    On Friday, December 30, 2022 at 10:53:01 AM UTC+1, Anton Ertl wrote:
    [..]
    Note that, if you want to communicate between the processes by writing
    to shared memory in one process and reading in the other, modern CPUs
    tend to have quite nonintuitive behaviour, and require the programmer
    to jump through some hoops for reliable operation. IA-32 and AMD64
    are somewhat better in that respect than, e.g., ARM, but even they
    have non-intuitive behaviour.

    (iForth does not yet support ARM.) Your warning is appreciated, because
    I thought that I was done already (apart from setting up a semaphore).

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    And, have you tried it? Does it work as non-administrator? If it
    does, what's the difference from what you have tried?

    There are two steps to it. First, iForth.exe must be started under an Administrator account. That cost me quite a bit of time, but I found
    several one-click solutions for it. Unfortunately, high-privilege
    programs are checked by UAC and require further acknowledgement
    before they can run. It is incredibly complex to skip that automatically without
    editing the Registry. For now I'll live with UAC until shared memory
    proves useful.

    And, have you tried it? Does it work as non-administrator? If it
    does, what's the difference from what you have tried?

    I guess you are asking whether I compiled the original example.
    No, I did not; it was only a rough sketch. I may try that later.

    -marcel

  • From Anton Ertl@21:1/5 to Marcel Hendrix on Fri Dec 30 17:05:24 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    On Friday, December 30, 2022 at 10:53:01 AM UTC+1, Anton Ertl wrote:
    [..]
    Note that, if you want to communicate between the processes by writing
    to shared memory in one process and reading in the other, modern CPUs
    tend to have quite nonintuitive behaviour, and require the programmer
    to jump through some hoops for reliable operation. IA-32 and AMD64
    are somewhat better in that respect than, e.g., ARM, but even they
    have non-intuitive behaviour.

    (iForth does not yet support ARM.) Your warning is appreciated, because
    I thought that I was done already (apart from setting up a semaphore).

    I expect that the semaphore code (from the OS, right?) contains the
    necessary operations such that when you write, then V the semaphore in
    one process, and P for the semaphore in the other process, and then
    read the shared memory in the other process, things will work as
    expected. But such semaphore operations tend to be quite expensive.
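
The write-then-V / P-then-read handshake can be packaged as a tiny channel library, sketched here in Python with OS semaphores standing in for iForth's (the names and the length-prefix layout are invented for the example). The semaphore operations carry the memory barriers, so the reader is guaranteed to see the writer's data:

```python
# One-slot channel over shared memory, guarded by two OS semaphores:
# the writer V's "full" after writing, the reader P's "full" before
# reading -- exactly the ordering described above. Layout: a 4-byte
# little-endian length prefix, then the payload.
from multiprocessing import Process, Semaphore, shared_memory
import struct

class ShmChannel:
    def __init__(self, shm_name, empty, full):
        self.shm_name, self.empty, self.full = shm_name, empty, full

    def send(self, payload):
        self.empty.acquire()                 # P(empty): wait for a free slot
        shm = shared_memory.SharedMemory(name=self.shm_name)
        shm.buf[:4] = struct.pack("<I", len(payload))
        shm.buf[4:4 + len(payload)] = payload
        shm.close()
        self.full.release()                  # V(full): publish the data

    def recv(self):
        self.full.acquire()                  # P(full): wait for data
        shm = shared_memory.SharedMemory(name=self.shm_name)
        n, = struct.unpack("<I", bytes(shm.buf[:4]))
        data = bytes(shm.buf[4:4 + n])
        shm.close()
        self.empty.release()                 # V(empty): slot is free again
        return data

def child(chan):
    chan.send(b"hello from the child process")

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(name="chan_demo", create=True, size=4096)
    chan = ShmChannel("chan_demo", Semaphore(1), Semaphore(0))
    p = Process(target=child, args=(chan,))
    p.start()
    print(chan.recv())      # the text written by the child process
    p.join()
    shm.close()
    shm.unlink()
```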

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From minforth@arcor.de@21:1/5 to Marcel Hendrix on Fri Dec 30 10:14:53 2022
    Marcel Hendrix schrieb am Freitag, 30. Dezember 2022 um 00:29:33 UTC+1:
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?

    Perhaps this helps: https://epdf.tips/multicore-application-programming-for-windows-linux-and-oracle-solaris.html
    see page 225ff

  • From minforth@arcor.de@21:1/5 to minf...@arcor.de on Fri Dec 30 10:18:14 2022
    minf...@arcor.de schrieb am Freitag, 30. Dezember 2022 um 19:14:55 UTC+1:
    Marcel Hendrix schrieb am Freitag, 30. Dezember 2022 um 00:29:33 UTC+1:
    I want to experiment with shared memory between iForth instantiations running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of
    debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable, but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?
    Perhaps this helps: https://epdf.tips/multicore-application-programming-for-windows-linux-and-oracle-solaris.html
    see page 225ff

    p.s. it's the document page, not the page in the epdf online viewer

  • From albert@21:1/5 to mhx@iae.nl on Fri Dec 30 21:24:23 2022
    In article <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com>,
    Marcel Hendrix <mhx@iae.nl> wrote:
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    I have had success going the other direction: starting Forth and then
    forking the process. Naturally the dictionary space is shared (cutting
    waste), along with a piece of common space (Gbytes if need be).
    Each Forth has its own private dictionary space to add definitions
    to, so it is fully functional.
    It is based on cooperation: each Forth is supposed not to mess with
    the others' stacks and other private parts.
    This works on Linux (although I have discovered a defect in the 64-bit forking that I've worked around).
    The same compatible (!) system works on 32-bit Windows, with no need to align
    Windows and Linux around a common API, which would be hard to come by.
    That is the advantage of relying on Forth itself.
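
For readers without ciforth at hand, this fork-based sharing can be tried in a few lines of Python on POSIX (a sketch, not ciforth's actual mechanism): an anonymous shared mapping created before fork() is visible to both parent and child, with no file name and no Administrator involved.

```python
# Anonymous shared mapping + fork: the child writes, the parent reads.
# POSIX-only (os.fork does not exist on Windows).
import mmap, os

shared = mmap.mmap(-1, 4096)    # MAP_SHARED | MAP_ANONYMOUS under the hood
pid = os.fork()
if pid == 0:                    # child: fill the common space
    shared[:5] = b"hello"
    os._exit(0)
os.waitpid(pid, 0)              # parent: wait for the child, then read
print(shared[:5])               # prints b'hello'
shared.close()
```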

    Thanks to the abysmal documentation of the Windows APIs I have
    not managed to run it on 64-bit Windows. Mind you, it is supposed
    to work the same way as on 32-bit Windows. The answer you get
    is that you should use the C++ compiler, not the API.
    (Same with Linux: "you should use the shared libraries, not the
    system calls." Only C++/C compiler writers have the right to
    use system calls.)

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    Being root should have nothing to do with it. You are in for a
    nasty ride.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?

    I had practical motivation to implement this multitasking
    for the parallel Meissel/Hedgehog-inspired ideas of counting
    primes. It worked.

    What programs do you have in mind for this extension?



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge.
    Don't sell the hide of the bear until you shot it.
    Better one bird in the hand than ten in the air.

  • From Marcel Hendrix@21:1/5 to All on Sat Jan 7 10:54:36 2023
    I think I got it. Shared memory is implemented.

    A minor annoyance is that iForth now has to be in the Administrator
    group to run on Windows 11. This means UAC kicks in when the
    program starts. I know how to fix it, but it is not on my priority list.

    Getting it to work was not so difficult after all, but once I applied it to
    iSPICE I found an unexpected twist. When iSPICE is ordered to run
    a parallel job, the command line could not contain certain parameters,
    because these were not transferred from the controlling core to the
    slaves. Here, when #|cpus ( the number of cores allotted to the job )
    is set at 8 on the controller,
    iSPICE> 1 TO #|cpus RUN-PAR
    ran the slaves with #|cpus still 8, not 1. Apparently RUN-PAR is started
    before the command line is fully evaluated.

    Below are some results. I took a simple SPICE simulation file with
    3 nested .STEP loops for a total of 24 tasks.
    Run on LTspice, this takes 363 seconds. Under the same conditions,
    it was run under iSPICE with #|cpus set between 1 and 32.

    iSPICE> .TICKER-INFO
    AMD Ryzen 7 5800X 8-Core Processor

    The best result is about 45 times faster than LTspice.
    The optimum is 12 cores, with a strange outlier at #|cpus = 10.
    An iSPICE task needs about 2 GBytes of memory (here).
    The base memory use was 6 GBytes when I ran the test, so with
    12 cores the job ran out of memory (I have only 32 GBytes here).
    Maybe with 10 cores Windows started making decisions
    regarding swap space or the working set.

    During the test I kept an eye on clock frequency and memory use.
    There was no throttling (5.6 GHz throughout), and maximum
    memory use was about 31 Gbytes. No disk activity detectable (or
    not shown by Windows :--)

    The 8 extra hyperthreads are not very useful for this kind of work.
    Once the 8 real threads are active, the simulation time does not
    really decrease further. Maybe I should stick in more RAM to
    make sure about that, or run it on a workstation with more/less
    cores.

    -marcel

    \ LTspiceXVII vs 17.1.5
    \ Total elapsed time: 363.431 seconds.

    iSPICE> 1 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 49.638 seconds elapsed. ok
    iSPICE> 2 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 25.352 seconds elapsed. ok
    iSPICE> 4 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 13.489 seconds elapsed. ok
    iSPICE> 8 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 8.618 seconds elapsed. ok
    iSPICE> 10 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 11.051 seconds elapsed. ok
    iSPICE> 12 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 7.569 seconds elapsed. ok
    iSPICE> 14 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 7.822 seconds elapsed. ok
    iSPICE> 16 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 8.255 seconds elapsed. ok
    iSPICE> 20 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 9.459 seconds elapsed. ok
    iSPICE> 24 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 8.441 seconds elapsed. ok
    iSPICE> 28 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 10.799 seconds elapsed. ok
    iSPICE> 32 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 12.280 seconds elapsed. ok

    \ About 363/8 = 45x faster than Analog Devices' LTspice.
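
A quick arithmetic check of the posted timings (plain Python; nothing here beyond the numbers above):

```python
# Speedups computed from the posted timings: LTspice 363.431 s versus
# the iSPICE runs per #|cpus setting.
ltspice = 363.431
runs = {1: 49.638, 2: 25.352, 4: 13.489, 8: 8.618, 10: 11.051,
        12: 7.569, 14: 7.822, 16: 8.255, 20: 9.459, 24: 8.441,
        28: 10.799, 32: 12.280}
for cpus, secs in runs.items():
    print(f"{cpus:2d} cpus: {ltspice / secs:5.1f}x vs LTspice, "
          f"{runs[1] / secs:4.1f}x vs 1 cpu")
# The best run (12 cpus, 7.569 s) comes out at about 48x vs LTspice
# and about 6.6x vs the 1-cpu iSPICE run.
```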

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Fri Jan 13 13:14:49 2023
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.

    Now without bugs. ( https://ibb.co/Qd7Xw3g )

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Fri Jan 13 13:45:57 2023
    On Friday, January 13, 2023 at 10:14:51 PM UTC+1, Marcel Hendrix wrote:
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.
    Now without bugs. ( https://ibb.co/Qd7Xw3g )

    Some details:

    iSPICE schematic ( https://ibb.co/MsfXGmw )

    ~~~
    -- iForth netlist (automatically converted from SPICE netlist)
    -- d:\dfwforth\examples\SPICE\ispice\circuits\net_lts\powerplane
    -- powerIII.cir processed by iSPICE on 14:12:52, January 12, 2023
    CIRCUIT
    5 N: ina out p s p2
    3 B: i_V1 i_Vs i_B1

    FCONST: k1 = 0.9999
    FCONST: L11=15mH
    FCONST: L22=15mH r=10
    FCONST: con_1=r
    FCONST: con_0=r

    EXPR: ex_0 -V(p)*I(V1)-V(s)*I(Vs)

    ina GND i_V1 PULSE: V1 ( -20 20 0 10n 10n 0.5ms 1ms )
    out GND con_0 RESS R2
    ina p con_1 RESS R1
    p GND s GND CI XU1 L11={L11} L22={L22} K={k1}
    s out i_Vs 0e VSOURCE Vs
    p2 GND i_B1 ex_0 BVXT B1
    END

    NO-JOB-STORE
    FALSE TO fastaccess?
    .TRAN 0 1s {1s-2ms} 0.1u
    SIMULATE
    WRITES

    ~~~

    -- iForth cmd file
    CLEAR-TASK-DATA
    .STEP param k1 0.99 1 0.0005
    SUBMIT

    ~~~

    -- original SPICE netlist
    * D:\dfwforth\examples\SPICE\ispice\circuits\net_lts\powerplane\powerIII.asc
    V1 ina 0 PULSE(-20 20 0 10n 10n 0.5ms 1ms)
    R2 out 0 {r}
    R1 ina p {r}
    XU1 p 0 s 0 CI L11={L11} L22={L22} K={k1}
    Vs s out 0
    B1 p2 0 V=-V(p)*I(V1)-V(s)*I(Vs)
    .param k1 = 0.9999
    .param L11=15mH
    .param L22=15mH r=10
    .option reltol=0.1m
    .tran 0 1s {1s-2ms} 0.1u
    .step param k1 0.99 1 0.0005
    .meas FORTH p2 @AVG pleak2
    .lib NGSPICE\CI.sub
    * LTspice total elapsed time: 527.32 seconds.
    .backanno
    .end


    -marcel

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Sat Jan 21 05:29:14 2023
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.

    With further testing I noticed another hidden Windows 'feature.'
    When running iForth as an Administrator, drag-and-drop to the iForth
    console and from/to my editor and file manager sometimes did not work.
    I suspected a bug in iForth, but digging around uncovered that this is a well-known Windows feature: a higher-privileged process (here
    iForth) is prevented from accepting drag-and-drop from a lower-
    privileged one (here the file manager). Ok, but there is a nasty twist:
    when iForth starts my editor with the S" xx" SYSTEM command, 'xx'
    apparently becomes higher-privileged too, and as a consequence,
    drag-and-drop no longer works for 'xx' (the started editor).
    This is somewhat unexpected and certainly a nuisance.

    -marcel

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Sat Jan 21 05:54:45 2023
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.

    And now I want more :--)

    It would be really great if the shared memory trick (which uses the system
    page file) worked across the network. Admittedly that is only cosmetic, because for my current purpose I could also use a shared file with a file-mapping view
    (mmap the file into the iForth virtual address space). With a plain file I would have to
    rewrite my array accesses as file operations, which is a drag. Neither Windows nor Linux appear to directly support shared memory between networked
    computers.

    Is there a Forth library with RDMA (a transparent protocol built into many network
    adapters)? If it existed I could buy a refurbished HP840 workstation and *really*
    get going (such workstations have 44 cores/88 threads and cost a mere
    2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).

    -marcel

  • From Anton Ertl@21:1/5 to Marcel Hendrix on Sat Jan 21 15:14:46 2023
    Marcel Hendrix <mhx@iae.nl> writes:
    Neither Windows
    nor Linux appear to directly support shared memory between networked computers.

    If you can live with its performance characteristics (and probably
    lack of coherence), how about mmapping an NFS-mounted file (other
    distributed file systems may be better for that purpose, though).

    Otherwise, I think there are good reasons for that lack of support.
    The latency is long, and coherence is a problem. RDMA may solve the
    coherence problem and reduce the latency, but it's still long.
    Therefore people tend to use message passing rather than shared memory
    across the network. Interestingly, in the Safe Forth concept I
    suggested avoiding shared memory and communicating between threads (or processes) with messages, even on the same machine (where shared
    memory is easy and may be cheap).
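
For illustration, the message-passing alternative in its smallest form, sketched in Python over a loopback socket (the host, port choice, and the tiny "ack" protocol are invented for the demo):

```python
# Message passing instead of shared memory: a request/reply exchange
# over loopback TCP. Half-closing the write side marks end-of-message,
# so both reads are deterministic.
import socket, threading

def server(listener):
    conn, _ = listener.accept()
    msg = b""
    while chunk := conn.recv(1024):      # read until the client half-closes
        msg += chunk
    conn.sendall(b"ack:" + msg)
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
listener.listen(1)
t = threading.Thread(target=server, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"params")
client.shutdown(socket.SHUT_WR)          # signal end-of-message
reply = b""
while chunk := client.recv(1024):        # read until the server closes
    reply += chunk
print(reply)                             # prints b'ack:params'
client.close(); t.join(); listener.close()
```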

    Is there a Forth library with RDMA (a transparent protocol build into many network
    adapters)?

    Not that I have heard of, but if you want one, you are in a good
    position to work on one.

    If it existed I could buy a refurbished HP840 workstation and *really*
    get going (such workstations have 44 cores/88 threads and cost a mere
    2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).

    Makes you wonder what's wrong with them:-)

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Paul Rubin@21:1/5 to Marcel Hendrix on Sat Jan 21 08:16:54 2023
    Marcel Hendrix <mhx@iae.nl> writes:
    Is there a Forth library with RDMA (a transparent protocol build into
    many network adapters)? If it existed I could buy a refurbished HP840 workstation and *really* get going (such workstations have 44 cores/88 threads and cost a mere 2000 Euros, 15 - 20k new, refurbished RDMA
    nic's are 20 Euros...).

    Unless you had a bunch of those workstations networked together, why
    would you need RDMA, assuming your Forth program is running on the
    workstation?

    I see one here for 1000 USD, with 44 cores and 128GB ram:

    https://www.ebay.com/itm/175576911219

    That is really impressive. Anton asks what is wrong with them.
    Obviously they are old and power hungry, but less so than it seems:

    https://www.intel.com/content/www/us/en/products/sku/91317/intel-xeon-processor-e52699-v4-55m-cache-2-20-ghz/specifications.html

    They use 14nm lithography and have 2.2GHz base frequency, which is not
    all that fast. They were introduced in 2016. This 44 core system is
    almost definitely slower than a 32 core Threadripper, but might beat a
    16 core Ryzen. On the other hand those will cost more up front,
    especially with the memory figured in. If you are running the
    workstation 24/7 then the newer hardware will probably pay for itself in
    power savings quickly, but if you only run it part of the time it might
    be ok.

    Now I feel a little bit interested but don't have an actual use for such
    a box. Spinning up some Hetzner cloud servers for an occasional compute
    task is pretty cheap.

    Maybe you could implement MPI (does anyone still use that?) for your
    Spice stuff.

  • From Marcel Hendrix@21:1/5 to Anton Ertl on Sat Jan 21 08:27:45 2023
    On Saturday, January 21, 2023 at 4:38:11 PM UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    [..]
    If it existed I could buy a refurbished HP840 workstation and *really*
    get going (such workstations have 44 cores/88 threads and cost a mere
    2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).
    Makes you wonder what's wrong with them:-)

    They come with a 3 year warranty, but I have no idea who dares buy
    that stuff for their business, and how these resellers (there are many)
    can prosper? I'll find out :--)

    -marcel

  • From Marcel Hendrix@21:1/5 to Anton Ertl on Sat Jan 21 08:59:32 2023
    On Saturday, January 21, 2023 at 4:38:11 PM UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    [..]
    If you can live with its performance characteristics (and probably
    lack of coherence), how about mmapping an NFS-mounted file (other
    distributed file systems may be better for that purpose, though).

    Hmm, given the very limited functionality I need, this might be
    perfectly adequate.

    -marcel

  • From Marcel Hendrix@21:1/5 to Paul Rubin on Sat Jan 21 08:51:17 2023
    On Saturday, January 21, 2023 at 5:16:57 PM UTC+1, Paul Rubin wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    [..]
    Unless you had a bunch of those workstations networked together, why
    would you need RDMA, assuming your Forth program is running on the workstation?

    I will put the workstation(s) in the attic, where I can't hear or feel them. My desktop PC dispatches and controls the runs and displays the results.

    This 44 core system is almost definitely slower than a 32 core
    Threadripper, but might beat a 16 core Ryzen.

    That costs 7,500 Euros around here, or 4 refurbished HP boxes...

    It will be more fun than tweaking a game PC with liquid metal
    and nitrogen for 1% higher frame rates.

    -marcel

  • From mhx@21:1/5 to All on Tue Feb 27 23:41:10 2024
    I have been polishing my shared memory application (iSPICE) a bit more.
    The benchmark I previously showed compared running a circuit simulation
    with a variable number of communicating CPUs. Only a minimum amount of data
    is shared (a page with published parameters and achieved results, plus the ready! flags). With this setup I got about a factor of 3 improvement for
    8 CPUs. I hoped to improve this factor a bit with better hardware and maybe some software tweaking.

    What I didn't try until today was checking how fast the circuit simulation
    ran on a single CPU, *not* using the shared memory framework. And indeed,
    that is a problem, in that without shared memory the runtime is *3 times
    less* than with shared memory. In other words, there is no net gain in
    having 8 mem-shared CPUs. As an additional check I started the circuit run
    in 3 separate windows. They all achieved the same speed as the single non-shared run, proving that the hardware (cpu/memory/disk) is amply sufficient to provide an 8 times speed-up.

    I will now start working on Anton's suggestion of a shared file. Or maybe
    I should try this on Linux first; maybe shared memory works better there.

    -marcel

  • From minforth@21:1/5 to All on Wed Feb 28 01:46:14 2024
    Perhaps this is the reason why:

    Windows shared memory is not the same as Linux's; only some things are similar.

    The Unix mmap() API is practically equivalent to the CreateFileMapping/ MapViewOfFile Windows API. Both can map files and/or can create shared (anonymous) maps that are backed by the swap device (if any). As a matter of fact, glibc uses anonymous mmap() to implement malloc() when the requested memory size is sufficiently large.

    The biggest difference is the memory allocation granularity. Linux's is 4K and Windows's is 64K. If it's important to have, say, arbitrary 8K pages mapped to specific 8K destinations, you are stuck on Windows; it just can't be done.

    Another difference: on Linux you can mmap a new page over the top of an existing page, effectively replacing the first mapping. On Windows you can't do this,
    but must instead destroy the entire view and rebuild it in whatever new layout is required. So if the "view" contains 1024 pages and
    1 page changes, then on Linux you can just change that one page. On Windows
    you must drop all 1024 pages and re-map the same 1023 pages plus the one new page.

    IOW, with only minimal data to share, Linux should be faster. A normal file
    will probably do the job already, since it is most probably buffered in memory anyway.
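
The granularity difference is easy to verify from Python, which exposes it as mmap.ALLOCATIONGRANULARITY (a quick sketch; 4K and 64K are the typical Linux and Windows values, not guaranteed for every system):

```python
# mmap view offsets must be multiples of the allocation granularity:
# typically 4096 on Linux, 65536 on Windows -- the constraint behind
# the 8K-page example above.
import mmap, tempfile

gran = mmap.ALLOCATIONGRANULARITY
print(gran)

with tempfile.TemporaryFile() as f:
    f.truncate(4 * gran)
    view = mmap.mmap(f.fileno(), gran, offset=2 * gran)  # aligned offset: OK
    view[:2] = b"ok"
    view.close()
```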

  • From mhx@21:1/5 to All on Wed Feb 28 08:29:28 2024
    This is certainly interesting. Previously I wrote:

    Only a minimum amount of data is shared (a page with published
    parameters and achieved results, plus the ready! flags).

    However, I see now that I asked for 'arbitrary size' in the system
    call. Combined with a locked address, this could cause Windows to
    swap a huge amount of memory on accesses, explaining the slow
    execution.

    I will have to spend more time reading the documentation after all.

    Thanks a lot everybody, for the helpful comments!

    -marcel

  • From albert@spenarnc.xs4all.nl@21:1/5 to mhx on Wed Feb 28 11:40:00 2024
    In article <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    I have been polishing my shared memory application (iSPICE) a bit more.
    The benchmark I previously showed compared running a circuit simulation
    with a variable number of communicating CPUs. Only a minimum amount of data is shared (a page with published parameters and achieved results, plus the ready! flags). With this setup I got about a factor of 3 improvement for
    8 CPUs. I hoped to improve this factor a bit with better hardware and maybe some software tweaking.

    What I didn't try until today was checking how fast the circuit simulation ran on a single CPU, *not* using the shared memory framework. And indeed, that is a problem, in that without shared memory the runtime is *3 times less* than with shared memory. In other words, there is no net gain in
    having 8 mem-shared cpu's. As a additional check I started the circuit run
    in 3 separate windows. They all achieved the same speed as the single run non-shared version, proving that the hardware (cpu/memory/disk) is amply sufficient to provide an 8 times speed-up.

    I will now start working on Anton's suggesting of a shared file. Or maybe
    I should try this on Linux first, maybe shared memory works better there.

    I simply use the clone system call on linux ( NR number is 56 for 64 bits)

    ( THREAD-PET KILL-PET PAUSE-PET ) CF: ?LI \ B5dec2
    "CTA" WANTED "-syscalls-" WANTED HEX
    \ Exit a thread. Indeed this is exit().
    : EXIT-PET 0 _ _ __NR_exit XOS ;
    \ Do a preemptive pause. ( abuse MS )
    : PAUSE-PET 1 MS ;
    \ Create a thread with dictionary SPACE. Execute XT in thread.
    : THREAD-PET ALLOT CTA CREATE RSP@ SWAP RSP! R0 @ S0 @
    ROT RSP! 2 CELLS - ( DSP) , ( TASK) , ( pid) 0 ,
    DOES> DUP @ >R SWAP OVER CELL+ @ R@ 2! ( clone S: tp,xt)
    100 R> _ __NR_clone XOS DUP IF
    ( Mother) DUP ?ERRUR SWAP 2 CELLS + ! ELSE
    ( Child) DROP RSP! CATCH DUP IF ERROR THEN EXIT-PET THEN ;
    \ Kill a THREAD-PET , preemptively. Throw errors.
    : KILL-PET >BODY 2 CELLS + @ 9 _ __NR_kill XOS ?ERRUR ;
    DECIMAL

    The idea is
    1000 ( dictionary space ) THREAD-PET extra

    Now you run an xt in it as follows:
    xt extra
    The xt runs until it does an EXIT-PET, or is killed by a KILL-PET.

    In r10par.frt it runs 41 s on one processor and 27 s on two
    processors for 10^12. This was more a demonstration of parallel
    processing; the communication and work-load balancing kill the
    advantages for more processors.

    (This was prime counting)
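    The two-processor prime-counting split can be sketched outside Forth.
    Below is a hedged Python analogue (fork stands in for Albert's
    clone-based threads, a shared int64 array for the result cells; the
    naive counter and the limit are illustrative, not r10par.frt):

    ```python
    import multiprocessing as mp

    def count_primes(lo, hi):
        # Naive trial division; the point here is the work split, not speed.
        def is_prime(n):
            if n < 2:
                return False
            d = 2
            while d * d <= n:
                if n % d == 0:
                    return False
                d += 1
            return True
        return sum(1 for n in range(lo, hi) if is_prime(n))

    def worker(slot, counts, lo, hi):
        counts[slot] = count_primes(lo, hi)      # publish the partial count

    def demo(limit=10_000, workers=2):
        ctx = mp.get_context("fork")             # plain fork stands in for clone
        counts = ctx.Array("q", workers)         # shared int64 result cells
        step = limit // workers
        procs = []
        for i in range(workers):
            hi = limit if i == workers - 1 else (i + 1) * step
            procs.append(ctx.Process(target=worker, args=(i, counts, i * step, hi)))
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return sum(counts)

    print(demo())   # 1229 primes below 10_000
    ```

    The same shape appears in the Forth version: each pet writes into its
    own cell, so no locking is needed, and the mother only reads after
    all children have exited.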

    Maybe try something simple before jumping into sockets and mapped
    files.

    The CTA word carves out a small dictionary space to be used by the
    new process, plus stacks and user space.
    This is utterly system-dependent, but in the ciforth model it
    is just one screen, and portable across 32/64-bit ARM/x86 Linux/Windows.
    It helps if you have a simple Forth to begin with ;-)
    (CTA is used in cooperative multitasking as well.)


    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

  • From mhx@21:1/5 to All on Wed Feb 28 11:11:12 2024
    Maybe try something simple before jumping into sockets and mapped
    files.

    I have tried that way for the past 20 years already, and indeed it
    works fine. However, my simple example shown above needs 24
    threads/processes/cores (whatever), each having about 2 to 4 GB of memory.

    -marcel

  • From albert@spenarnc.xs4all.nl@21:1/5 to mhx on Wed Feb 28 13:25:26 2024
    In article <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    Maybe try something simple before jumping into sockets and mapped
    files.

    I have tried that way for the past 20 years already, and indeed it
    works fine. However, my simple example shown above needs 24
    threads/processes/cores (whatever), each having about 2 to 4 GB of memory.

    I have lost context, can you tell more about the simple example?
    (My provider purges old messages swiftly)

    And what about
    lina -g 96000 lina96G

    lina96G -e
    ..
    WANT UNUSED
    S[ ] OK UNUSED S>D DEC.

    0,000,000,000,000,000,000,000,000,000,100,730,247,992
    S[ ] OK
    I'm sure most Forths can do something similar.
    (This is overcommitting, though not on my HP workstation with 256 GByte of RAM.)



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

  • From mhx@21:1/5 to All on Sat Mar 2 18:14:18 2024
    I have lost context, can you tell more about the simple example?
    (My provider purges old messages swiftly)

    I was in the exploring/debugging phase and have only very recently
    completed the experiments.

    The final results are that with shared memory, on Windows
    11, it is possible to get an almost linear speedup with the
    number of cores in use. The way shared memory is implemented
    on Windows is with a memory-mapped file that uses the OS
    pagefile as backing store. The file is guaranteed not to be
    swapped out under reasonable conditions, and Windows keeps its
    management invisible to users.

    I tried to make the file as small as possible. For this
    iForth benchmark it was 11 int64s (11 * 8 bytes) and 24
    extended floats (24 * 16 bytes), about 1/2 KByte. The file
    is touched very infrequently: just 24 result writes and
    then a loop over the 11 words to see if all CPUs have
    finished (checked at 10 ms intervals). At the moment I have
    no idea what happens with very frequent reads/writes (it is
    not the intended type of use).
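    That layout (a block of int64 done-flags followed by result slots,
    polled at 10 ms intervals) can be sketched portably with a file-backed
    mmap, which mirrors what the Windows mapping does with the pagefile.
    This is a hedged Python analogue, not the iForth code: Python has no
    16-byte extended floats, so doubles stand in, and the per-slot result
    values are made up.

    ```python
    import mmap, os, struct, tempfile, time
    import multiprocessing as mp

    N_FLAGS, N_RESULTS = 11, 24
    FLAG_BYTES = N_FLAGS * 8                # 11 int64 "done" flags
    PAGE = FLAG_BYTES + N_RESULTS * 8       # doubles stand in for extended floats

    def worker(path, slot):
        # One simulation process: map the shared file, write its result,
        # then raise its done flag (result first, flag last).
        with open(path, "r+b") as f:
            m = mmap.mmap(f.fileno(), PAGE)
            struct.pack_into("<d", m, FLAG_BYTES + slot * 8, slot * 1.5)
            struct.pack_into("<q", m, slot * 8, 1)
            m.close()

    def demo(workers=4):
        fd, path = tempfile.mkstemp()
        os.ftruncate(fd, PAGE)              # zero-filled shared page
        os.close(fd)
        try:
            with open(path, "r+b") as f:
                m = mmap.mmap(f.fileno(), PAGE)
                ctx = mp.get_context("fork")
                procs = [ctx.Process(target=worker, args=(path, i))
                         for i in range(workers)]
                for p in procs:
                    p.start()
                # CPU #0: loop over the flags at 10 ms intervals until all done.
                while any(struct.unpack_from("<q", m, i * 8)[0] == 0
                          for i in range(workers)):
                    time.sleep(0.010)
                for p in procs:
                    p.join()
                out = [struct.unpack_from("<d", m, FLAG_BYTES + i * 8)[0]
                       for i in range(workers)]
                m.close()
            return out
        finally:
            os.unlink(path)

    print(demo())   # [0.0, 1.5, 3.0, 4.5]
    ```

    Because each worker owns one flag and one result slot, no locking is
    needed; ordinary store ordering (result before flag) is enough for
    this sketch.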

    [During debugging I was lucky. When setting the number of
    working CPUs interactively, completely wrong results
    were obtained. This happened because #|cpus was defined
    as a VALUE in a configuration file. When changing #|cpus
    from the console, the value in sconfig.frt stayed the
    same (of course), while all the dynamically started cores
    used the on-disk value, not the value I typed in on
    CPU #0. Easy to understand in hindsight, but this type
    of 'black-hole' mistake can take hours to find in a 7000+
    line program. For some reason I just knew that it had to
    be #|cpus that was causing the problem.]

    The benchmark is a circuit file that defines a voltage
    source and a 2-resistor divider, all parameterized.
    These values were swept for a total of 24 different
    circuits. To calculate the result for one of the
    combinations takes 2.277s on a single core with iSPICE,
    or 24 x that value, 54.648s, for all 24 combinations.
    In the benchmark the 24 simulations are spread out over
    11 processes on an 8-core CPU:

    iSPICE> .ticker-info
    AMD Ryzen 7 5800X 8-Core Processor
    TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
    Do: < n TO PROCESSOR-CLOCK RECALIBRATE >

    The aim is to get an 8 times speedup, or more if
    hyperthreads bring something, and do all combinations
    in less than 6.831 seconds. The best I managed is
    7.694s or about 7.67 "cores", which I consider not
    that bad. Here are the details (run 4 times):

    #cpus  time [s]  perf. ratio
      1     49.874      1.46
      2     25.314      2.39
      3     17.391      3.23
      4     13.335      4.11
      5     10.565      5.17
      6      9.468      5.71
      7      8.712      6.22
      8      7.694      7.67
      9      7.260      7.37
     10      7.874      6.72
     11      7.856      6.73  ok

    For your information: Running the same 24 variations
    with LTspice 17.1.15, one of the fastest SPICE
    implementations currently available, takes 382.265
    seconds, almost exactly 7 times slower than the iSPICE
    single-core run. Using 8 cores (LTspice pretends to
    use 16 threads), that ratio becomes 62 times.

    In the above table the performance ratio for a single
    CPU is 1.46 (1.46 times faster than doing the 24
    simulations on a single core *without* shared memory),
    which might seem strange. I think the phenomenon is
    caused by the fact that a single combination takes
    only 2.277s, which may be too short for the processor
    (or Windows) to ramp up the clock frequency. If the
    performance factor is normalized by the timing for
    1 CPU, the maximum speedup decreases to 5.25.
    We'll see what happens on an HPZ840.
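    The 5.25 figure is just the ratio of two table entries, spelled out:

    ```python
    # Normalizing the best performance ratio by the 1-CPU ratio,
    # both taken from the table above.
    best_ratio = 7.67       # 8 CPUs with the shared-memory framework
    one_cpu_ratio = 1.46    # 1 CPU with the shared-memory framework
    print(round(best_ratio / one_cpu_ratio, 2))   # 5.25
    ```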

    -marcel

  • From albert@spenarnc.xs4all.nl@21:1/5 to mhx on Sun Mar 3 12:19:13 2024
    In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    [..]
    The final results are that with shared memory, on Windows
    11, it is possible to get an almost linear speedup with the
    number of cores in use. The way shared memory is implemented
    on Windows is with a memory-mapped file that uses the OS
    pagefile as backup. The file is guaranteed to not be swapped
    out under reasonable conditions, and Windows keeps its
    management invisible for users.

    Linear speedup? That must depend on the program.
    Can I surmise that the context is that you're comparing your
    version/clone iSPICE with LTspice?

    [..]
    For your information: Running the same 24 variations
    with LTspice 17.1.15, one of the fastest SPICE
    implementations currently available, takes 382.265
    seconds, almost exactly 7 times slower than the iSPICE
    single-core run. Using 8 cores (LTspice pretends to
    use 16 threads), that ratio becomes 62 times.

    So LTspice becomes slower by using 8 cores,
    going from 7 times slower to 62 times slower than iSPICE.
    There must be a mistake here.

    [..]
    We'll see what happens on an HPZ840.

    You are going to run Windows 11 on the HP workstation?
    I'm going to install a Linux version, for I want to
    experiment with CUDA.


    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

  • From mhx@21:1/5 to albert@spenarnc.xs4all.nl on Wed Mar 6 10:54:55 2024
    albert@spenarnc.xs4all.nl wrote:

    In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    I have lost context, can you tell more about the simple example?
    [..]
    The final results are that with shared memory, on Windows
    11, it is possible to get an almost linear speedup with the
    number of cores in use.
    [..]
    Linear speedup? That must depend on the program.
    Can I surmise that the context is that you're comparing your
    version/clone iSpice with LTSpice.

    The example is *not* about trying to speed up programs
    by adding threads to work on parts that can be parallelized.
    A circuit simulator is used as the example here. Circuits
    contain on average about 30% of operations that can be done
    in parallel, so a fine-grained threaded approach with an
    infinite number of threads can at most give a 30% speedup.
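    That ceiling is Amdahl's law; a quick check of the 30% figure shows
    the bound tends to 1/0.7, about 1.43x, i.e. at best roughly 30% less
    runtime no matter how many threads are added:

    ```python
    def amdahl(p, n):
        # Maximum speedup with fraction p of the work parallelizable
        # over n workers: serial part (1 - p) plus parallel part p / n.
        return 1.0 / ((1.0 - p) + p / n)

    p = 0.30                       # ~30% of circuit operations parallelize
    for n in (2, 8, 1_000_000):    # the last entry approximates n -> infinity
        print(n, round(amdahl(p, n), 3))
    ```

    This is why the coarse-grained, one-simulation-per-core approach is
    used instead.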

    Most circuit simulation problems cannot be solved with a
    single simulation. In almost every case one wants to re-run
    a job with small variations on the original specification.
    The variations can be on the circuit components themselves,
    on environmental conditions like temperature, humidity, and
    noise, on input sources or output loads, or even on
    parameters of their (digital) control algorithms.
    Between 10 and many thousands of simulations could be
    necessary. At the top level, this problem is trivial to solve
    by editing the input netlist with the necessary changes,
    re-running the simulation, and storing the results in a database.
    When all runs are done, the data is evaluated by querying.

    In practice, it is difficult to keep the administration
    straight if the above is done by hand. What I am looking for
    is a simple way to specify variations, create a list
    of all the simulations needed, then distribute the tasks
    to as many cpu cores as are available (locally, on the network,
    or in the Cloud), combine the results, and generate reports.

    To do this in Forth, I found it useful to use either shared
    memory, or a shared file. The post is about experiments with
    shared memory (useful when the number of cores is less than
    256 and the main memory requirement is less than 1 TByte.)
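    The specify-variations / distribute / combine workflow described here
    can be sketched with a process pool. This is a hedged illustration,
    not iSPICE: simulate() is a hypothetical stand-in for one circuit
    run (the 2-resistor divider from the benchmark, computed directly),
    and the parameter values are made up.

    ```python
    import multiprocessing as mp

    def simulate(params):
        # Stand-in for one simulation: output of a 2-resistor divider
        # driven by source voltage v.
        v, r1, r2 = params
        return v * r2 / (r1 + r2)

    def sweep(variations, cores):
        # Distribute the variation list over the available cores and
        # collect all results in input order.
        with mp.get_context("fork").Pool(cores) as pool:
            return pool.map(simulate, variations)

    # A small parameterized sweep: 2 source voltages x 3 divider values.
    variations = [(v, r1, 1000.0) for v in (5.0, 10.0)
                                  for r1 in (250.0, 500.0, 1000.0)]
    print(sweep(variations, cores=2))
    ```

    Scaling this up is then a matter of generating the variation list
    from the netlist parameters and storing the returned results.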

    The concrete example is to run N variations of a circuit on
    an 8-core system with 32 GB of memory, with the features I
    describe above. The question was: is it possible to get
    a speedup of 8 when the benchmark runs on an 8-core CPU?

    iSPICE> .ticker-info
    AMD Ryzen 7 5800X 8-Core Processor
    TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
    Do: < n TO PROCESSOR-CLOCK RECALIBRATE >

    The aim is to get an 8 times speedup, or more if
    hyperthreads bring something, and do all combinations
    in less than 6.831 seconds. The best I managed is
    7.694s or about 7.67 "cores", which I consider not
    that bad. Here are the details (run 4 times):

    #cpus  time [s]  perf. ratio
      1     49.874      1.46
      2     25.314      2.39
      3     17.391      3.23
      4     13.335      4.11
      5     10.565      5.17
      6      9.468      5.71
      7      8.712      6.22
      8      7.694      7.67
      9      7.260      7.37
     10      7.874      6.72
     11      7.856      6.73  ok


    For your information: Running the same 24 variations
    with LTspice 17.1.15, one of the fastest SPICE
    implementations currently available, takes 382.265
    seconds, almost exactly 7 times slower than the iSPICE
    single-core run. Using 8 cores (LTspice pretends to
    use 16 threads), that ratio becomes 62 times.

    I realize now that this comparison of iSPICE with LTspice
    can confuse the reader. It does not matter at all for this
    benchmark which SPICE simulator is used.

    So LT spice becomes slower by using 8 cores
    going from 7 times slower to 62 time slower than iSPICE.
    There must be a mistake here.

    There is no mistake. LTspice is 7 times slower than iSPICE for
    the specific type of task used here. Although LTspice has
    a mechanism to run multiple variations, and claims to use
    8 cores / 16 threads, it does not appear to use them as
    efficiently as iSPICE does using shared memory.

    [..]
    We'll see what happens on an HPZ840.

    You are going to run Windows 11 on the HP work station?
    I'm going to install a Linux version, for I want to
    experiment with CUDA.

    I certainly want to see what happens if I run iSPICE on
    my 44-core HPZ840 :--) The fastest way to implement that
    should be to install Windows 10 or 11 on the HP. However,
    if that proves problematic I have no problem using Linux.
    I did not try iSPICE on Linux/WSL2 yet and I probably will
    do that first.

    I also want to experiment with CUDA (BTW, why not OpenCL,
    did you already find arguments against that route?).
    However, that would be to investigate a new way of circuit
    simulation that does not use the standard SPICE algorithms.

    -marcel
