• Shared memory

    From Marcel Hendrix@21:1/5 to All on Thu Dec 29 15:29:31 2022
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existent file name is given, the
    system call used defaults to an anonymous memory buffer, which is exactly what is needed.

    First experiments are successful: I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it turns out that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?
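
For anyone wanting to reproduce the experiment outside iForth, here is a minimal sketch in Python (names and sizes are invented). On Windows, multiprocessing.shared_memory sits on top of the same CreateFileMapping/MapViewOfFile calls, backed by the page file; as far as I know, only names in the Global\ kernel namespace need elevated rights, so a session-local name may sidestep the Administrator requirement.

```python
# A hedged sketch (Python, not iForth; the name "shm_demo" is invented):
# named shared memory between two unrelated processes. On Windows this
# wraps CreateFileMapping/MapViewOfFile over the page file; on Linux,
# POSIX shared memory under /dev/shm.
from multiprocessing import shared_memory

def create_and_write(name, text):
    shm = shared_memory.SharedMemory(name=name, create=True, size=4096)
    shm.buf[:len(text)] = text          # publish the text
    return shm                          # keep alive while others attach

def attach_and_read(name, n):
    shm = shared_memory.SharedMemory(name=name)   # attach, don't create
    data = bytes(shm.buf[:n])
    shm.close()
    return data

writer = create_and_write("shm_demo", b"hello from process A")
print(attach_and_read("shm_demo", 20))  # normally done in a second process
writer.close()
writer.unlink()                         # free the name (no-op on Windows)
```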

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans Bezemer@21:1/5 to Marcel Hendrix on Thu Dec 29 16:48:13 2022
    On Friday, December 30, 2022 at 12:29:33 AM UTC+1, Marcel Hendrix wrote:
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?

    -marcel
    Maybe play with umask() before opening up shm?
    Like: myMask = umask(0); /* open shm */ umask(myMask);
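
A caveat: umask() is a POSIX notion and will not change the Windows privilege check, but for the record, the pattern sketched above looks like this in Python (the file name is invented):

```python
# Hans's pattern: clear the umask while creating the shared object so
# the requested permission bits survive, then restore the old mask.
import os, tempfile

path = os.path.join(tempfile.gettempdir(), "shm_demo_file")
old_mask = os.umask(0)                  # myMask = umask(0);
try:
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)   # /* open shm */
    os.close(fd)
finally:
    os.umask(old_mask)                  # umask(myMask);

print(oct(os.stat(path).st_mode & 0o777))   # 0o666 on POSIX
os.remove(path)
```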

    Hans Bezemer

  • From Anton Ertl@21:1/5 to Marcel Hendrix on Fri Dec 30 09:42:43 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code.

    Note that, if you want to communicate between the processes by writing
    to shared memory in one process and reading in the other, modern CPUs
    tend to have quite nonintuitive behaviour, and require the programmer
    to jump through some hoops for reliable operation. IA-32 and AMD64
    are somewhat better in that respect than, e.g., ARM, but even they
    have non-intuitive behaviour.

    My suggestion is to encapsulate the workarounds for this behaviour in
    libraries for shared-memory communication (whether between processes
    or between threads of the same process). Bernd Paysan has quite a bit
    of practical experience with threads and shared memory, and has added
    some libraries of this kind to Gforth.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.

    And, have you tried it? Does it work as non-administrator? If it
    does, what's the difference from what you have tried?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Marcel Hendrix@21:1/5 to Anton Ertl on Fri Dec 30 05:24:19 2022
    On Friday, December 30, 2022 at 10:53:01 AM UTC+1, Anton Ertl wrote:
    [..]
    Note that, if you want to communicate between the processes by writing
    to shared memory in one process and reading in the other, modern CPUs
    tend to have quite nonintuitive behaviour, and require the programmer
    to jump through some hoops for reliable operation. IA-32 and AMD64
    are somewhat better in that respect than, e.g., ARM, but even they
    have non-intuitive behaviour.

    (iForth does not yet support ARM.) Your warning is appreciated, because
    I thought that I was done already (apart from setting up a semaphore).

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    And, have you tried it? Does it work as non-administrator? If it
    does, what's the difference from what you have tried?

    There are two steps to it. First, iForth.exe must be started under an Administrator account. That cost me quite a bit of time, but I found
    several one-click solutions for it. Unfortunately, high-privilege
    programs are checked by UAC and require further acknowledgement
    before they can run. It is incredibly complex to skip that automatically without
    editing the Registry. For now I'll live with UAC until shared memory
    proves useful.

    And, have you tried it? Does it work as non-administrator? If it
    does, what's the difference from what you have tried?

    I guess you are asking whether I compiled the original example.
    No, I did not; it was only a rough sketch. I may try that later.

    -marcel

  • From Anton Ertl@21:1/5 to Marcel Hendrix on Fri Dec 30 17:05:24 2022
    Marcel Hendrix <mhx@iae.nl> writes:
    On Friday, December 30, 2022 at 10:53:01 AM UTC+1, Anton Ertl wrote:
    [..]
    Note that, if you want to communicate between the processes by writing
    to shared memory in one process and reading in the other, modern CPUs
    tend to have quite nonintuitive behaviour, and require the programmer
    to jump through some hoops for reliable operation. IA-32 and AMD64
    are somewhat better in that respect than, e.g., ARM, but even they
    have non-intuitive behaviour.

    (iForth does not yet support ARM.) Your warning is appreciated, because
    I thought that I was done already (apart from setting up a semaphore).

    I expect that the semaphore code (from the OS, right?) contains the
    necessary operations such that when you write, then V the semaphore in
    one process, and P for the semaphore in the other process, and then
    read the shared memory in the other process, things will work as
    expected. But such semaphore operations tend to be quite expensive.
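
The write-then-V / P-then-read handshake can be packaged as a tiny channel library, sketched here in Python with OS semaphores standing in for iForth's (the names and the length-prefix layout are invented for the example). The semaphore operations carry the memory barriers, so the reader is guaranteed to see the writer's data:

```python
# One-slot channel over shared memory, guarded by two OS semaphores:
# the writer V's "full" after writing, the reader P's "full" before
# reading -- exactly the ordering described above. Layout: a 4-byte
# little-endian length prefix, then the payload.
from multiprocessing import Process, Semaphore, shared_memory
import struct

class ShmChannel:
    def __init__(self, shm_name, empty, full):
        self.shm_name, self.empty, self.full = shm_name, empty, full

    def send(self, payload):
        self.empty.acquire()                 # P(empty): wait for a free slot
        shm = shared_memory.SharedMemory(name=self.shm_name)
        shm.buf[:4] = struct.pack("<I", len(payload))
        shm.buf[4:4 + len(payload)] = payload
        shm.close()
        self.full.release()                  # V(full): publish the data

    def recv(self):
        self.full.acquire()                  # P(full): wait for data
        shm = shared_memory.SharedMemory(name=self.shm_name)
        n, = struct.unpack("<I", bytes(shm.buf[:4]))
        data = bytes(shm.buf[4:4 + n])
        shm.close()
        self.empty.release()                 # V(empty): slot is free again
        return data

def child(chan):
    chan.send(b"hello from the child process")

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(name="chan_demo", create=True, size=4096)
    chan = ShmChannel("chan_demo", Semaphore(1), Semaphore(0))
    p = Process(target=child, args=(chan,))
    p.start()
    print(chan.recv())      # the text written by the child process
    p.join()
    shm.close()
    shm.unlink()
```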

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From minforth@arcor.de@21:1/5 to Marcel Hendrix on Fri Dec 30 10:14:53 2022
    Marcel Hendrix schrieb am Freitag, 30. Dezember 2022 um 00:29:33 UTC+1:
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?

    Perhaps this helps: https://epdf.tips/multicore-application-programming-for-windows-linux-and-oracle-solaris.html
    see page 225ff

  • From minforth@arcor.de@21:1/5 to minf...@arcor.de on Fri Dec 30 10:18:14 2022
    minf...@arcor.de schrieb am Freitag, 30. Dezember 2022 um 19:14:55 UTC+1:
    Marcel Hendrix schrieb am Freitag, 30. Dezember 2022 um 00:29:33 UTC+1:
    I want to experiment with shared memory between iForth instantiations running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of
    debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable, but a nuisance.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?
    Perhaps this helps: https://epdf.tips/multicore-application-programming-for-windows-linux-and-oracle-solaris.html
    see page 225ff

    p.s. it's the document page, not the page in the epdf online viewer

  • From albert@21:1/5 to mhx@iae.nl on Fri Dec 30 21:24:23 2022
    In article <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com>,
    Marcel Hendrix <mhx@iae.nl> wrote:
    I want to experiment with shared memory between iForth instantiations
    running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
    used system call defaults to an arbitrary memory buffer, exactly what is needed.

    I have had success going the other direction: starting Forth and then
    forking the process. Naturally the dictionary space is shared (cutting
    waste), along with a piece of common space (Gbytes if need be).
    Each Forth has its own private dictionary space to add definitions
    to, so it is fully functional.
    It is based on cooperation: each Forth is supposed not to mess with
    the others' stacks and other private parts.
    This works on Linux (although I have discovered a defect in the 64-bit forking that I've worked around).
    The same compatible (!) system works on 32-bit Windows, with no need to align
    Windows and Linux around a common API, which would be hard to come by.
    That is the advantage of relying on Forth itself.
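
For readers without ciforth at hand, this fork-based sharing can be tried in a few lines of Python on POSIX (a sketch, not ciforth's actual mechanism): an anonymous shared mapping created before fork() is visible to both parent and child, with no file name and no Administrator involved.

```python
# Anonymous shared mapping + fork: the child writes, the parent reads.
# POSIX-only (os.fork does not exist on Windows).
import mmap, os

shared = mmap.mmap(-1, 4096)    # MAP_SHARED | MAP_ANONYMOUS under the hood
pid = os.fork()
if pid == 0:                    # child: fill the common space
    shared[:5] = b"hello"
    os._exit(0)
os.waitpid(pid, 0)              # parent: wait for the child, then read
print(shared[:5])               # prints b'hello'
shared.close()
```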

    Thanks to the abysmal documentation of the Windows APIs I have
    not managed to run it on 64-bit Windows. Mind you, it is supposed
    to work the same way as on 32-bit Windows. The answer you get
    is that you should use the C++ compiler, not the API.
    (Same with Linux: "you should use the shared libraries, not the
    system calls." Only C++/C compiler writers have the right to
    use system calls.)

    First experiments are successful, I am able to pass text from one iForth
    to another with literally only a single line of code. However, after hours of debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable,
    but a nuisance.

    Being root should have nothing to do with it. You are in for a
    nasty ride.

    The MS example 'C' code ignores the problem, suggesting that
    default security measures do not prevent the idea from working.
    Does anybody know how to get around this problem (or lessen the OS
    default security level a notch)?

    I had practical motivation to implement this multitasking
    for the parallel Meissel/Hedgehog-inspired ideas of counting
    primes. It worked.

    What programs do you have in mind for this extension?



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge.
    Don't sell the hide of the bear until you shot it.
    Better one bird in the hand than ten in the air.

  • From Marcel Hendrix@21:1/5 to All on Sat Jan 7 10:54:36 2023
    I think I got it. Shared memory is implemented.

    A minor annoyance is that iForth now has to be in the Administrator
    group to run on Windows 11. This means UAC kicks in when the
    program starts. I know how to fix it, but it is not on my priority list.

    Getting it to work was not so difficult after all, but once I applied it to
    iSPICE I found an unexpected twist. When iSPICE is ordered to run
    a parallel job, the command line could not contain certain parameters,
    because these were not transferred from the controlling core to the
    slaves. Here, when #|cpus ( the number of cores allotted to the job )
    is set at 8 on the controller,
    iSPICE> 1 TO #|cpus RUN-PAR
    ran the slaves with #|cpus still 8, not 1. Apparently RUN-PAR is started
    before the command line is fully evaluated.

    Below are some results. I took a simple SPICE simulation file with
    3 nested .STEP loops for a total of 24 tasks.
    Run on LTspice, this takes 363 seconds. Under the same conditions,
    it was run under iSPICE with #|cpus set between 1 and 32.

    iSPICE> .TICKER-INFO
    AMD Ryzen 7 5800X 8-Core Processor

    The best result is about 45 times faster than LTspice.
    The optimum is 12 cores, with a strange outlier at #|cpus = 10.
    An iSPICE task needs about 2 GBytes of memory (here).
    The base memory use was 6 GBytes when I ran the test, so with
    12 cores the job ran out of memory (I have only 32 GBytes here).
    Maybe with 10 cores Windows started making decisions
    regarding swap space or the working set.

    During the test I kept an eye on clock frequency and memory use.
    There was no throttling (5.6 GHz throughout), and maximum
    memory use was about 31 Gbytes. No disk activity detectable (or
    not shown by Windows :--)

    The 8 extra hyperthreads are not very useful for this kind of work.
    Once the 8 real threads are active, the simulation time does not
    really decrease further. Maybe I should stick in more RAM to
    make sure about that, or run it on a workstation with more/less
    cores.

    -marcel

    \ LTspiceXVII vs 17.1.5
    \ Total elapsed time: 363.431 seconds.

    iSPICE> 1 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 49.638 seconds elapsed. ok
    iSPICE> 2 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 25.352 seconds elapsed. ok
    iSPICE> 4 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 13.489 seconds elapsed. ok
    iSPICE> 8 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 8.618 seconds elapsed. ok
    iSPICE> 10 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 11.051 seconds elapsed. ok
    iSPICE> 12 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 7.569 seconds elapsed. ok
    iSPICE> 14 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 7.822 seconds elapsed. ok
    iSPICE> 16 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 8.255 seconds elapsed. ok
    iSPICE> 20 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 9.459 seconds elapsed. ok
    iSPICE> 24 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 8.441 seconds elapsed. ok
    iSPICE> 28 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 10.799 seconds elapsed. ok
    iSPICE> 32 TO #|cpus ok | RUN-PAR
    Job `step\step_partest.cir` finished, 12.280 seconds elapsed. ok

    \ About 363/8 = 45x faster than Analog Devices' LTspice.
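
A quick arithmetic check of the posted timings (plain Python; nothing here beyond the numbers above):

```python
# Speedups computed from the posted timings: LTspice 363.431 s versus
# the iSPICE runs per #|cpus setting.
ltspice = 363.431
runs = {1: 49.638, 2: 25.352, 4: 13.489, 8: 8.618, 10: 11.051,
        12: 7.569, 14: 7.822, 16: 8.255, 20: 9.459, 24: 8.441,
        28: 10.799, 32: 12.280}
for cpus, secs in runs.items():
    print(f"{cpus:2d} cpus: {ltspice / secs:5.1f}x vs LTspice, "
          f"{runs[1] / secs:4.1f}x vs 1 cpu")
# The best run (12 cpus, 7.569 s) comes out at about 48x vs LTspice
# and about 6.6x vs the 1-cpu iSPICE run.
```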

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Fri Jan 13 13:14:49 2023
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.

    Now without bugs. ( https://ibb.co/Qd7Xw3g )

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Fri Jan 13 13:45:57 2023
    On Friday, January 13, 2023 at 10:14:51 PM UTC+1, Marcel Hendrix wrote:
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.
    Now without bugs. ( https://ibb.co/Qd7Xw3g )

    Some details:

    iSPICE schematic ( https://ibb.co/MsfXGmw )

    ~~~
    -- iForth netlist (automatically converted from SPICE netlist)
    -- d:\dfwforth\examples\SPICE\ispice\circuits\net_lts\powerplane
    -- powerIII.cir processed by iSPICE on 14:12:52, January 12, 2023
    CIRCUIT
    5 N: ina out p s p2
    3 B: i_V1 i_Vs i_B1

    FCONST: k1 = 0.9999
    FCONST: L11=15mH
    FCONST: L22=15mH r=10
    FCONST: con_1=r
    FCONST: con_0=r

    EXPR: ex_0 -V(p)*I(V1)-V(s)*I(Vs)

    ina GND i_V1 PULSE: V1 ( -20 20 0 10n 10n 0.5ms 1ms )
    out GND con_0 RESS R2
    ina p con_1 RESS R1
    p GND s GND CI XU1 L11={L11} L22={L22} K={k1}
    s out i_Vs 0e VSOURCE Vs
    p2 GND i_B1 ex_0 BVXT B1
    END

    NO-JOB-STORE
    FALSE TO fastaccess?
    .TRAN 0 1s {1s-2ms} 0.1u
    SIMULATE
    WRITES

    ~~~

    -- iForth cmd file
    CLEAR-TASK-DATA
    .STEP param k1 0.99 1 0.0005
    SUBMIT

    ~~~

    -- original SPICE netlist
    * D:\dfwforth\examples\SPICE\ispice\circuits\net_lts\powerplane\powerIII.asc
    V1 ina 0 PULSE(-20 20 0 10n 10n 0.5ms 1ms)
    R2 out 0 {r}
    R1 ina p {r}
    XU1 p 0 s 0 CI L11={L11} L22={L22} K={k1}
    Vs s out 0
    B1 p2 0 V=-V(p)*I(V1)-V(s)*I(Vs)
    .param k1 = 0.9999
    .param L11=15mH
    .param L22=15mH r=10
    .option reltol=0.1m
    .tran 0 1s {1s-2ms} 0.1u
    .step param k1 0.99 1 0.0005
    .meas FORTH p2 @AVG pleak2
    .lib NGSPICE\CI.sub
    * LTspice total elapsed time: 527.32 seconds.
    .backanno
    .end


    -marcel

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Sat Jan 21 05:29:14 2023
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.

    With further testing I noticed another hidden Windows 'feature.'
    When running iForth as an Administrator, drag-and-drop to the iForth
    console and from/to my editor and file manager sometimes did not work.
    I suspected a bug in iForth, but digging around uncovered that this is a well-known Windows feature: a higher-privileged process (here
    iForth) is prevented from accepting drag-and-drop from a lower-
    privileged one (here the file manager). Ok, but there is a nasty twist:
    when iForth starts my editor with the S" xx" SYSTEM command, 'xx'
    apparently becomes higher-privileged too, and as a consequence,
    drag-and-drop no longer works for 'xx' (the started editor).
    This is somewhat unexpected and certainly a nuisance.

    -marcel

  • From Marcel Hendrix@21:1/5 to Marcel Hendrix on Sat Jan 21 05:54:45 2023
    On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
    I think I got it. Shared memory is implemented.

    And now I want more :--)

    It would be really great if the shared memory trick (which uses the system
    page file) worked across the network. Admittedly that is only cosmetic, because for my current purpose I could also use a shared file with a file-mapping view
    (mmap the file into the iForth virtual address space). With a plain file I would have to
    rewrite my array accesses as file operations, which is a drag. Neither Windows nor Linux appear to directly support shared memory between networked
    computers.

    Is there a Forth library with RDMA (a transparent protocol built into many network
    adapters)? If it existed I could buy a refurbished HP840 workstation and *really*
    get going (such workstations have 44 cores/88 threads and cost a mere
    2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).

    -marcel

  • From Anton Ertl@21:1/5 to Marcel Hendrix on Sat Jan 21 15:14:46 2023
    Marcel Hendrix <mhx@iae.nl> writes:
    Neither Windows
    nor Linux appear to directly support shared memory between networked computers.

    If you can live with its performance characteristics (and probably
    lack of coherence), how about mmapping an NFS-mounted file (other
    distributed file systems may be better for that purpose, though).

    Otherwise, I think there are good reasons for that lack of support.
    The latency is long, and coherence is a problem. RDMA may solve the
    coherence problem and reduce the latency, but it's still long.
    Therefore people tend to use message passing rather than shared memory
    across the network. Interestingly, in the Safe Forth concept I
    suggested avoiding shared memory and communicating between threads (or processes) with messages, even on the same machine (where shared
    memory is easy and may be cheap).
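
For illustration, the message-passing alternative in its smallest form, sketched in Python over a loopback socket (the host, port choice, and the tiny "ack" protocol are invented for the demo):

```python
# Message passing instead of shared memory: a request/reply exchange
# over loopback TCP. Half-closing the write side marks end-of-message,
# so both reads are deterministic.
import socket, threading

def server(listener):
    conn, _ = listener.accept()
    msg = b""
    while chunk := conn.recv(1024):      # read until the client half-closes
        msg += chunk
    conn.sendall(b"ack:" + msg)
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
listener.listen(1)
t = threading.Thread(target=server, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"params")
client.shutdown(socket.SHUT_WR)          # signal end-of-message
reply = b""
while chunk := client.recv(1024):        # read until the server closes
    reply += chunk
print(reply)                             # prints b'ack:params'
client.close(); t.join(); listener.close()
```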

    Is there a Forth library with RDMA (a transparent protocol build into many network
    adapters)?

    Not that I have heard of, but if you want one, you are in a good
    position to work on one.

    If it existed I could buy a refurbished HP840 workstation and *really*
    get going (such workstations have 44 cores/88 threads and cost a mere
    2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).

    Makes you wonder what's wrong with them:-)

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Paul Rubin@21:1/5 to Marcel Hendrix on Sat Jan 21 08:16:54 2023
    Marcel Hendrix <mhx@iae.nl> writes:
    Is there a Forth library with RDMA (a transparent protocol build into
    many network adapters)? If it existed I could buy a refurbished HP840 workstation and *really* get going (such workstations have 44 cores/88 threads and cost a mere 2000 Euros, 15 - 20k new, refurbished RDMA
    nic's are 20 Euros...).

    Unless you had a bunch of those workstations networked together, why
    would you need RDMA, assuming your Forth program is running on the
    workstation?

    I see one here for 1000 USD, with 44 cores and 128GB ram:

    https://www.ebay.com/itm/175576911219

    That is really impressive. Anton asks what is wrong with them.
    Obviously they are old and power hungry, but less so than it seems:

    https://www.intel.com/content/www/us/en/products/sku/91317/intel-xeon-processor-e52699-v4-55m-cache-2-20-ghz/specifications.html

    They use 14nm lithography and have 2.2GHz base frequency, which is not
    all that fast. They were introduced in 2016. This 44 core system is
    almost definitely slower than a 32 core Threadripper, but might beat a
    16 core Ryzen. On the other hand those will cost more up front,
    especially with the memory figured in. If you are running the
    workstation 24/7 then the newer hardware will probably pay for itself in
    power savings quickly, but if you only run it part of the time it might
    be ok.

    Now I feel a little bit interested but don't have an actual use for such
    a box. Spinning up some Hetzner cloud servers for an occasional compute
    task is pretty cheap.

    Maybe you could implement MPI (does anyone still use that?) for your
    Spice stuff.

  • From Marcel Hendrix@21:1/5 to Anton Ertl on Sat Jan 21 08:27:45 2023
    On Saturday, January 21, 2023 at 4:38:11 PM UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    [..]
    If it existed I could buy a refurbished HP840 workstation and *really*
    get going (such workstations have 44 cores/88 threads and cost a mere
    2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).
    Makes you wonder what's wrong with them:-)

    They come with a 3 year warranty, but I have no idea who dares buy
    that stuff for their business, and how these resellers (there are many)
    can prosper? I'll find out :--)

    -marcel

  • From Marcel Hendrix@21:1/5 to Anton Ertl on Sat Jan 21 08:59:32 2023
    On Saturday, January 21, 2023 at 4:38:11 PM UTC+1, Anton Ertl wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    [..]
    If you can live with its performance characteristics (and probably
    lack of coherence), how about mmapping an NFS-mounted file (other
    distributed file systems may be better for that purpose, though).

    Hmm, given the very limited functionality I need, this might be
    perfectly adequate.

    -marcel

  • From Marcel Hendrix@21:1/5 to Paul Rubin on Sat Jan 21 08:51:17 2023
    On Saturday, January 21, 2023 at 5:16:57 PM UTC+1, Paul Rubin wrote:
    Marcel Hendrix <m...@iae.nl> writes:
    [..]
    Unless you had a bunch of those workstations networked together, why
    would you need RDMA, assuming your Forth program is running on the workstation?

    I will put the workstation(s) in the attic, where I can't hear or feel them. My desktop PC dispatches and controls the runs and displays the results.

    This 44 core system is almost definitely slower than a 32 core
    Threadripper, but might beat a 16 core Ryzen.

    That costs 7,500 Euros around here, or 4 refurbished HP boxes...

    It will be more fun than tweaking a game PC with liquid metal
    and nitrogen for 1% higher frame rates.

    -marcel

  • From mhx@21:1/5 to All on Tue Feb 27 23:41:10 2024
    I have been polishing my shared memory application (iSPICE) a bit more.
    The benchmark I previously showed compared running a circuit simulation
    with a variable number of communicating CPUs. Only a minimum amount of data
    is shared (a page with published parameters and achieved results, plus the ready! flags). With this setup I got about a factor of 3 improvement for
    8 CPUs. I hoped to improve this factor a bit with better hardware and maybe some software tweaking.

    What I didn't try until today was checking how fast the circuit simulation
    ran on a single CPU, *not* using the shared memory framework. And indeed,
    that is a problem, in that without shared memory the runtime is *3 times
    less* than with shared memory. In other words, there is no net gain in
    having 8 mem-shared CPUs. As an additional check I started the circuit run
    in 3 separate windows. They all achieved the same speed as the single non-shared run, proving that the hardware (cpu/memory/disk) is amply sufficient to provide an 8 times speed-up.

    I will now start working on Anton's suggestion of a shared file. Or maybe
    I should try this on Linux first; maybe shared memory works better there.

    -marcel

  • From minforth@21:1/5 to All on Wed Feb 28 01:46:14 2024
    Perhaps this is the reason why:

    Windows shared memory is not the same as Linux's; only some things are similar.

    The Unix mmap() API is practically equivalent to the CreateFileMapping/ MapViewOfFile Windows API. Both can map files and/or can create shared (anonymous) maps that are backed by the swap device (if any). As a matter of fact, glibc uses anonymous mmap() to implement malloc() when the requested memory size is sufficiently large.

    The biggest difference is the memory allocation granularity. Linux's is 4K and Windows's is 64K. If it's important to have, say, arbitrary 8K pages mapped to specific 8K destinations, you are stuck on Windows; it just can't be done.

    Another difference: on Linux you can mmap a new page over the top of an existing page, effectively replacing the first mapping. On Windows you can't do this,
    but must instead destroy the entire view and rebuild it in whatever new layout is required. So if the "view" contains 1024 pages and
    1 page changes, then on Linux you can just change that one page. On Windows
    you must drop all 1024 pages and re-map the same 1023 pages plus the one new page.

    IOW, with only minimal data to share, Linux should be faster. A normal file
    will probably do the job already, since it is most probably buffered in memory anyway.
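
The granularity difference is easy to verify from Python, which exposes it as mmap.ALLOCATIONGRANULARITY (a quick sketch; 4K and 64K are the typical Linux and Windows values, not guaranteed for every system):

```python
# mmap view offsets must be multiples of the allocation granularity:
# typically 4096 on Linux, 65536 on Windows -- the constraint behind
# the 8K-page example above.
import mmap, tempfile

gran = mmap.ALLOCATIONGRANULARITY
print(gran)

with tempfile.TemporaryFile() as f:
    f.truncate(4 * gran)
    view = mmap.mmap(f.fileno(), gran, offset=2 * gran)  # aligned offset: OK
    view[:2] = b"ok"
    view.close()
```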

  • From mhx@21:1/5 to All on Wed Feb 28 08:29:28 2024
    This is certainly interesting. Previously I wrote:

    Only a minimum amount of data is shared (a page with published
    parameters and achieved results, plus the ready! flags).

    However, I see now that I asked for 'arbitrary size' in the system
    call. Combined with a locked address, this could cause Windows to
    swap a huge amount of memory on accesses, explaining the slow
    execution.

    I will have to spend more time reading the documentation after all.

    Thanks a lot everybody, for the helpful comments!

    -marcel

  • From albert@spenarnc.xs4all.nl@21:1/5 to mhx on Wed Feb 28 11:40:00 2024
    In article <2ec6da768657cb7e0838af11eb2d209e@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    I have been polishing my shared memory application (iSPICE) a bit more.
    The benchmark I previously showed compared running a circuit simulation
    with a variable number of communicating CPUs. Only a minimum amount of data is shared (a page with published parameters and achieved results, plus the ready! flags). With this setup I got about a factor of 3 improvement for
    8 CPUs. I hoped to improve this factor a bit with better hardware and maybe some software tweaking.

    What I didn't try until today was checking how fast the circuit simulation ran on a single CPU, *not* using the shared memory framework. And indeed, that is a problem, in that without shared memory the runtime is *3 times less* than with shared memory. In other words, there is no net gain in
    having 8 mem-shared cpu's. As a additional check I started the circuit run
    in 3 separate windows. They all achieved the same speed as the single run non-shared version, proving that the hardware (cpu/memory/disk) is amply sufficient to provide an 8 times speed-up.

    I will now start working on Anton's suggesting of a shared file. Or maybe
    I should try this on Linux first, maybe shared memory works better there.

    I simply use the clone system call on linux ( NR number is 56 for 64 bits)

    ( THREAD-PET KILL-PET PAUSE-PET ) CF: ?LI \ B5dec2
    "CTA" WANTED "-syscalls-" WANTED HEX
    \ Exit a thread. Indeed this is exit().
    : EXIT-PET 0 _ _ __NR_exit XOS ;
    \ Do a preemptive pause. ( abuse MS )
    : PAUSE-PET 1 MS ;
    \ Create a thread with dictionary SPACE. Execute XT in thread.
    : THREAD-PET ALLOT CTA CREATE RSP@ SWAP RSP! R0 @ S0 @
    ROT RSP! 2 CELLS - ( DSP) , ( TASK) , ( pid) 0 ,
    DOES> DUP @ >R SWAP OVER CELL+ @ R@ 2! ( clone S: tp,xt)
    100 R> _ __NR_clone XOS DUP IF
    ( Mother) DUP ?ERRUR SWAP 2 CELLS + ! ELSE
    ( Child) DROP RSP! CATCH DUP IF ERROR THEN EXIT-PET THEN ;
    \ Kill a THREAD-PET , preemptively. Throw errors.
    : KILL-PET >BODY 2 CELLS + @ 9 _ __NR_kill XOS ?ERRUR ;
    DECIMAL

    The idea is
    1000 ( dictionary space ) THREAD-PET extra

    Now you run an xt in it as follows:
    xt extra
    The xt runs until it does an EXIT-PET, or is killed by a KILL-PET.

    In r10par.frt it runs 41 s on one processor and 27 s on two
    processors for 10^12. This was more a demonstration of parallel
    processing; the communication and work-load balancing kill the
    advantages for more processors.

    (This was prime counting)
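    The two-processor prime-counting split can be sketched outside Forth.
    Below is a hedged Python analogue (fork stands in for Albert's
    clone-based threads, a shared int64 array for the result cells; the
    naive counter and the limit are illustrative, not r10par.frt):

    ```python
    import multiprocessing as mp

    def count_primes(lo, hi):
        # Naive trial division; the point here is the work split, not speed.
        def is_prime(n):
            if n < 2:
                return False
            d = 2
            while d * d <= n:
                if n % d == 0:
                    return False
                d += 1
            return True
        return sum(1 for n in range(lo, hi) if is_prime(n))

    def worker(slot, counts, lo, hi):
        counts[slot] = count_primes(lo, hi)      # publish the partial count

    def demo(limit=10_000, workers=2):
        ctx = mp.get_context("fork")             # plain fork stands in for clone
        counts = ctx.Array("q", workers)         # shared int64 result cells
        step = limit // workers
        procs = []
        for i in range(workers):
            hi = limit if i == workers - 1 else (i + 1) * step
            procs.append(ctx.Process(target=worker, args=(i, counts, i * step, hi)))
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return sum(counts)

    print(demo())   # 1229 primes below 10_000
    ```

    The same shape appears in the Forth version: each pet writes into its
    own cell, so no locking is needed, and the mother only reads after
    all children have exited.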

    Maybe try something simple before jumping into sockets and mapped
    files.

    The CTA word carves out a small dictionary space to be used by the
    new process, plus stacks and user space.
    This is utterly system-dependent, but in the ciforth model it
    is just one screen, and portable across 32/64-bit ARM/x86 Linux/Windows.
    It helps if you have a simple Forth to begin with ;-)
    (CTA is used in cooperative multitasking as well.)


    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

  • From mhx@21:1/5 to All on Wed Feb 28 11:11:12 2024
    Maybe try something simple before jumping into sockets and mapped
    files.

    I have tried that way for the past 20 years already, and indeed it
    works fine. However, my simple example shown above needs 24
    threads/processes/cores (whatever), each having about 2 to 4 GB of memory.

    -marcel

  • From albert@spenarnc.xs4all.nl@21:1/5 to mhx on Wed Feb 28 13:25:26 2024
    In article <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    Maybe try something simple before jumping into sockets and mapped
    files.

    I have tried that way for the past 20 years already, and indeed it
    works fine. However, my simple example shown above needs 24
    threads/processes/cores (whatever), each having about 2 to 4 GB of memory.

    I have lost context, can you tell more about the simple example?
    (My provider purges old messages swiftly)

    And what about
    lina -g 96000 lina96G

    lina96G -e
    ..
    WANT UNUSED
    S[ ] OK UNUSED S>D DEC.

    0,000,000,000,000,000,000,000,000,000,100,730,247,992
    S[ ] OK
    I'm sure most Forths can do something similar.
    (This is overcommitting, though not on my HP workstation with 256 GByte of RAM.)



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

  • From mhx@21:1/5 to All on Sat Mar 2 18:14:18 2024
    I have lost context, can you tell more about the simple example?
    (My provider purges old messages swiftly)

    I was in the exploring/debugging phase and have only very recently
    completed the experiments.

    The final results are that with shared memory, on Windows
    11, it is possible to get an almost linear speedup with the
    number of cores in use. The way shared memory is implemented
    on Windows is with a memory-mapped file that uses the OS
    pagefile as backing store. The file is guaranteed not to be
    swapped out under reasonable conditions, and Windows keeps its
    management invisible to users.

    I tried to make the file as small as possible. For this
    iForth benchmark it was 11 int64s (11 * 8 bytes) and 24
    extended floats (24 * 16 bytes), about 1/2 KByte. The file
    is touched very infrequently: just 24 result writes and
    then a loop over the 11 words to see if all CPUs have
    finished (checked at 10 ms intervals). At the moment I have
    no idea what happens with very frequent reads/writes (it is
    not the intended type of use).
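    That layout (a block of int64 done-flags followed by result slots,
    polled at 10 ms intervals) can be sketched portably with a file-backed
    mmap, which mirrors what the Windows mapping does with the pagefile.
    This is a hedged Python analogue, not the iForth code: Python has no
    16-byte extended floats, so doubles stand in, and the per-slot result
    values are made up.

    ```python
    import mmap, os, struct, tempfile, time
    import multiprocessing as mp

    N_FLAGS, N_RESULTS = 11, 24
    FLAG_BYTES = N_FLAGS * 8                # 11 int64 "done" flags
    PAGE = FLAG_BYTES + N_RESULTS * 8       # doubles stand in for extended floats

    def worker(path, slot):
        # One simulation process: map the shared file, write its result,
        # then raise its done flag (result first, flag last).
        with open(path, "r+b") as f:
            m = mmap.mmap(f.fileno(), PAGE)
            struct.pack_into("<d", m, FLAG_BYTES + slot * 8, slot * 1.5)
            struct.pack_into("<q", m, slot * 8, 1)
            m.close()

    def demo(workers=4):
        fd, path = tempfile.mkstemp()
        os.ftruncate(fd, PAGE)              # zero-filled shared page
        os.close(fd)
        try:
            with open(path, "r+b") as f:
                m = mmap.mmap(f.fileno(), PAGE)
                ctx = mp.get_context("fork")
                procs = [ctx.Process(target=worker, args=(path, i))
                         for i in range(workers)]
                for p in procs:
                    p.start()
                # CPU #0: loop over the flags at 10 ms intervals until all done.
                while any(struct.unpack_from("<q", m, i * 8)[0] == 0
                          for i in range(workers)):
                    time.sleep(0.010)
                for p in procs:
                    p.join()
                out = [struct.unpack_from("<d", m, FLAG_BYTES + i * 8)[0]
                       for i in range(workers)]
                m.close()
            return out
        finally:
            os.unlink(path)

    print(demo())   # [0.0, 1.5, 3.0, 4.5]
    ```

    Because each worker owns one flag and one result slot, no locking is
    needed; ordinary store ordering (result before flag) is enough for
    this sketch.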

    [During debugging I was lucky. When setting the number of
    working CPUs interactively, completely wrong results
    were obtained. This happened because #|cpus was defined
    as a VALUE in a configuration file. When changing #|cpus
    from the console, the value in sconfig.frt stayed the
    same (of course), while all the dynamically started cores
    used the on-disk value, not the value I typed in on
    CPU #0. Easy to understand in hindsight, but this type
    of 'black-hole' mistake can take hours to find in a 7000+
    line program. For some reason I just knew that it had to
    be #|cpus that was causing the problem.]

    The benchmark is a circuit file that defines a voltage
    source and a 2-resistor divider, all parameterized.
    These values were swept for a total of 24 different
    circuits. To calculate the result for one of the
    combinations takes 2.277s on a single core with iSPICE,
    or 24 x that value, 54.648s, for all 24 combinations.
    In the benchmark the 24 simulations are spread out over
    11 processes on an 8-core CPU:

    iSPICE> .ticker-info
    AMD Ryzen 7 5800X 8-Core Processor
    TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
    Do: < n TO PROCESSOR-CLOCK RECALIBRATE >

    The aim is to get an 8 times speedup, or more if
    hyperthreads bring something, and do all combinations
    in less than 6.831 seconds. The best I managed is
    7.694s or about 7.67 "cores", which I consider not
    that bad. Here are the details (run 4 times):

    #cpus  time [s]  perf. ratio
      1     49.874      1.46
      2     25.314      2.39
      3     17.391      3.23
      4     13.335      4.11
      5     10.565      5.17
      6      9.468      5.71
      7      8.712      6.22
      8      7.694      7.67
      9      7.260      7.37
     10      7.874      6.72
     11      7.856      6.73  ok

    For your information: Running the same 24 variations
    with LTspice 17.1.15, one of the fastest SPICE
    implementations currently available, takes 382.265
    seconds, almost exactly 7 times slower than the iSPICE
    single-core run. Using 8 cores (LTspice pretends to
    use 16 threads), that ratio becomes 62 times.

    In the above table the performance ratio for a single
    CPU is 1.46 (1.46 times faster than doing the 24
    simulations on a single core *without* shared memory),
    which might seem strange. I think the phenomenon is
    caused by the fact that a single combination takes
    only 2.277s, which may be too short for the processor
    (or Windows) to ramp up the clock frequency. If the
    performance factor is normalized by the timing for
    1 CPU, the maximum speedup decreases to 5.25.
    We'll see what happens on an HPZ840.
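    The 5.25 figure is just the ratio of two table entries, spelled out:

    ```python
    # Normalizing the best performance ratio by the 1-CPU ratio,
    # both taken from the table above.
    best_ratio = 7.67       # 8 CPUs with the shared-memory framework
    one_cpu_ratio = 1.46    # 1 CPU with the shared-memory framework
    print(round(best_ratio / one_cpu_ratio, 2))   # 5.25
    ```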

    -marcel

  • From albert@spenarnc.xs4all.nl@21:1/5 to mhx on Sun Mar 3 12:19:13 2024
    In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    [..]
    The final results are that with shared memory, on Windows
    11, it is possible to get an almost linear speedup with the
    number of cores in use. The way shared memory is implemented
    on Windows is with a memory-mapped file that uses the OS
    pagefile as backup. The file is guaranteed to not be swapped
    out under reasonable conditions, and Windows keeps its
    management invisible for users.

    Linear speedup? That must depend on the program.
    Can I surmise that the context is that you're comparing your
    version/clone iSPICE with LTspice?

    [..]
    For your information: Running the same 24 variations
    with LTspice 17.1.15, one of the fastest SPICE
    implementations currently available, takes 382.265
    seconds, almost exactly 7 times slower than the iSPICE
    single-core run. Using 8 cores (LTspice pretends to
    use 16 threads), that ratio becomes 62 times.

    So LTspice becomes slower by using 8 cores,
    going from 7 times slower to 62 times slower than iSPICE.
    There must be a mistake here.

    [..]
    We'll see what happens on an HPZ840.

    You are going to run Windows 11 on the HP workstation?
    I'm going to install a Linux version, for I want to
    experiment with CUDA.


    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

  • From mhx@21:1/5 to albert@spenarnc.xs4all.nl on Wed Mar 6 10:54:55 2024
    albert@spenarnc.xs4all.nl wrote:

    In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    I have lost context, can you tell more about the simple example?
    [..]
    The final results are that with shared memory, on Windows
    11, it is possible to get an almost linear speedup with the
    number of cores in use.
    [..]
    Linear speedup? That must depend on the program.
    Can I surmise that the context is that you're comparing your
    version/clone iSpice with LTSpice.

    The example is *not* about trying to speed up programs
    by adding threads to work on parts that can be parallelized.
    A circuit simulator is used as the example here. Circuits
    contain on average about 30% of operations that can be done
    in parallel, so a fine-grained threaded approach with an
    infinite number of threads can at most give a 30% speedup.
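    That ceiling is Amdahl's law; a quick check of the 30% figure shows
    the bound tends to 1/0.7, about 1.43x, i.e. at best roughly 30% less
    runtime no matter how many threads are added:

    ```python
    def amdahl(p, n):
        # Maximum speedup with fraction p of the work parallelizable
        # over n workers: serial part (1 - p) plus parallel part p / n.
        return 1.0 / ((1.0 - p) + p / n)

    p = 0.30                       # ~30% of circuit operations parallelize
    for n in (2, 8, 1_000_000):    # the last entry approximates n -> infinity
        print(n, round(amdahl(p, n), 3))
    ```

    This is why the coarse-grained, one-simulation-per-core approach is
    used instead.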

    Most circuit simulation problems cannot be solved with a
    single simulation. In almost every case one wants to re-run
    a job with small variations on the original specification.
    The variations can be on the circuit components themselves,
    on environmental conditions like temperature, humidity, and
    noise, on input sources or output loads, or even on
    parameters of their (digital) control algorithms.
    Between 10 and many thousands of simulations could be
    necessary. At the top level, this problem is trivial to solve
    by editing the input netlist with the necessary changes,
    re-running the simulation, and storing the results in a database.
    When all runs are done, the data is evaluated by querying.

    In practice, it is difficult to keep the administration
    straight if the above is done by hand. What I am looking for
    is a simple way to specify variations, create a list
    of all the simulations needed, then distribute the tasks
    to as many cpu cores as are available (locally, on the network,
    or in the Cloud), combine the results, and generate reports.

    To do this in Forth, I found it useful to use either shared
    memory, or a shared file. The post is about experiments with
    shared memory (useful when the number of cores is less than
    256 and the main memory requirement is less than 1 TByte.)
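    The specify-variations / distribute / combine workflow described here
    can be sketched with a process pool. This is a hedged illustration,
    not iSPICE: simulate() is a hypothetical stand-in for one circuit
    run (the 2-resistor divider from the benchmark, computed directly),
    and the parameter values are made up.

    ```python
    import multiprocessing as mp

    def simulate(params):
        # Stand-in for one simulation: output of a 2-resistor divider
        # driven by source voltage v.
        v, r1, r2 = params
        return v * r2 / (r1 + r2)

    def sweep(variations, cores):
        # Distribute the variation list over the available cores and
        # collect all results in input order.
        with mp.get_context("fork").Pool(cores) as pool:
            return pool.map(simulate, variations)

    # A small parameterized sweep: 2 source voltages x 3 divider values.
    variations = [(v, r1, 1000.0) for v in (5.0, 10.0)
                                  for r1 in (250.0, 500.0, 1000.0)]
    print(sweep(variations, cores=2))
    ```

    Scaling this up is then a matter of generating the variation list
    from the netlist parameters and storing the returned results.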

    The concrete example is to run N variations of a circuit on
    an 8-core system with 32 GB of memory, with the features I
    describe above. The question was: is it possible to get
    a speedup of 8 when the benchmark runs on an 8-core CPU?

    iSPICE> .ticker-info
    AMD Ryzen 7 5800X 8-Core Processor
    TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
    Do: < n TO PROCESSOR-CLOCK RECALIBRATE >

    The aim is to get an 8 times speedup, or more if
    hyperthreads bring something, and do all combinations
    in less than 6.831 seconds. The best I managed is
    7.694s or about 7.67 "cores", which I consider not
    that bad. Here are the details (run 4 times):

    #cpus  time [s]  perf. ratio
      1     49.874      1.46
      2     25.314      2.39
      3     17.391      3.23
      4     13.335      4.11
      5     10.565      5.17
      6      9.468      5.71
      7      8.712      6.22
      8      7.694      7.67
      9      7.260      7.37
     10      7.874      6.72
     11      7.856      6.73  ok


    For your information: Running the same 24 variations
    with LTspice 17.1.15, one of the fastest SPICE
    implementations currently available, takes 382.265
    seconds, almost exactly 7 times slower than the iSPICE
    single-core run. Using 8 cores (LTspice pretends to
    use 16 threads), that ratio becomes 62 times.

    I realize now that this comparison of iSPICE with LTspice
    can confuse the reader. It does not matter at all for this
    benchmark which SPICE simulator is used.

    So LT spice becomes slower by using 8 cores
    going from 7 times slower to 62 time slower than iSPICE.
    There must be a mistake here.

    There is no mistake. LTspice is 7 times slower than iSPICE for
    the specific type of task used here. Although LTspice has
    a mechanism to run multiple variations, and claims to use
    8 cores / 16 threads, it does not appear to use them as
    efficiently as iSPICE does using shared memory.

    [..]
    We'll see what happens on an HPZ840.

    You are going to run Windows 11 on the HP work station?
    I'm going to install a Linux version, for I want to
    experiment with CUDA.

    I certainly want to see what happens if I run iSPICE on
    my 44-core HPZ840 :--) The fastest way to implement that
    should be to install Windows 10 or 11 on the HP. However,
    if that proves problematic I have no problem using Linux.
    I did not try iSPICE on Linux/WSL2 yet and I probably will
    do that first.

    I also want to experiment with CUDA (BTW, why not OpenCL,
    did you already find arguments against that route?).
    However, that would be to investigate a new way of circuit
    simulation that does not use the standard SPICE algorithms.

    -marcel
