Discussion:
Parallel Forth on a 44-core machine
mhx
2024-08-17 15:34:31 UTC
My refurbished HP Z840 (from 2016) is finally running iForth.

Initially, the HP had a problem with its fan control and was
unbearably loud. I fixed that by replacing a failed transistor
that was being used as a temperature sensor. It's incredible how
large, knowledgeable, and helpful the HP community is, and how
well engineered and documented the Z series workstations are.

The Z840 is prepared for Linux and Windows 10. Because it came
with Windows pre-installed, I tried that first.

Installing iForth was the easy part; some of the other tools
(WSL2, Octave, MATLAB, VS) took quite a bit longer.

Although the Z840 is equipped with modern 1TB Samsung SSDs, these
are connected to the SATA interface and run at a maximum speed of
only 500MB/sec (instead of the 12GB/s we are now used to). I was
afraid that would become a bottleneck, but for now it will do.

Below are the results of the first experiments with iSPICE (a
SPICE-compatible circuit simulator that is written in iForth and supports
explicit parallel processing). I gave it a circuit of an SMPS with
44 component variations. Depending on the number of allotted cores,
iSPICE distributes the 44 jobs over the available processors and
stores the results in text and graphical formats. As can be seen
below, with 44 processors the tasks finish 22x faster than
with a single core. The CPU temperatures stay below 62 deg C.

The maximum RAM use is 64GB (this machine has 128GB).

Disk I/O is clearly a problem to be worked on; for now I
fake it by spacing the benchmark runs 30 seconds apart.

Scaling with the number of processors appears to be linear
and quite a bit better than it is on my AMD Ryzen 5800X,
although well below the theoretical factor of 44x.

iSPICE> .TICKER-INFO
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
TICKS-GET uses os time & PROCESSOR-CLOCK 3000MHz
Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
ok
iSPICE> BENCHTEST
Starting 1 process to run 44 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 52.627 seconds.
waiting 30 seconds for flush to disk . . .

Starting 11 processes to run 44 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 6.130 seconds.
waiting 30 seconds for flush to disk . . .

Starting 22 processes to run 44 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 3.255 seconds.
waiting 30 seconds for flush to disk . . .

Starting 44 processes to run 44 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 2.30 seconds.
waiting 30 seconds for flush to disk . . .

% cpus   time [s]   performance ratio
     1    52.921       1
    11     6.521       8.115473
    22     3.668      14.427753
    44     2.431      21.76923  ok

-marcel
minforth
2024-08-18 09:28:09 UTC
Impressive! A PCIe NVMe drive will be a boost, but don't expect
too much when you already have so much RAM. And electric power. ;-)

My experiments with parallel threads were a bit sobering. You
really need rather isolated subprocesses that require little
synchronisation. Otherwise the slowest process plus additional
syncing costs can eat up all the expected benefits. Nothing new.
mhx
2024-08-18 11:31:37 UTC
Post by minforth
Impressive! A PCIe NVMe drive will be a boost, but don't expect
too much when you already have so much RAM. And electric power. ;-)
I tried a RAM drive (from AMD), but it has a throughput of only 50MB/s,
10x slower than the SATA 6 Gb/s-connected Samsung SSD (500MB/s). I am a
bit puzzled why that is so devastatingly slow.
Post by minforth
My experiments with parallel threads were a bit sobering. You
really need rather isolated subprocesses that require little
synchronisation.
Yes, that is Amdahl's law. We constantly struggled with that
for tForth. Fine-grained parallelism never gave us good results.
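For reference: with a serial fraction s, Amdahl's law limits the
speedup on N cores to 1/(s + (1-s)/N). Working backwards from the
21.8x measured above at N = 44 gives s of roughly 2.4%, so a serial
part of only a couple of percent already halves the ideal 44x.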
Post by minforth
Otherwise the slowest process plus additional
syncing costs can eat up all the expected benefits. Nothing new.
A new (to me) thing was that processes slow down enormously from
accessing shared global variables (depending on their physical
location), even when no locks are needed/used. For iSPICE such
variables are in OS managed shared memory (aka the swap file)
and are used very infrequently.

-marcel
a***@spenarnc.xs4all.nl
2024-08-18 12:47:39 UTC
Post by mhx
Post by minforth
Impressive! A PCIe NVMe drive will be a boost, but don't expect
too much when you already have so much RAM. And electric power. ;-)
I tried a RAM drive (from AMD), but it has a throughput of only 50MB/s,
10x slower than the SATA 6 Gb/s-connected Samsung SSD (500MB/s). I am a
bit puzzled why that is so devastatingly slow.
Post by minforth
My experiments with parallel threads were a bit sobering. You
really need rather isolated subprocesses that require little
synchronisation.
Yes, that is Amdahl's law. We constantly struggled with that
for tForth. Fine-grained parallelism never gave us good results.
Post by minforth
Otherwise the slowest process plus additional
syncing costs can eat up all the expected benefits. Nothing new.
A new (to me) thing was that processes slow down enormously from
accessing shared global variables (depending on their physical
location), even when no locks are needed/used. For iSPICE such
variables are in OS managed shared memory (aka the swap file)
and are used very infrequently.
That agrees with my experience. Parallel processes work with the
same image. The protocol is that one process writes to a shared variable,
the other reads. The last process signals the chain that it is
ready. All processes are busy waiting on the signal to stop and to
pass it down the chain.
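
A minimal sketch of such a stop chain, written with C11 threads and
atomics purely for illustration (the real setup uses separate processes
sharing one image, not threads, and all names below are made up):

/* Daisy-chained stop signal: worker i busy-waits on its own flag;
   the last worker raises its flag when it is ready, and every
   worker passes the signal down to its predecessor.              */
#include <stdatomic.h>
#include <threads.h>
#include <stdio.h>

#define NWORKERS 4

/* One shared variable per worker.  Note that these flags sit next
   to each other in memory -- exactly the false-sharing situation
   discussed further down in this thread.                          */
static atomic_int stop_flag[NWORKERS];

static int worker(void *arg)
{
    int id = (int)(size_t)arg;

    /* ... the real work would go here ... */

    if (id == NWORKERS - 1)                /* last one signals "ready" */
        atomic_store(&stop_flag[id], 1);

    while (!atomic_load(&stop_flag[id]))   /* busy-wait for the signal */
        ;
    if (id > 0)
        atomic_store(&stop_flag[id - 1], 1);  /* pass it down the chain */

    printf("worker %d stopping\n", id);
    return 0;
}

int main(void)
{
    thrd_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        thrd_create(&t[i], worker, (void *)(size_t)i);
    for (int i = 0; i < NWORKERS; i++)
        thrd_join(t[i], NULL);
    return 0;
}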

That was on Linux with AMD.
Was your experience on MS Windows with Intel?
Post by mhx
-marcel
Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
mhx
2024-08-18 13:33:27 UTC
[..]
Post by a***@spenarnc.xs4all.nl
Post by mhx
A new (to me) thing was that processes slow down enormously from
accessing shared global variables (depending on their physical
location), even when no locks are needed/used. For iSPICE such
variables are in OS managed shared memory (aka the swap file)
and are used very infrequently.
That agrees with my experience. Parallel processes work with the
same image. The protocol is that one process writes to a shared variable,
the other reads. The last process signals the chain that it is
ready. All processes are busy waiting on the signal to stop and to
pass it down the chain.
That was on Linux with AMD.
Was your experience on MS Windows with Intel?
What you seem to describe is processes interfering when they want
access to the same (multi-byte) variable. It is obviously tricky to
read a value byte-by-byte when somebody else is updating it
byte-by-byte.
What I meant is severe slowdown when reading variables that are
physically *close* to variables that belong to another process.
It happens for both AMD and Intel on both Windows and Linux.
Spacing such variables farther apart has dramatic impact but
is quite inconvenient in most cases.

I don't recall that transputers had these problems. It may have
to do with the physical memory read/write hardware.

-marcel
Anton Ertl
2024-08-18 13:42:33 UTC
Post by mhx
What I meant is severe slowdown when reading variables that are
physically *close* to variables that belong to another process.
That is known as false sharing. The cache coherence protocols work at
the granularity of a cache line (usually 64 bytes). If core A
writes to a variable, and core B, say, reads one in the same cache
line, the cache coherence protocol first makes that cache line
modified by core A (and every other core has to invalidate that cache
line), and then core B has to wait until core A sends out the data to
the other cores.
Post by mhx
It happens for both AMD and Intel on both Windows and Linux.
Spacing such variables farther apart has dramatic impact but
is quite inconvenient in most cases.
Yes, but if you want performance, you have to rearrange your data to
avoid false sharing.
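A minimal sketch of such a rearrangement, assuming 64-byte cache
lines (all names invented for illustration):

#include <stdalign.h>              /* alignas */

#define NWORKERS   44
#define CACHE_LINE 64              /* typical x86 cache-line size */

/* False sharing: counters of neighbouring workers share cache lines,
   so every write by one core invalidates a line its neighbours read. */
long packed_counter[NWORKERS];

/* Rearranged: one full cache line per worker, and the array itself is
   line-aligned, so a write by core A never dirties a line that core B
   is reading.                                                         */
struct worker_slot {
    long counter;
    char pad[CACHE_LINE - sizeof(long)];
};
static alignas(CACHE_LINE) struct worker_slot spaced_counter[NWORKERS];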
Post by mhx
I don't recall that transputers had these problems.
Transputers have no shared memory and therefore no cache coherence
protocols.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net
mhx
2024-08-18 14:32:16 UTC
Post by Anton Ertl
Post by mhx
What I meant is severe slowdown when reading variables that are
physically *close* to variables that belong to another process.
Yes, but if you want performance, you have to rearrange your data to
avoid false sharing.
Do you know if shared memory as provided by the OS (or Windows)
has these problems too?

-marcel
Anton Ertl
2024-08-18 15:14:52 UTC
Post by mhx
Do you know if shared memory as provided by the OS (or Windows)
has these problems too?
Shared memory has false sharing problems, however that sharing is
arranged. The slowdown comes from the hardware. See
<https://en.wikipedia.org/wiki/False_sharing>.

The I-cache/D-cache ping-pong when you have writable data close to
executed code on AMD64 is also false sharing, this time within one
core.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net
minforth
2024-08-18 14:01:04 UTC
Post by mhx
What I meant is severe slowdown when reading variables that are
physically *close* to variables that belong to another process.
It happens for both AMD and Intel on both Windows and Linux.
Spacing such variables farther apart has dramatic impact but
is quite inconvenient in most cases.
IIRC I once read a recommendation to group shared variables in
(larger) structs. With structs you have control over their memory
spacing and can improve cache behaviour.
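Something along those lines, as a sketch (assuming 64-byte lines and
invented member names):

#define CACHE_LINE 64

/* All variables written by the same process go into one struct, padded
   out to a whole cache line (and allocated line-aligned), so data owned
   by different processes never meets in the same line.                  */
struct per_process_vars {
    double v_out, i_out, t_sim;                  /* owned by one process */
    char   pad[CACHE_LINE - 3 * sizeof(double)]; /* spacer to next owner */
};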
mhx
2024-08-28 09:29:37 UTC
Post by minforth
Impressive! A PCIe NVMe drive will be a boost, but don't expect
too much when you already have so much RAM. And electric power. ;-)
I didn't catch your drift there until I found out why there are no
really fast RAM drives. The fastest drive is no drive at all, and
that is possible by writing the simulation data to a temp file.
Windows has a special attribute for that ( _O_SHORT_LIVED ) and
Linux has shm.

This means that iSPICE does not need a fast disk as long as there
is enough free memory. And nicely, I didn't have to change the code
much: only the file attributes for one CREATE-FILE.
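
In C terms the change amounts to something like this (a sketch, not
the actual iForth code; open_scratch is an invented helper):

#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#include <sys/stat.h>

/* _O_SHORT_LIVED asks the OS to keep the file in the cache and to
   avoid flushing it to disk if possible.                           */
int open_scratch(const char *name)
{
    return _open(name, _O_CREAT | _O_RDWR | _O_BINARY | _O_SHORT_LIVED,
                 _S_IREAD | _S_IWRITE);
}
#else
#include <stdio.h>
#include <fcntl.h>

/* On Linux, a file under the RAM-backed tmpfs mount /dev/shm never
   needs to touch the disk at all.                                  */
int open_scratch(const char *name)
{
    char path[256];
    snprintf(path, sizeof path, "/dev/shm/%s", name);
    return open(path, O_CREAT | O_RDWR, 0600);
}
#endif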

With this change there is a slight possibility of losing data if the
OS crashes while results are still only in memory, i.e. before memory
pressure or a reboot forces them out to disk.

-marcel
Paul Rubin
2024-08-28 16:35:42 UTC
Post by mhx
I didn't catch your drift there until I found out why there are no
really fast RAM drives. The fastest drive is no drive at all, and
that is possible by writing the simulation data to a temp file.
Windows has a special attribute for that ( _O_SHORT_LIVED ) and
Linux has shm.
On Linux you can make a ramdisk (use some of your system ram as a file
system) with tmpfs.

mhx
2024-08-18 18:37:02 UTC
Here are the results on a somewhat more modern AMD CPU with
32GB memory and a 7 GB/s SSD.

The scaling is near perfect and much better than expected
(based on experiments a few months ago). The 10% decrease
(7x instead of 8x) for 8 cores might be because 32GB is just a bit
too tight.

Waiting for files to flush is not really necessary now,
but I used 5 seconds for good measure.

This is the same circuit as used on the HP Z840, but with
8 instead of 44 jobs and, to compensate, a 5x longer
time period.

iSPICE> .TICKER-INFO
AMD Ryzen 7 5800X 8-Core Processor
TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
ok

Starting 1 process to run 8 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 40.169 seconds.
waiting 5 seconds for flush to disk . . .

Starting 2 processes to run 8 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 20.157 seconds.
waiting 5 seconds for flush to disk . . .

Starting 4 processes to run 8 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 10.206 seconds.
waiting 5 seconds for flush to disk . . .

Starting 8 processes to run 8 jobs.
Master task (0) ready, waiting for the workers, performing FIX-UP ...
Job `2input-boost/2input-boost.cir` finished in 5.675 seconds.
waiting 5 seconds for flush to disk . . .

% cpus   time [s]   performance ratio
     1    40.240      1
     2    20.232      1.988928
     4    10.283      3.913254
     8     5.750      6.99826  ok

-marcel