Anton Ertl
2024-07-09 08:14:12 UTC
I recently found out that the siev.fs that we use in Gforth is not the
version that Gilbreath published in Byte 6/9 (1981). The version
printed in Byte has one non-standard usage (it seems to have been
written for fig-Forth) and one bug (undefined word). I produced a
fixed version, and then found that Victor H. Yngve has performed the
same changes in FD VII/4, p. 17
<https://www.forth.org/fd/FD-V07N4.pdf>. In any case, here are the
two versions:
\ Gilbreath with fixes:
8190 CONSTANT SIZE
CREATE FLAGS SIZE ALLOT
: DO-PRIME
FLAGS SIZE 1 FILL ( SET ARRAY )
0 ( 0 COUNT ) SIZE 0
DO FLAGS I + C@
IF I DUP + 3 + DUP I +
BEGIN DUP SIZE <
WHILE 0 OVER FLAGS + C! OVER + REPEAT
DROP DROP 1+
THEN
LOOP
. ." PRIMES" ;
\ sieve.fs distributed with Gforth:
DECIMAL
CREATE FLAGS 8190 ALLOT
FLAGS 8190 + CONSTANT EFLAG
: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG FLAGS
DO I C@
IF DUP I + DUP EFLAG <
IF EFLAG SWAP
DO 0 I C! DUP +LOOP
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;
I have attributed this version to Bernd Paysan, but recently I have
seen some version with the same inner loop that was attributed to some
article in Vierte Dimension (but I failed to find that article).
Looking at the inner loop, the latter version seems to be more
efficient at first:
\ Gilbreath
BEGIN DUP SIZE <
WHILE 0 OVER FLAGS + C! OVER + REPEAT
\ sieve.fs
DO 0 I C! DUP +LOOP
However, +LOOP has to work for circular arithmetic and for positive
and negative increments, which complicates both the specification and
the implementation, so is the +LOOP-using version really faster?
Let's see how the code for the inner loops compares:
On VFX64:
Gilbreath: siev.fs
CMP RBX, # 00001FFE MOV RDX, R14
JNL/GE 0050C3CF MOV Byte 0 [RDX], # 00
MOV RDX, RBX ADD R14, RBX
MOV Byte [RBX+0050A300], # 00 ADD R15, RBX
ADD RDX, [RBP] JNO 0050C401
MOV RBX, RDX
JMP 0050C3AF
(siev.fs is a version of sieve.fs with some small changes for
supporting more systems).
So sieve.fs works slightly better on VFX64. Let's see if I can do better:
DECIMAL
CREATE FLAGS 8190 ALLOT
variable eflag
: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG @ FLAGS
DO I C@
IF DUP I + DUP EFLAG @ <
IF EFLAG @ SWAP
begin ( prime limit index )
0 over c! 2 pick + 2dup u<=
until
2drop
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;
The part about EFLAGS being a variable is the main change between
sieve.fs and siev.fs and has nothing to do with the inner loop. The
inner loop is now decompiled as:
MOV Byte 0 [RBX], # 00
ADD RBX, [RBP+08]
CMP RBX, [RBP]
JB/NAE 0050C3C4
That's smaller by one instruction, but that probably does not help on
most modern CPUs which can do 5 instructions/cycle, but only one taken
branch per cycle. Let's see how it performs (with 10000 calls of
DO-PRIME/PRIME):
Invocations:
perf stat vfx64 "include gilbreath-works.4th : bench 10000 0 do do-prime drop loop ; bench bye"
perf stat vfx64 "include /nfs/nfstmp/anton/gforth-amd64/siev.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"
perf stat vfx64 "include sieve-ae.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"
Results:
Cycles Instructions Program
907282940 1956567063 gilbreath-works.4th
689872097 1837620737 siev.fs
724491376 1579623569 sieve-ae.fs
So, while siev.fs needs more instructions, it is slightly faster.
Let's see how this works out with a current development Gforth:
Cycles Instructions Program
1660796257 5123079957 gilbreath-works.4th
1465616414 4020741453 siev.fs
1357163016 4313608319 sieve-ae.fs
Here we have the opposite variant: sieve-ae.fs needs more
instructions, but fewer cycles than siev.fs.
- anton
version that Gilbreath published in Byte 6/9 (1981). The version
printed in Byte has one non-standard usage (it seems to have been
written for fig-Forth) and one bug (undefined word). I produced a
fixed version, and then found that Victor H. Yngve has performed the
same changes in FD VII/4, p. 17
<https://www.forth.org/fd/FD-V07N4.pdf>. In any case, here are the
two versions:
\ Gilbreath with fixes:
8190 CONSTANT SIZE
CREATE FLAGS SIZE ALLOT
: DO-PRIME
FLAGS SIZE 1 FILL ( SET ARRAY )
0 ( 0 COUNT ) SIZE 0
DO FLAGS I + C@
IF I DUP + 3 + DUP I +
BEGIN DUP SIZE <
WHILE 0 OVER FLAGS + C! OVER + REPEAT
DROP DROP 1+
THEN
LOOP
. ." PRIMES" ;
\ sieve.fs distributed with Gforth:
DECIMAL
CREATE FLAGS 8190 ALLOT
FLAGS 8190 + CONSTANT EFLAG
: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG FLAGS
DO I C@
IF DUP I + DUP EFLAG <
IF EFLAG SWAP
DO 0 I C! DUP +LOOP
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;
I have attributed this version to Bernd Paysan, but recently I have
seen some version with the same inner loop that was attributed to some
article in Vierte Dimension (but I failed to find that article).
Looking at the inner loop, the latter version seems to be more
efficient at first:
\ Gilbreath
BEGIN DUP SIZE <
WHILE 0 OVER FLAGS + C! OVER + REPEAT
\ sieve.fs
DO 0 I C! DUP +LOOP
However, +LOOP has to work for circular arithmetic and for positive
and negative increments, which complicates both the specification and
the implementation, so is the +LOOP-using version really faster?
Let's see how the code for the inner loops compares:
On VFX64:
Gilbreath: siev.fs
CMP RBX, # 00001FFE MOV RDX, R14
JNL/GE 0050C3CF MOV Byte 0 [RDX], # 00
MOV RDX, RBX ADD R14, RBX
MOV Byte [RBX+0050A300], # 00 ADD R15, RBX
ADD RDX, [RBP] JNO 0050C401
MOV RBX, RDX
JMP 0050C3AF
(siev.fs is a version of sieve.fs with some small changes for
supporting more systems).
So sieve.fs works slightly better on VFX64. Let's see if I can do better:
DECIMAL
CREATE FLAGS 8190 ALLOT
variable eflag
: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG @ FLAGS
DO I C@
IF DUP I + DUP EFLAG @ <
IF EFLAG @ SWAP
begin ( prime limit index )
0 over c! 2 pick + 2dup u<=
until
2drop
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;
The part about EFLAGS being a variable is the main change between
sieve.fs and siev.fs and has nothing to do with the inner loop. The
inner loop is now decompiled as:
MOV Byte 0 [RBX], # 00
ADD RBX, [RBP+08]
CMP RBX, [RBP]
JB/NAE 0050C3C4
That's smaller by one instruction, but that probably does not help on
most modern CPUs which can do 5 instructions/cycle, but only one taken
branch per cycle. Let's see how it performs (with 10000 calls of
DO-PRIME/PRIME):
Invocations:
perf stat vfx64 "include gilbreath-works.4th : bench 10000 0 do do-prime drop loop ; bench bye"
perf stat vfx64 "include /nfs/nfstmp/anton/gforth-amd64/siev.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"
perf stat vfx64 "include sieve-ae.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"
Results:
Cycles Instructions Program
907282940 1956567063 gilbreath-works.4th
689872097 1837620737 siev.fs
724491376 1579623569 sieve-ae.fs
So, while siev.fs needs more instructions, it is slightly faster.
Let's see how this works out with a current development Gforth:
Cycles Instructions Program
1660796257 5123079957 gilbreath-works.4th
1465616414 4020741453 siev.fs
1357163016 4313608319 sieve-ae.fs
Here we have the opposite variant: sieve-ae.fs needs more
instructions, but fewer cycles than siev.fs.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net