Revisiting Gilbreath's sieve

Discussion:

(too old to reply)

Anton Ertl

2024-07-09 08:14:12 UTC

I recently found out that the siev.fs that we use in Gforth is not the
version that Gilbreath published in Byte 6/9 (1981). The version
printed in Byte has one non-standard usage (it seems to have been
written for fig-Forth) and one bug (undefined word). I produced a
fixed version, and then found that Victor H. Yngve has performed the
same changes in FD VII/4, p. 17
<https://www.forth.org/fd/FD-V07N4.pdf>. In any case, here are the
two versions:

\ Gilbreath with fixes:
8190 CONSTANT SIZE
CREATE FLAGS SIZE ALLOT

: DO-PRIME
FLAGS SIZE 1 FILL ( SET ARRAY )
0 ( 0 COUNT ) SIZE 0
DO FLAGS I + C@
IF I DUP + 3 + DUP I +
BEGIN DUP SIZE <
WHILE 0 OVER FLAGS + C! OVER + REPEAT
DROP DROP 1+
THEN
LOOP
. ." PRIMES" ;

\ sieve.fs distributed with Gforth:
DECIMAL
CREATE FLAGS 8190 ALLOT
FLAGS 8190 + CONSTANT EFLAG

: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG FLAGS
DO I C@
IF DUP I + DUP EFLAG <
IF EFLAG SWAP
DO 0 I C! DUP +LOOP
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;

I have attributed this version to Bernd Paysan, but recently I have
seen some version with the same inner loop that was attributed to some
article in Vierte Dimension (but I failed to find that article).

Looking at the inner loop, the latter version seems to be more
efficient at first:

\ Gilbreath
BEGIN DUP SIZE <
WHILE 0 OVER FLAGS + C! OVER + REPEAT

\ sieve.fs
DO 0 I C! DUP +LOOP

However, +LOOP has to work for circular arithmetic and for positive
and negative increments, which complicates both the specification and
the implementation, so is the +LOOP-using version really faster?
Let's see how the code for the inner loops compares:

On VFX64:

Gilbreath: siev.fs
CMP RBX, # 00001FFE MOV RDX, R14
JNL/GE 0050C3CF MOV Byte 0 [RDX], # 00
MOV RDX, RBX ADD R14, RBX
MOV Byte [RBX+0050A300], # 00 ADD R15, RBX
ADD RDX, [RBP] JNO 0050C401
MOV RBX, RDX
JMP 0050C3AF

(siev.fs is a version of sieve.fs with some small changes for
supporting more systems).

So sieve.fs works slightly better on VFX64. Let's see if I can do better:

DECIMAL
CREATE FLAGS 8190 ALLOT
variable eflag

: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG @ FLAGS
DO I C@
IF DUP I + DUP EFLAG @ <
IF EFLAG @ SWAP
begin ( prime limit index )
0 over c! 2 pick + 2dup u<=
until
2drop
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;

The part about EFLAGS being a variable is the main change between
sieve.fs and siev.fs and has nothing to do with the inner loop. The
inner loop is now decompiled as:

MOV Byte 0 [RBX], # 00
ADD RBX, [RBP+08]
CMP RBX, [RBP]
JB/NAE 0050C3C4

That's smaller by one instruction, but that probably does not help on
most modern CPUs which can do 5 instructions/cycle, but only one taken
branch per cycle. Let's see how it performs (with 10000 calls of
DO-PRIME/PRIME):

Invocations:

perf stat vfx64 "include gilbreath-works.4th : bench 10000 0 do do-prime drop loop ; bench bye"
perf stat vfx64 "include /nfs/nfstmp/anton/gforth-amd64/siev.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"
perf stat vfx64 "include sieve-ae.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"

Results:
Cycles Instructions Program
907282940 1956567063 gilbreath-works.4th
689872097 1837620737 siev.fs
724491376 1579623569 sieve-ae.fs

So, while siev.fs needs more instructions, it is slightly faster.

Let's see how this works out with a current development Gforth:

Cycles Instructions Program
1660796257 5123079957 gilbreath-works.4th
1465616414 4020741453 siev.fs
1357163016 4313608319 sieve-ae.fs

Here we have the opposite variant: sieve-ae.fs needs more
instructions, but fewer cycles than siev.fs.

- anton

--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net

Anton Ertl

2024-07-09 10:12:40 UTC

Permalink

Post by Anton Ertl
DECIMAL
CREATE FLAGS 8190 ALLOT
variable eflag
begin ( prime limit index )
0 over c! 2 pick + 2dup u<=
until
2drop
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;
The part about EFLAGS being a variable is the main change between
sieve.fs and siev.fs and has nothing to do with the inner loop. The
MOV Byte 0 [RBX], # 00
ADD RBX, [RBP+08]
CMP RBX, [RBP]
JB/NAE 0050C3C4
That's smaller by one instruction, but that probably does not help on
most modern CPUs which can do 5 instructions/cycle, but only one taken
branch per cycle.

By unrolling the loop by a factor of 2 we can convert every second
branch in the inner loop into a not-taken branch:

: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG @ FLAGS
DO I C@
IF DUP I + DUP EFLAG @ <
IF EFLAG @ SWAP
begin ( prime limit index )
0 over c! 2 pick + 2dup u> while
0 over c! 2 pick + 2dup u<= until then
2drop
ELSE DROP THEN SWAP 1+ SWAP
THEN 2 +
LOOP DROP ;

Post by Anton Ertl
perf stat vfx64 "include gilbreath-works.4th : bench 10000 0 do do-prime drop loop ; bench bye"
perf stat vfx64 "include /nfs/nfstmp/anton/gforth-amd64/siev.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"
perf stat vfx64 "include sieve-ae.fs flags 8190 + eflag ! : bench 10000 0 do primes drop loop ; bench bye"

VFX64 Results (on a Rocket Lake):
Cycles Instructions Program
907282940 1956567063 gilbreath-works.4th
689872097 1837620737 siev.fs
724491376 1579623569 sieve-ae.fs before unrolling
633416709 1579649480 sieve-ae.fs with unrolling

development gforth-fast results on a Rocket Lake:
Cycles Instructions Program
1660796257 5123079957 gilbreath-works.4th
1465616414 4020741453 siev.fs
1357163016 4313608319 sieve-ae.fs before unrolling
1258628834 4465785819 sieve-ae.fs with unrolling

So, with unrolling by a factor of 2 the new code is faster than the
+LOOP-based code on both VFX64 and gforth-fast.

- anton

mhx

2024-07-09 12:34:47 UTC

Permalink

On Tue, 9 Jul 2024 10:12:40 +0000, Anton Ertl wrote:

[..]

Post by Anton Ertl

Post by Anton Ertl
That's smaller by one instruction, but that probably does not help on
most modern CPUs which can do 5 instructions/cycle, but only one taken
branch per cycle.

[..]

Post by Anton Ertl
By unrolling the loop by a factor of 2 we can convert every second

Interesting. Shouldn't unrolling the BEGIN ... UNTIL loop help even
more then? Maybe make the FLAGS array (much) bigger and don't check
for out of bounds?

-marcel

Anton Ertl

2024-07-09 15:10:27 UTC

Permalink

Post by mhx
[..]

Post by Anton Ertl

Post by Anton Ertl
That's smaller by one instruction, but that probably does not help on
most modern CPUs which can do 5 instructions/cycle, but only one taken
branch per cycle.

[..]

Post by Anton Ertl
By unrolling the loop by a factor of 2 we can convert every second

Interesting. Shouldn't unrolling the BEGIN ... UNTIL loop help even
more then? Maybe make the FLAGS array (much) bigger and don't check
for out of bounds?

What I did above is just unrolling without optimizing anything between
the iterations, to only convert half of the taken branches to
fall-through branches.

Of course it's possible to use unrolling to optimize more. One way to
do it is to prepare for superfluous stores (as you suggest), another
is to terminate the unrolled loop up to n-1 iterations early, and then
perform the other iterations with a rolled loop.

VFX64 Results (on a Rocket Lake):
Cycles Instructions Program
907282940 1956567063 gilbreath-works.4th
689872097 1837620737 siev.fs
724491376 1579623569 sieve-ae.fs before unrolling
633416709 1579649480 sieve-ae.fs with unrolling
570292058 1443573951 sieve-ae-unroll1.fs additional stores factor 2
548018536 1405060098 sieve-ae-unroll1.fs additional stores factor 3
502384416 1392124606 sieve-ae-unroll1.fs additional stores factor 4
640881327 1672778754 sieve-ae-unroll-early2.fs factor 2
538351055 1564539570 sieve-ae-unroll-early.fs factor 3
633510703 1454646171 sieve-ae-unroll-early5.fs factor 5

development gforth-fast results on a Rocket Lake:
Cycles Instructions Program
1660796257 5123079957 gilbreath-works.4th
1465616414 4020741453 siev.fs
1357163016 4313608319 sieve-ae.fs before unrolling
1258628834 4465785819 sieve-ae.fs with unrolling
1140289429 3603230471 sieve-ae-unroll1.fs additional stores factor 2
1022706250 3393279480 sieve-ae-unroll1.fs additional stores factor 3
1075651583 3371104325 sieve-ae-unroll1.fs additional stores factor 4
1284447140 3937437544 sieve-ae-unroll-early2.fs factor 2
1163813270 3653564554 sieve-ae-unroll-early.fs factor 3
1352573850 3478545033 sieve-ae-unroll-early5.fs factor 5

The VFX64 numbers for the additional stores shows significant
variations in the cycles, apparently as the result of even bigger
variations in the number of branch mispredictions. In the above I
only show the result of one of the faster runs.

Code:

E.g., additional stores factor 4:

CREATE FLAGS 8190 10 * ALLOT
variable eflag
\ FLAGS 8190 + CONSTANT EFLAG
flags 8190 + eflag !

: PRIMES ( -- n )
FLAGS 8190 1 FILL
0 3 EFLAG @ FLAGS DO ( u1 prime-candidate )
I C@ IF ( u1 prime )
DUP I + DUP EFLAG @ < IF ( u1 prime addr )
EFLAG @ SWAP begin ( prime limit index )
0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick + 2dup u<= until
2drop
ELSE
DROP
THEN
SWAP 1+ SWAP
THEN
2 +
LOOP
DROP ;

Early termination factor 3:

: PRIMES ( -- n )
FLAGS 8190 1 FILL
0 3 EFLAG @ FLAGS DO ( counter prime-candidate )
I C@ IF ( counter prime )
DUP I + DUP 2 pick 2* + EFLAG @ u< IF ( counter prime addr )
EFLAG @ SWAP begin ( counter prime limit index )
\ 0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick +
over 3 pick 2* 2 pick + u<= until ( counter prime limit index )
begin
2dup u> while
0 over c! 2 pick +
repeat
2drop
ELSE
DROP
THEN
SWAP 1+ SWAP
THEN
2 +
LOOP
DROP ;

- anton

mhx

2024-07-09 19:39:04 UTC

Permalink

About 3% faster, 41,662 CPU cycles for a single iteration.

-- Ideas: M. Anton Ertl
http://www.complang.tuwien.ac.at/anton/home.html

ANEW -sieves

#100000 =: #times
#8190 =: size

CREATE FLAGS size 10 * ALLOT
variable eflag
flags size + eflag !

-- iForth
: DO-PRIME 0 LOCAL cnt
flags size 1 FILL
flags
size 0 DO
C@+ IF
I 2* 3 +
DUP I + SWAP >R
BEGIN DUP size
U< WHILE DUP flags + C0!
R@ +
REPEAT DROP
-R 1 +TO cnt
ENDIF
LOOP
DROP cnt ;

: PRIMES-as4 ( -- n )
FLAGS 8190 1 FILL
0 3 EFLAG @ FLAGS DO ( u1 prime-candidate )
I C@ IF ( u1 prime )
DUP I + DUP EFLAG @ < IF ( u1 prime addr )
EFLAG @ SWAP begin ( prime limit index )
0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick + 2dup u<= until
2drop
ELSE
DROP
THEN
SWAP 1+ SWAP
THEN
2 +
LOOP
DROP ;

: PRIMES-et3 ( -- n )
FLAGS 8190 1 FILL
0 3 EFLAG @ FLAGS DO ( counter prime-candidate )
I C@ IF ( counter prime )
DUP I + DUP 2 pick 2* + EFLAG @ u< IF ( counter prime addr )
EFLAG @ SWAP begin ( counter prime limit index )
\ 0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick +
0 over c! 2 pick +
over 3 pick 2* 2 pick + u<= until ( counter prime limit
index )
begin
2dup u> while
0 over c! 2 pick +
repeat
2drop
ELSE
DROP
THEN
SWAP 1+ SWAP
THEN
2 +
LOOP
DROP ;

: .DIFF ( d -- ) TICKS-GET 2SWAP D- (n,3) ." clock ticks." ;

: PRIMES
CR #times DEC. ." iterations."
CR ." \ do-prime : " TICKS-GET #times 0 DO DO-PRIME DROP LOOP
DIFF
CR ." \ primes-as4 : " TICKS-GET #times 0 DO PRIMES-as4 DROP LOOP
DIFF
CR ." \ primes-et3 : " TICKS-GET #times 0 DO PRIMES-et3 DROP LOOP
DIFF ;

FORTH> PRIMES
100000 iterations.
\ do-prime : 4,312,568,627 ticks.
\ primes-as4 : 4,167,558,125 ticks.
\ primes-et3 : 4,540,875,427 ticks. ok

FORTH> see primes-as4
Flags: ANSI
$01356200 : PRIMES-as4
$0135620A push $01341CC0 d#
$0135620F push $00001FFE d#
$01356214 push 1 b#
$01356216 lea rbp, [rbp -8 +] qword
$0135621A mov [rbp 0 +] qword, $01356227 d#
$01356222 jmp FILL+10 ( $0124179A ) offset NEAR
$01356227 push 0 b#
$01356229 push 3 b#
$0135622B push $01355CC0 qword-offset
$01356231 mov rbx, $01341CC0 d#
$01356238 pop rcx
$01356239 call (DO) offset NEAR
$01356243 lea rax, [rax 0 +] qword
$01356248 mov rdi, [rbp 0 +] qword
$0135624C cmp [rdi] byte, 0 b#
$0135624F je $013562DC offset NEAR
$01356255 mov rdi, [rbp 0 +] qword
$01356259 lea rax, [rbx rdi*1] qword
$0135625D mov rcx, $01355CC0 qword-offset
$01356264 cmp rcx, rax
$01356267 push rbx
$01356268 mov rbx, rcx
$0135626B mov rcx, rax
$0135626E mov rbx, rcx
$01356271 jle $013562D5 offset NEAR
$01356277 push $01355CC0 qword-offset
$0135627D mov rax, rax
$01356280 mov [rbx] byte, 0 b#
$01356283 mov rcx, rbx
$01356286 mov rbx, [rsp 8 +] qword
$0135628B mov [rcx rbx*1] byte, 0 b#
$0135628F lea rbx, [rcx rbx*1] qword
$01356293 mov rcx, rbx
$01356296 mov rbx, [rsp 8 +] qword
$0135629B mov [rcx rbx*1] byte, 0 b#
$0135629F lea rbx, [rcx rbx*1] qword
$013562A3 mov rcx, rbx
$013562A6 mov rbx, [rsp 8 +] qword
$013562AB mov [rcx rbx*1] byte, 0 b#
$013562AF lea rbx, [rcx rbx*1] qword
$013562B3 mov rcx, rbx
$013562B6 mov rbx, [rsp 8 +] qword
$013562BB pop rdi
$013562BC lea rax, [rcx rbx*1] qword
$013562C0 cmp rax, rdi
$013562C3 push rdi
$013562C4 mov rcx, rax
$013562C7 push rcx
$013562C8 pop rbx
$013562C9 jb $01356280 offset NEAR
$013562CF pop rdi
$013562D0 jmp $013562D5 offset NEAR
$013562D5 pop rbx
$013562D6 pop rdi
$013562D7 lea rdi, [rdi 1 +] qword
$013562DB push rdi
$013562DC lea rbx, [rbx 2 +] qword
$013562E0 add [rbp 0 +] qword, 1 b#
$013562E5 add [rbp 8 +] qword, 1 b#
$013562EA jno $01356248 offset NEAR
$013562F0 add rbp, #24 b#
$013562F4 ;

-marcel