Anton Ertl
2024-09-21 17:25:51 UTC
I recently noticed that Gforth still used the following COMPILE,
implementation for words defined with CREATE...SET-DOES> (and
consequently also for words defined with CREATE...DOES>):
    : does, ( xt -- ) does-check ['] does-xt peephole-compile, , ;
Ignore DOES-CHECK (it has to do with stack-depth checking, still
incomplete). The rest means that it compiles the primitive DOES-XT
with the xt of the COMPILE,d word as immediate argument. DOES-XT
pushes the body of the word and then EXECUTEs the xt that SET-DOES>
has registered for this word. In most cases this is a colon
definition (always if DOES> is used), so the next thing that happens
is DOCOL, and then the code for the colon definition is run.
I have now replaced this with
    : does, ( xt -- ) does-check dup >body lit, >extra @ compile, ;
What this does is to compile the body as a literal, and then it
COMPILE,s the xt that DOES-XT would EXECUTE. In the common case of a
colon definition this compiles a call to the colon definition. This
saves the overhead of accessing the doesfield and of dispatching on
its contents at run-time; all that is now done during compilation.
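
The effect of the change can be imitated by hand in standard Forth. This is only a sketch of the idea, not what Gforth does internally: MYCONST-ACTION, FOO-OLD and FOO-NEW are hypothetical names introduced for illustration, and FOO-OLD merely stands in for "find the action at run-time", it does not reproduce DOES-XT.

```forth
: myconst ( n "name" -- ) create , does> ( -- n ) @ ;
5 myconst five

\ Run-time dispatch: the xt of FIVE is EXECUTEd, and the system has to
\ find the DOES> action through the word itself.
: foo-old ( -- n ) ['] five execute ;

\ Compile-time resolution, written out by hand: the body address becomes
\ a literal, followed by a direct call to the action (here just @).
: myconst-action ( addr -- n ) @ ;
: foo-new ( -- n ) [ ' five >body ] literal myconst-action ;

foo-old . foo-new .   \ prints 5 5
```

Both definitions leave the same value on the stack; the second one corresponds to the LIT ... CALL ... sequence that the new DOES, compiles.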
Let us first look at the generated code. Consider the example:
    : myconst create , does> @ ;
    5 myconst five
    : foo five ;
SIMPLE-SEE FOO shows:
old                             new
$7F6F5CAE6BC8 does-xt 1->1      $7F46A7EA92B8 lit 1->1
$7F6F5CAE6BD0 five              $7F46A7EA92C0 five
$7F6F5CAE6BD8 ;s 1->1 ok        $7F46A7EA92C8 call 1->1
                                $7F46A7EA92D0 $7F46A7C0A168
                                $7F46A7EA92D8 ;s 1->1
For the following microbenchmark:
: d1 ( "name" -- )
    create 0 ,
  does> ( -- addr )
    ; \ yes, an empty DOES> exists in an application program
d1 z1
: bench-z1-comp ( -- )
    iterations 0 ?do
        1 z1 +!
    loop ;
I see the following results per iteration (startup overhead included)
on a Rocket Lake:
 old   new
 8.2   7.5  cycles:u
34.0  29.0  instructions:u
 5.2   4.2  branches:u
So five fewer instructions per iteration (including one branch),
resulting in a small speedup for this microbenchmark.
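
The benchmark uses ITERATIONS without showing its definition. A minimal self-contained version, assuming an arbitrary iteration count (the value used for the measurements is not given), could look like this; the final check only confirms that the loop body ran the expected number of times:

```forth
10000000 constant iterations   \ assumed value, not taken from the measurements

: d1 ( "name" -- )
    create 0 ,
  does> ( -- addr )
    ;
d1 z1

: bench-z1-comp ( -- )
    iterations 0 ?do
        1 z1 +!
    loop ;

bench-z1-comp
z1 @ .   \ prints 10000000
```

The per-iteration counts above would then come from running this under a performance-counter tool and dividing by the iteration count.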
The Gforth image contained 129 occurrences of does-xt, and after the
change it contains 12 (a part of the image is created with the
cross-compiler, which still compiles to DOES-XT). As a result, the
image size and gforth-fast (AMD64) native-code size in bytes are as
follows:
    old     new
2189364 2193264  image
 448291  448659  native-code
The larger image is no surprise. For the 117 replaced does-xts, the
threaded code grows by 2 cells each, and the meta-data grows
correspondingly.
For the native code, the growth is less expected. Let's see how
the code looks:
does-xt                 lit call
add rbx,$10             mov $00[r13],r8
mov $00[r13],r8         sub r13,$08
mov r8,-$08[rbx]        mov r8,$08[rbx]
sub r13,$08             mov rax,$18[rbx]
sub rbx,$08             sub r14,$08
mov rax,-$08[r8]        add rbx,$20
mov rdx,$18[rax]        mov [r14],rbx
mov rax,-$10[rdx]       mov rbx,rax
jmp eax                 mov rax,[rbx]
                        jmp eax
34 bytes                35 bytes
OK, it's one byte larger per occurrence, but that explains only 117 of
the 368 extra bytes. Maybe the interaction with other optimizations
explains the rest.
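
For reference, the size arithmetic can be checked at the Forth prompt. This is just the numbers from the tables above; the cell size of 8 bytes is written out explicitly because the figures are for AMD64, and REPLACED is a name introduced here for illustration:

```forth
129 12 - constant replaced   \ does-xts replaced by LIT ... CALL ...
replaced .                   \ prints 117
2193264 2189364 - .          \ image growth: prints 3900
replaced 2 * 8 * .           \ threaded-code growth, 2 cells each: prints 1872
448659 448291 - .            \ native-code growth: prints 368
replaced 35 34 - * .         \ growth explained by 35 vs. 34 bytes: prints 117
368 117 - .                  \ unexplained native-code growth: prints 251
```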
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net