Anton Ertl
2024-07-11 14:06:02 UTC
At least one Forth system implements DOES> inefficiently, but I
suspect that it's not alone in that. I have reported the issue to the
system vendor, and hopefully they will fix it. Here I discuss the
issue so that possibly other system implementors can avoid these
pitfalls. I do not name the system here to protect the guilty, but
make no additional effort at anonymizing it.
Here's a microbenchmark that puts a spotlight on the issue. I
started investigating because the system performed badly on an
application benchmark that calls DOES>-defined words a lot, so this
microbenchmark is not just contrived.
50000000 constant iterations
: d1 create 0 , does> ;
: d1-part2 ;
d1 x1
d1 y1
: b1 iterations 0 do x1 dup ! y1 dup ! loop ;
: c1 iterations 0 do x1 drop y1 drop loop ;
: b3
   iterations 0 do
      [ ' x1 >body ] literal d1-part2 dup !
      [ ' y1 >body ] literal d1-part2 dup !
   loop ;
: c3
   iterations 0 do
      [ ' x1 >body ] literal d1-part2 drop
      [ ' y1 >body ] literal d1-part2 drop
   loop ;
B1 and C1 use the DOES>-defined words in the way the system
implements them, while B3 and C3 use them in the way the system
should implement them (when COMPILE,ing the xts of X1 and Y1): put
the address of the body on the data stack as a literal and then call
the code behind DOES> (or, in this case, a colon definition that
contains that code).
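To make that concrete, here is a minimal sketch of such an
intelligent COMPILE, (DOES>CODE is a hypothetical accessor that
returns an execution token for the code behind DOES>; where a system
keeps that information is system-specific):

: does-compile, ( xt -- )
  dup >body postpone literal \ compile the body address as a literal
  does>code compile, ;       \ then compile a call to the DOES> code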
Let's see the performance (all numbers from a Tiger Lake), first for
C3/C1:
perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses sf64 "include does-microbench.fs c3 bye"
Performance counter stats for 'sf64 include does-microbench.fs c3 bye':
402963130 cycles:u
804105655 instructions:u # 2.00 insn per cycle
83766 L1-dcache-load-misses
1750283 L1-icache-load-misses
36403 branch-misses
0.114868603 seconds time elapsed
0.106679000 seconds user
0.008206000 seconds sys
The code for the X1 part of the loop here is:
44EB26 -8 [RBP] RBP LEA 488D6DF8
44EB2A RBX 0 [RBP] MOV 48895D00
44EB2E 44E970 # EBX MOV BB70E94400
44EB33 44E94D ( d1-part2 ) CALL E815FEFFFF
44EB38 0 [RBP] RBX MOV 488B5D00
44EB3C 8 [RBP] RBP LEA 488D6D08
and D1-PART2 (an empty colon definition, hence a bare RET) is:
44E94D RET C3
This loop takes 8 cycles per iteration (about half of that for each
simulated DOES>-defined word). Now for C1:
3502930384 cycles:u
903847649 instructions:u # 0.26 insn per cycle
93579 L1-dcache-load-misses
1813286 L1-icache-load-misses
100033846 branch-misses
0.846190766 seconds time elapsed
This loop takes 70 cycles per iteration (i.e., almost 9 times slower)
and has one branch misprediction per DOES>-defined word. Let's see
why:
The code in the loop for the X1 part is:
44EA42 44E96B ( x1 ) CALL E824FFFFFF
44EA47 0 [RBP] RBX MOV 488B5D00
44EA4B 8 [RBP] RBP LEA 488D6D08
It calls X1:
44E96B 44E927 ( d1 +1C ) CALL E8B7FFFFFF
which in turn calls this code:
44E927 -8 [RBP] RBP LEA 488D6DF8
44E92B RBX 0 [RBP] MOV 48895D00
44E92F RBX POP 5B
44E930 RET C3
Here the return address of the call at 44E96B is popped and used as
the body address of X1. However, the hardware return-address stack
predicts that the following RET returns to that return address, which
is a misprediction: the RET actually returns to the return address of
the outer call (44EA47). You can see here that mispredictions are
expensive.
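To illustrate the mechanism, here is a toy model of the
return-address stack (RAS): a plain 16-entry circular buffer, which
is all this demo needs; real predictors are more elaborate. The
addresses are the ones from the disassembly above.

create ras 16 cells allot
variable ras#  0 ras# !
: ras-push ( ret-addr -- )    \ what a CALL does to the predictor
  ras# @ 15 and cells ras + !  1 ras# +! ;
: ras-predict ( -- ret-addr ) \ what a RET asks of the predictor
  -1 ras# +!  ras# @ 15 and cells ras + @ ;
hex
44EA47 ras-push  \ the CALL to X1 at 44EA42 pushes its return address
44E970 ras-push  \ the CALL at 44E96B inside X1 pushes 44E970
\ POP RBX removes 44E970 from the memory return stack, but not from
\ the RAS, so the RET's predicted target is:
ras-predict u.   \ prints 44E970; the RET actually returns to 44EA47
decimal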
Let's turn to B1. The code for the X1 part of the loop is:
44E9CE 44E96B ( x1 ) CALL E898FFFFFF
44E9D3 -8 [RBP] RBP LEA 488D6DF8
44E9D7 RBX 0 [RBP] MOV 48895D00
44E9DB 0 [RBP] RAX MOV 488B4500
44E9DF RAX 0 [RBX] MOV 488903
44E9E2 8 [RBP] RBX MOV 488B5D08
44E9E6 10 [RBP] RBP LEA 488D6D10
The results are:
44758764805 cycles:u
1303768694 instructions:u # 0.03 insn per cycle
150154275 L1-dcache-load-misses
202142121 L1-icache-load-misses
100051173 branch-misses
10.699859443 seconds time elapsed
10.696626000 seconds user
0.004000000 seconds sys
So in addition to the branch mispredictions that we also saw in C1,
we see 3 D-cache misses and 4 I-cache misses per iteration, resulting
in 895 cycles per iteration. Part of this cache ping-pong arises
because the CALL instruction at 44E96B is in the same cache line as
the stored-to data at 44E970: the store wants the cache line in the
D-cache, while fetching the CALL wants it in the I-cache. This is an
issue I first pointed out in 1995, and Forth systems still have not
fixed it completely 29 years later.
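As a quick sanity check, two addresses fall into the same 64-byte
cache line iff they agree in all but the low 6 address bits
(SAME-LINE? is just an ad-hoc helper for this post):

: same-line? ( addr1 addr2 -- f ) 6 rshift swap 6 rshift = ;
hex 44E96B 44E970 same-line? . decimal \ prints -1 (true)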
Let's see if the B3 variant escapes this problem:
20590473606 cycles:u
1204106872 instructions:u # 0.06 insn per cycle
50145367 L1-dcache-load-misses
101982668 L1-icache-load-misses
49127 branch-misses
4.932825183 seconds time elapsed
4.926391000 seconds user
0.003998000 seconds sys
It's better, but there is still one D-cache miss and 2 I-cache misses
per iteration. It seems that the distance between the D1-PART2 code
and X1 data is not enough to completely avoid the issue (but this is
just guessing).
In any case, having the call right before the data and executing it
on every call to a DOES>-defined word is responsible for 2 I-cache
misses and 2 D-cache misses per iteration, so it should be avoided by
generating code like that in C3 and B3.
For dealing with the remaining cache-consistency problems, most
Forth systems have chosen to put padding between code and data, and
to increase the padding in response to slowness reports. Another
alternative is to put the code in a memory region separate from the
data.
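As a minimal sketch of the padding approach, assuming 64-byte cache
lines and a single contiguous code/data dictionary (a real system
would apply this inside CREATE rather than leaving it to the user):

64 constant cache-line
: line-pad ( -- ) \ advance HERE to the next cache-line boundary
  here cache-line 1- and   \ offset of HERE within its cache line
  dup if cache-line swap - allot else drop then ;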
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net