Discussion:
Another look at Gforth's locals implementation
(too old to reply)
Anton Ertl
2024-04-09 15:59:58 UTC
Permalink
30 years ago I wrote a paper about Gforth's locals and also gave some
performance results [ertl94l]. Since then the implementation of
Gforth, and also of locals has changed quite a bit; in particular,
recently we have improved the code generated for TO <local>.

So here I look at it again. As a first example, I look at the fib
benchmark (which has a one-off error, but that makes little difference):

no locals with a local

: fib ( n1 -- n2 ) : fib { n -- n2 }
dup 2 < if n 2 < if
drop 1 1
else else
dup n 1- recurse
1- recurse n 2 - recurse
swap 2 - recurse +
+ then ;
then ;

For 43 fib the performance results (AMD64 instructions, Skylake
cycles) are:

no locals with a local factor
13_805_561_653 15_012_002_369 1.09
41_475_949_910 48_490_119_217 1.17

So those of you who dislike locals can still point to worse
performance of code with locals on Gforth.

Another example is string comparison, where I compared three versions:

\ no locals
: strcmp1 ( addr1 u1 addr2 u2 -- n )
rot 2dup 2>r min 0 ?do ( a1 a2 )
over c@ over c@ - dup if
nip nip 2rdrop unloop exit then
drop
char+ swap char+ swap
loop
2drop r> r> - ;

\ locals with TO
: strcmp2
{ addr1 u1 addr2 u2 -- n }
u1 u2 min 0
?do
addr1 c@ addr2 c@ - dup if
unloop exit then
drop
addr1 char+ TO addr1
addr2 char+ TO addr2
loop
u1 u2 - ;

\ locals without TO
: strcmp3
{ addr1 u1 addr2 u2 -- n }
addr1 addr2
u1 u2 min 0
?do { s1 s2 }
s1 c@ s2 c@ - dup if
unloop exit then
drop
s1 char+ s2 char+
loop
2drop
u1 u2 - ;

I am too lazy to benchmark this, but here's the number of
native-instruction bytes in the inner loop:

129 strcmp1
128 strcmp2
149 strcmp3

Here's the code for body of the loop:

strcmp1 strcmp2 strcmp3
over 1->1 @local0 1->1 >l 1->1
mov $00[r13],r8 mov $00[r13],r8 mov rax,rbp
sub r13,$08 sub r13,$08 add r13,$08
mov r8,$10[r13] mov r8,$00[rbp] lea rbp,-$08[rbp]
c@ 1->1 c@ 1->1 mov -$08[rax],r8
movzx r8d,bytePTR[r8] movzx r8d,bytePTR[r8] mov r8,$00[r13]
over 1->2 @local2 1->2 >l @local0 1->1
mov r15,$08[r13] mov r15,$10[rbp] @local0
c@ 2->2 c@ 2->2 mov rax,rbp
movzx r15d,bytePTR[r15] movzx r15d,bytePTR[r15] lea rbp,-$08[rbp]
- 2->1 - 2->1 mov -$08[rax],r8
sub r8,r15 sub r8,r15 c@ 1->1
dup 1->2 dup 1->2 movzx r8d,bytePTR[r8]
mov r15,r8 mov r15,r8 @local1 1->2
?branch 2->1 ?branch 2->1 mov r15,$08[rbp]
<strcmp1+$A8> <strcmp2+$C0> c@ 2->2
add rbx,$68 add rbx,$68 movzx r15d,bytePTR[r15]
mov rax,[rbx] mov rax,[rbx] - 2->1
test r15,r15 test r15,r15 sub r8,r15
jnz $7FED3DB2C317 jnz $7FED3DB2C0DC dup 1->2
jmp eax jmp eax mov r15,r8
nip 1->1 unloop 1->1 ?branch 2->1
add r13,$08 add r14,$10 <strcmp3+$E0>
nip 1->1 lit 1->2 add rbx,$78
add r13,$08 #32 mov rax,[rbx]
2rdrop 1->1 sub rbx,$10 test r15,r15
add r14,$10 mov r15,-$08[rbx] jnz $7FED3DB2C208
unloop 1->1 lp+! 2->1 jmp eax
add r14,$10 add rbp,r15 unloop 1->1
;s 1->1 ;s 1->1 add r14,$10
mov rbx,[r14] mov rbx,[r14] lit 1->2
add r14,$08 add r14,$08 #48
mov rax,[rbx] mov rax,[rbx] sub rbx,$10
jmp eax jmp eax mov r15,-$08[rbx]
drop 1->1 drop 1->1 lp+! 2->1
mov r8,$08[r13] mov r8,$08[r13] add rbp,r15
add r13,$08 add r13,$08 ;s 1->1
char+ 1->1 @local0 1->2 mov rbx,[r14]
add r8,$01 mov r15,$00[rbp] add r14,$08
swap 1->2 char+ 2->2 mov rax,[rbx]
mov r15,$08[r13] add r15,$01 jmp eax
add r13,$08 !local0 2->1 drop 1->0
char+ 2->2 mov $00[rbp],r15 @local0 0->1
add r15,$01 @local2 1->2 mov r8,$00[rbp]
swap 2->1 mov r15,$10[rbp] char+ 1->1
mov $00[r13],r15 char+ 2->2 add r8,$01
sub r13,$08 add r15,$01 @local1 1->1
(loop) 1->1 !local2 2->1 mov $00[r13],r8
<strcmp1+$40> mov $10[rbp],r15 sub r13,$08
sub rbx,$68 (loop) 1->1 mov r8,$08[rbp]
mov rax,[r14] <strcmp2+$58> char+ 1->1
add rax,$01 sub rbx,$68 add r8,$01
cmp $08[r14],rax mov rax,[r14] (loop)-lp+!# 1->1
mov [r14],rax add rax,$01 <strcmp3+$68>
jz $7FED3DB2C36C cmp $08[r14],rax #16
mov rax,[rbx] mov [r14],rax add rbx,$40
jmp eax jz $7FED3DB2C130 mov rax,[r14]
mov rax,[rbx] mov rsi,-$10[rbx]
jmp eax add rax,$01
cmp $08[r14],rax
jz $7FED3DB2C25F
add rbp,-$08[rbx]
mov [r14],rax
mov rbx,rsi
mov rax,[rsi]
jmp eax

@InProceedings{ertl94l,
author = "M. Anton Ertl",
title = "Automatic Scoping of Local Variables",
booktitle = "EuroForth~'94 Conference Proceedings",
year = "1994",
address = "Winchester, UK",
pages = "31--37",
url = "http://www.complang.tuwien.ac.at/papers/ertl94l.ps.gz",
abstract = "In the process of lifting the restrictions on using
locals in Forth, an interesting problem poses
itself: What does it mean if a local is defined in a
control structure? Where is the local visible? Since
the user can create every possible control structure
in ANS Forth, the answer is not as simple as it may
seem. Ideally, the local is visible at a place if
the control flow {\em must} pass through the
definition of the local to reach this place. This
paper discusses locals in general, the visibility
problem, its solution, the consequences and the
implementation as well as related programming style
questions."
}

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
minforth
2024-04-09 19:43:55 UTC
Permalink
This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.

The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.

With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
mhx
2024-04-09 20:54:47 UTC
Permalink
Post by minforth
This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.
The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.
With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
Even without putting them in registers it is faster (Ryzen 7 5800X):

: bench
cr timer-reset #43 fib .elapsed ." ( " . ." )"
cr timer-reset #43 fib2 .elapsed ." ( " . ." )" ;

FORTH> bench
1.658 seconds elapsed. ( 701408733 )
1.564 seconds elapsed. ( 701408733 ) ok

Here is fib2:
: fib2 params| n |
n 2 < if 1
else n 1- recurse
n 2- recurse
+
endif ;

FORTH> ' fib2 idis
$014588C0 : fib2
$014588CA pop rbx
$014588CB cmp rbx, 2 b#
$014588CF lea rsi, [rsi #-16 +] qword
$014588D3 mov [rsi] qword, rbx
$014588D6 mov rbx, rcx
$014588D9 jge $014588EB offset NEAR
$014588DF mov rbx, 1 d#
$014588E6 jmp $01458921 offset NEAR
$014588EB mov rbx, [rsi] qword
$014588EE lea rbx, [rbx -1 +] qword
$014588F2 lea rbp, [rbp -8 +] qword
$014588F6 mov [rbp 0 +] qword, $01458903 d#
$014588FE jmp $014588CB offset NEAR
$01458903 mov rbx, [rsi] qword
$01458906 lea rbx, [rbx -2 +] qword
$0145890A lea rbp, [rbp -8 +] qword
$0145890E mov [rbp 0 +] qword, $0145891B d#
$01458916 jmp $014588CB offset NEAR
$0145891B pop rbx
$0145891C pop rdi
$0145891D lea rbx, [rdi rbx*1] qword
$01458921 push rbx
$01458922 lea rsi, [rsi #16 +] qword
$01458926 ;

-marcel
dxf
2024-04-10 03:00:48 UTC
Permalink
Post by mhx
Post by minforth
This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.
The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.
With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
: bench
  cr timer-reset #43 fib  .elapsed ."  ( " . ." )"
  cr timer-reset #43 fib2 .elapsed ."  ( " . ." )" ;
FORTH> bench
1.658 seconds elapsed. ( 701408733 )
1.564 seconds elapsed. ( 701408733 ) ok
: fib2 params| n |        n 2 < if 1           else n 1- recurse                n 2- recurse                +          endif ;
FORTH> ' fib2 idis
$014588C0  : fib2
$014588CA  pop           rbx
$014588CB  cmp           rbx, 2 b#
$014588CF  lea           rsi, [rsi #-16 +] qword
$014588D3  mov           [rsi] qword, rbx
$014588D6  mov           rbx, rcx
$014588D9  jge           $014588EB offset NEAR
$014588DF  mov           rbx, 1 d#
$014588E6  jmp           $01458921 offset NEAR
$014588EB  mov           rbx, [rsi] qword
$014588EE  lea           rbx, [rbx -1 +] qword
$014588F2  lea           rbp, [rbp -8 +] qword
$014588F6  mov           [rbp 0 +] qword, $01458903 d#
$014588FE  jmp           $014588CB offset NEAR
$01458903  mov           rbx, [rsi] qword
$01458906  lea           rbx, [rbx -2 +] qword
$0145890A  lea           rbp, [rbp -8 +] qword
$0145890E  mov           [rbp 0 +] qword, $0145891B d#
$01458916  jmp           $014588CB offset NEAR
$0145891B  pop           rbx
$0145891C  pop           rdi
$0145891D  lea           rbx, [rdi rbx*1] qword
$01458921  push          rbx
$01458922  lea           rsi, [rsi #16 +] qword
$01458926  ;
VFX doesn't put much effort into optimizing locals code. Should they have?
Or would that have encouraged users to write code that lacks thought and
leave it to the compiler to make efficient? Which is the C approach.
mhx
2024-04-10 05:25:48 UTC
Permalink
[..]
Post by dxf
VFX doesn't put much effort into optimizing locals code. Should they have?
Or would that have encouraged users to write code that lacks thought and
leave it to the compiler to make efficient? Which is the C approach.
I wouldn't know what those authors think, believe, or try to enforce.

There has been no effort spent in optimizing iForth's locals. The shown
result is simply a byproduct of the architecture of the compiler.

It is not my task to guide the users, they can do whatever they ask.
I believe my actions are consistent with that.

Locals are completely OK if you want to get stuff done. I use them for
the boring parts.

-marcel
dxf
2024-04-10 05:45:06 UTC
Permalink
Post by mhx
[..]
VFX doesn't put much effort into optimizing locals code.  Should they have?
Or would that have encouraged users to write code that lacks thought and
leave it to the compiler to make efficient?  Which is the C approach.
I wouldn't know what those authors think, believe, or try to enforce.
There has been no effort spent in optimizing iForth's locals. The shown
result is simply a byproduct of the architecture of the compiler.
...
Do you have a disassembly for Fib1? I ask because NT/Forth which purports
to do exactly that produces notably different results - despite there being
little 'stack juggling' to resolve.

NT/FORTH (C) 2005 Peter Fälth Version 1.6-983-824 Compiled on 2017-12-03

see fib1
A49E58 40917C 58 C80000 5 normal FIB1

40917C 83FB02 cmp ebx , # 2h
40917F 0F8D0A000000 jge "0040918F"
409185 BB01000000 mov ebx , # 1h
40918A E926000000 jmp "004091B5"
40918F 8BC3 mov eax , ebx
409191 48 dec eax
409192 895DFC mov [ebp-4h] , ebx
409195 8BD8 mov ebx , eax
409197 8D6DFC lea ebp , [ebp-4h]
40919A E8DDFFFFFF call FIB1
40919F 8B4500 mov eax , [ebp]
4091A2 83E802 sub eax , # 2h
4091A5 895D00 mov [ebp] , ebx
4091A8 8BD8 mov ebx , eax
4091AA E8CDFFFFFF call FIB1
4091AF 035D00 add ebx , [ebp]
4091B2 8D6D04 lea ebp , [ebp+4h]
4091B5 C3 ret near

see fib2
A49E70 4091B6 86 C80000 5 normal FIB2

4091B6 83FB02 cmp ebx , # 2h
4091B9 895C24FC mov [esp-4h] , ebx
4091BD 8B5D00 mov ebx , [ebp]
4091C0 8D6D04 lea ebp , [ebp+4h]
4091C3 8D6424FC lea esp , [esp-4h]
4091C7 0F8D10000000 jge "004091DD"
4091CD 895DFC mov [ebp-4h] , ebx
4091D0 BB01000000 mov ebx , # 1h
4091D5 8D6DFC lea ebp , [ebp-4h]
4091D8 E92A000000 jmp "00409207"
4091DD 8B0424 mov eax , [esp]
4091E0 48 dec eax
4091E1 895DFC mov [ebp-4h] , ebx
4091E4 8BD8 mov ebx , eax
4091E6 8D6DFC lea ebp , [ebp-4h]
4091E9 E8C8FFFFFF call FIB2
4091EE 8B0424 mov eax , [esp]
4091F1 83E802 sub eax , # 2h
4091F4 895DFC mov [ebp-4h] , ebx
4091F7 8BD8 mov ebx , eax
4091F9 8D6DFC lea ebp , [ebp-4h]
4091FC E8B5FFFFFF call FIB2
409201 035D00 add ebx , [ebp]
409204 8D6D04 lea ebp , [ebp+4h]
409207 8D642404 lea esp , [esp+4h]
40920B C3 ret near
mhx
2024-04-10 06:58:17 UTC
Permalink
Post by dxf
Do you have a disassembly for Fib1? I ask because NT/Forth which purports
to do exactly that produces notably different results - despite there being
little 'stack juggling' to resolve.
Here it is:

FORTH> : fib ( n1 -- n2 ) dup 2 < if drop 1 else dup 1- recurse swap 2 - recurse + then ; ok
FORTH> 10 fib . 89 ok
FORTH> see fib
Flags: ANSI
$01340A40 : fib
$01340A4A pop rbx
$01340A4B cmp rbx, 2 b#
$01340A4F jge $01340A61 offset NEAR
$01340A55 mov rbx, 1 d#
$01340A5C jmp $01340A95 offset NEAR
$01340A61 push rbx
$01340A62 lea rbx, [rbx -1 +] qword
$01340A66 lea rbp, [rbp -8 +] qword
$01340A6A mov [rbp 0 +] qword, $01340A77 d#
$01340A72 jmp $01340A4B offset NEAR
$01340A77 pop rbx
$01340A78 pop rdi
$01340A79 push rbx
$01340A7A lea rbx, [rdi -2 +] qword
$01340A7E lea rbp, [rbp -8 +] qword
$01340A82 mov [rbp 0 +] qword, $01340A8F d#
$01340A8A jmp $01340A4B offset NEAR
$01340A8F pop rbx
$01340A90 pop rdi
$01340A91 lea rbx, [rdi rbx*1] qword
$01340A95 push rbx
$01340A96 ;

-marcel
minforth
2024-04-10 07:34:38 UTC
Permalink
Post by mhx
I wouldn't know what those authors think, believe, or try to enforce.
There has been no effort spent in optimizing iForth's locals. The shown
result is simply a byproduct of the architecture of the compiler.
It is not my task to guide the users, they can do whatever they ask.
I believe my actions are consistent with that.
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.

In such a case, you first rethink the task and the algorithm and,
if necessary, write a few lines in assembler (or C) after profiling.
dxf
2024-04-10 08:32:09 UTC
Permalink
Post by minforth
Post by mhx
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.
minforth
2024-04-10 10:38:43 UTC
Permalink
Post by dxf
Post by minforth
Post by mhx
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.
Yes, of course it's entirely up to you if you don't fully utilise the
available possibilities of your tools. But I also suspect that your Forth
system has no locals at all.

To give you another example: Dr Noble's formula translator. Certainly not
a toy of an incapable programmer, as you imply.
dxf
2024-04-11 02:09:25 UTC
Permalink
Post by minforth
Post by dxf
Post by minforth
Post by mhx
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness.  It begs the question why use forth at all
if one is merely going to toy with it.
Yes, of course it's entirely up to you if you don't fully utilise the
available possibilities of your tools.
In calling locals 'a tool' or 'used for the boring stuff' I'd have the
problem of explaining to colleagues how it is I can use a language in a
way that its creator has dismissed as antithetical.
Post by minforth
But I also suspect that your Forth
system has no locals at all.
DX-Forth offers locals as a loadable option. They're implemented
efficiently so as to be credible. I'm not aware of any user that has
availed themselves of it. Indeed I find they come to forth looking
for something that's honestly unique and challenging. To discover
Moore still resolute after all these years provides their role model.
a***@spenarnc.xs4all.nl
2024-04-11 09:49:02 UTC
Permalink
Post by dxf
Post by minforth
Post by dxf
Post by minforth
Post by mhx
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness.  It begs the question why use forth at all
if one is merely going to toy with it.
Yes, of course it's entirely up to you if you don't fully utilise the
available possibilities of your tools.
In calling locals 'a tool' or 'used for the boring stuff' I'd have the
problem of explaining to colleagues how it is I can use a language in a
way that its creator has dismissed as antithetical.
Post by minforth
But I also suspect that your Forth
system has no locals at all.
DX-Forth offers locals as a loadable option. They're implemented
efficiently so as to be credible. I'm not aware of any user that has
availed themselves of it. Indeed I find they come to forth looking
for something that's honestly unique and challenging. To discover
Moore still resolute after all these years provides their role model.
ciforth offers LOCAL as a loadable option.
They are implemented without regard of efficiency.
Also they cannot be used recursively.

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
dxf
2024-04-11 04:45:36 UTC
Permalink
Post by minforth
...
To give you another example: Dr Noble's formula translator. Certainly not
a toy of an incapable programmer, as you imply.
While Mr. Noble used locals in his formula translator, they were in fact
superfluous. Here's a modified version made some years ago that exchanges
locals with VALUEs:

https://pastebin.com/TnHKGNEk
Paul Rubin
2024-04-15 19:34:41 UTC
Permalink
Post by dxf
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.
If Forth's greatness comes from its extensibility, why resist using that
extensibility to have locals? Forth is a philosophy that has many
attractions, but its stack is an implementation artifact.
dxf
2024-04-10 02:34:28 UTC
Permalink
Post by minforth
This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.
The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.
'Readability' 'stack juggling'. No discussion there - just appeals
to prejudice.
Post by minforth
With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
Moore and Fox discussed that. Making a non-optimal approach more
efficient isn't the answer.
Paul Rubin
2024-04-15 19:41:51 UTC
Permalink
Post by dxf
Moore and Fox discussed that. Making a non-optimal approach more
efficient isn't the answer.
Moore and Fox made cpu's that saved a lot of hardware by not having
registers. If you're on a register machine though, why not use it?
Doing the opposite is far from optimal.

Locals may not have been practical on Moore's stack machines, but
neither was ROT, it seems to me. So no locals and not that much
stackrobatics. He instead used VARIABLEs all over the place, which
burnt extra storage since they often weren't all alive at the same time,
plus gave rise to naming troubles and loss of reentrancy, both sort of
odd for a stack language.

Anton Ertl
2024-04-10 07:00:38 UTC
Permalink
Post by minforth
With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
Stack items and locals are just different ways of expressing the same
data flow. Ideally a Forth system maps both expressions to the same
optimal way of expressing the data flow in machine code. There is one
example where this actually happens: Consider:

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

These four ways of expressing 3DUP are all commpiled to exactly the
same code by lxf/ntf:

804FC0A 8B4500 mov eax , [ebp]
804FC0D 8945F4 mov [ebp-Ch] , eax
804FC10 8B4504 mov eax , [ebp+4h]
804FC13 8945F8 mov [ebp-8h] , eax
804FC16 895DFC mov [ebp-4h] , ebx
804FC19 8D6DF4 lea ebp , [ebp-Ch]
804FC1C C3 ret near

However, it's more difficult when control flow is involved, and most
native-code compilers register-allocate only in straight-line code.
For comparison, here's what gforth-fast currently gives you:

3dup.1 3dup.2 3dup.3 3dup.4
Post by minforth
r 1->0 third 1->2 >l >l 1->1 dup 1->1
mov -$08[r14],r8 mov r15,$10[r13] >l mov $00[r13],r8
sub r14,$08 third 2->3 mov -$08[rbp],r8 sub r13,$08
2dup 0->2 mov r9,$08[r13] mov rdx,$08[r13] 2over 1->3
mov r8,$10[r13] third 3->1 mov rax,rbp mov r15,$18[r13]
mov r15,$08[r13] mov $00[r13],r8 add r13,$10 mov r9,$10[r13]
i 2->3 sub r13,$18 lea rbp,-$10[rbp] rot 3->1
mov r9,[r14] mov $10[r13],r15 mov -$10[rax],rdx mov $00[r13],r15
-rot 3->2 mov $08[r13],r9 mov r8,$00[r13] sub r13,$10
mov $00[r13],r9 ;s 1->1 >l @local0 1->1 mov $08[r13],r9
sub r13,$08 mov rbx,[r14] @local0 ;s 1->1
r> 2->1 add r14,$08 mov rax,rbp mov rbx,[r14]
mov -$08[r13],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
sub r13,$10 jmp eax mov -$08[rax],r8 mov rax,[rbx]
mov $10[r13],r8 @local1 1->2 jmp eax
mov r8,[r14] mov r15,$08[rbp]
add r14,$08 @local2 2->1
;s 1->1 mov -$08[r13],r15
mov rbx,[r14] sub r13,$10
add r14,$08 mov $10[r13],r8
mov rax,[rbx] mov r8,$10[rbp]
jmp eax @local0 1->2
mov r15,$00[rbp]
@local1 2->3
mov r9,$08[rbp]
@local2 3->1
mov -$10[r13],r9
sub r13,$18
mov $10[r13],r15
mov $18[r13],r8
mov r8,$10[rbp]
lit 1->2
#24
mov r15,$50[rbx]
lp+! 2->1
add rbp,r15
;s 1->1
mov rbx,[r14]
add r14,$08
mov rax,[rbx]
jmp eax

Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
load two items from the memory parts of the stack, store three items
to the memory part of the stack, and one stack-pointer update. The
differences are:

64-bit vs. 32-bit cells
different way of returning to the caller (;S code vs. ret)
sub vs. lea for the stack-pointer update
different order of loads and stores
different register allocation

But this is just a happy coincidence, and the other versions make it
clear that gforth-fast does not analyse the data flow even in
straight-line code.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
mhx
2024-04-10 16:35:04 UTC
Permalink
Post by Anton Ertl
However, it's more difficult when control flow is involved, and most
native-code compilers register-allocate only in straight-line code.
[..]
Post by Anton Ertl
load two items from the memory parts of the stack, store three items
to the memory part of the stack, and one stack-pointer update. The
..
Nice example. Here is what iForth does:

FORTH> : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ; ok
FORTH> : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ; ok
FORTH> : 3dup.3 params| a b c | a b c a b c ; ok
FORTH> : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ; ok
FORTH> : test.1 3dup.1 .S ; ok
FORTH> : test.2 3dup.2 .S ; ok
FORTH> : test.3 3dup.3 .S ; ok
FORTH> : test.4 3dup.4 .S ; ok
FORTH> see test.1
Flags: ANSI
$01340F80 : test.1
$01340F8A pop rbx
$01340F8B pop rdi
$01340F8C mov rax, [rsp] qword
$01340F90 push rdi
$01340F91 push rbx
$01340F92 push rax
$01340F93 push rdi
$01340F94 push rbx
$01340F95 jmp .S+10 ( $012BEA0A ) offset NEAR
FORTH> see test.2
Flags: ANSI
$01340FC0 : test.2
$01340FCA mov rbx, [rsp #16 +] qword
$01340FCF mov rcx, rbx
$01340FD2 mov rbx, [rsp 8 +] qword
$01340FD7 push rcx
$01340FD8 mov rcx, rbx
$01340FDB mov rbx, [rsp 8 +] qword
$01340FE0 push rcx
$01340FE1 push rbx
$01340FE2 jmp .S+10 ( $012BEA0A ) offset NEAR
FORTH> see test.3
Flags: ANSI
$01341000 : test.3
$0134100A pop rbx
$0134100B pop rdi
$0134100C mov rax, [rsp] qword
$01341010 push rdi
$01341011 push rbx
$01341012 push rax
$01341013 push rdi
$01341014 push rbx
$01341015 jmp .S+10 ( $012BEA0A ) offset NEAR
FORTH> see test.4
Flags: ANSI
$01341040 : test.4
$0134104A pop rbx
$0134104B pop rdi
$0134104C mov rax, [rsp] qword
$01341050 push rdi
$01341051 push rbx
$01341052 push rax
$01341053 push rdi
$01341054 push rbx
$01341055 jmp .S+10 ( $012BEA0A ) offset NEAR

FORTH> ( and finally all together ... )
FORTH> : test.5 1 2 3 3dup.1 3dup.2 3dup.3 3dup.4 .S ;

FORTH> see test.5
Flags: ANSI
$01341080 : test.4
$0134108A push 1 b#
$0134108C push 2 b#
$0134108E push 3 b#
$01341090 push 1 b#
$01341092 push 2 b#
$01341094 push 3 b#
$01341096 mov rbx, [rsp #16 +] qword
$0134109B mov rcx, rbx
$0134109E mov rbx, [rsp 8 +] qword
$013410A3 push rcx
$013410A4 mov rcx, rbx
$013410A7 mov rbx, [rsp 8 +] qword
$013410AC mov rdi, [rsp] qword
$013410B0 push rcx
$013410B1 push rbx
$013410B2 push rdi
$013410B3 push rcx
$013410B4 push rbx
$013410B5 push rdi
$013410B6 push rcx
$013410B7 push rbx
$013410B8 jmp .S+10 ( $012BEA0A ) offset NEAR

-marcel
mhx
2024-04-10 16:50:36 UTC
Permalink
The local version loses track of constants. If we reorder test.5 :

FORTH> : test.6 1 2 3 3dup.1 3dup.3 3dup.4 3dup.2 .S ; ok
FORTH> see test.6
Flags: ANSI
$01341100 : test.6
$0134110A push 1 b#
$0134110C push 2 b#
$0134110E push 3 b#
$01341110 push 1 b#
$01341112 push 2 b#
$01341114 push 3 b#
$01341116 push 1 b#
$01341118 push 2 b#
$0134111A push 3 b#
$0134111C push 1 b#
$0134111E push 2 b#
$01341120 push 3 b#
$01341122 mov rbx, [rsp #16 +] qword
$01341127 mov rcx, rbx
$0134112A mov rbx, [rsp 8 +] qword
$0134112F push rcx
$01341130 mov rcx, rbx
$01341133 mov rbx, [rsp 8 +] qword
$01341138 push rcx
$01341139 push rbx
$0134113A jmp .S+10 ( $012BEA0A ) offset NEAR
$0134113F ;

Maybe in vsn 7.0 :--)

-marcel
minforth
2024-04-10 20:59:49 UTC
Permalink
Post by mhx
The local version loses track of constants.
Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However inlining semanthically identic words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...
Anton Ertl
2024-04-13 17:27:15 UTC
Permalink
Post by minforth
Post by mhx
The local version loses track of constants.
Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However inlining semanthically identic words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...
I think mathematicians have a word for what you mean, but it's not
"commutative".

Anyway, it's not specific to locals. Everything that loses
information will affect everything that comes afterwards. E.g., in
Gforth we have a literal stack that only represents literals on the
data stack. So anything that moves values from the data stack (e.g.,
to the return stack) will lose the information about the constant.

Gforth does not have automatic inlining, so I use manual inlining here
(I also added some constant folding optimizations that are not built
in):

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;

: compile,-3dup.1 ( xt -- ) drop ]] >r 2dup r@ -rot r> [[ ;
' compile,-3dup.1 optimizes 3dup.1

: compile,-3dup.2 ( xt -- ) drop ]] 2 pick 2 pick 2 pick [[ ;
' compile,-3dup.2 optimizes 3dup.2

: fold3-4 ( xt -- ) 3 ['] 3lits> ['] >3lits ['] >4lits fold-constants ;
' fold3-4 optimizes third

: fold2-4 ( xt -- ) 2 ['] 2lits> ['] >2lits ['] >4lits fold-constants ;
' fold2-4 optimizes 2dup

: foo12 1 2 3 3dup.1 3dup.2 ;
: foo21 1 2 3 3dup.2 3dup.1 ;

Looking at the resulting code for FOO12 and FOO21, we see:

see foo12
: foo12 #1 #2 #3
Post by minforth
r 2dup i -rot r> third third third ; ok
see foo21
: foo21 #1 #2 #3 #1 #2 #3
Post by minforth
r 2dup i -rot r> ; ok
So 3DUP.1 loses information by involving the return stack, while
3DUP.2 only uses the data stack, and manages to work on the constants
when it is COMPILE,d (but I had to add an optimization rule for THIRD
to achieve that; doing the same with 2DUP did not help 3DUP.1,
though).

Having a literal-r-stack and a way to track literals through locals
would make all 3DUP variants work the same way when all parameters are
literals.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
a***@spenarnc.xs4all.nl
2024-04-14 11:08:25 UTC
Permalink
Post by Anton Ertl
Post by minforth
Post by mhx
The local version loses track of constants.
Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However inlining semanthically identic words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...
I think mathematicians have a word for what you mean, but it's not
"commutative".
Anyway, it's not specific to locals. Everything that loses
information will affect everything that comes afterwards. E.g., in
Gforth we have a literal stack that only represents literals on the
data stack. So anything that moves values from the data stack (e.g.,
to the return stack) will lose the information about the constant.
To cite Andrew Tannenbaum:
"global optimisation and symbolic debugging are each others
arch enemies"
Post by Anton Ertl
- anton
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
Anton Ertl
2024-04-10 17:06:45 UTC
Permalink
***@mips.complang.tuwien.ac.at (Anton Ertl) writes:
[deleted the versions not of interest in this posting
Post by Anton Ertl
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;
...
Post by Anton Ertl
3dup.2 3dup.4
third 1->2 dup 1->1
mov r15,$10[r13] mov $00[r13],r8
third 2->3 sub r13,$08
mov r9,$08[r13] 2over 1->3
third 3->1 mov r15,$18[r13]
mov $00[r13],r8 mov r9,$10[r13]
sub r13,$18 rot 3->1
mov $10[r13],r15 mov $00[r13],r15
mov $08[r13],r9 sub r13,$10
;s 1->1 mov $08[r13],r9
mov rbx,[r14] ;s 1->1
add r14,$08 mov rbx,[r14]
mov rax,[rbx] add r14,$08
jmp eax mov rax,[rbx]
jmp eax
load two items from the memory parts of the stack, store three items
to the memory part of the stack, and one stack-pointer update. The
64-bit vs. 32-bit cells
different way of returning to the caller (;S code vs. ret)
sub vs. lea for the stack-pointer update
different order of loads and stores
different register allocation
But this is just a happy coincidence, and the other versions make it
clear that gforth-fast does not analyse the data flow even in
straight-line code.
3DUP.4 also performs these two loads and three stores (again in a
different order), but it distributes the stack pointer update across
two instructions.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Loading...