Post by Stephen Pelc
Post by Anton Ertl
locals stack
   401   336 gforth-fast (AMD64)
   179   132 lxf 1.6-982-823 (IA-32)
   182   119 VFX FX Forth for Linux IA32 Version: 4.72 (IA-32)
   241   159 VFX Forth 64 5.43 (AMD64)
   163   175 iforth-5.1 mini (AMD64)
There are design decisions within locals that can impact optimisation.
The design of locals in VFX was influenced by Don Colburn's Forth's
and by a desire to use locals to simplify source code when interfacing
to a host operating system. Many operating systems return data
to the caller by passing the address of a variable/buffer as an input
parameter. Locals that can have an accessible address make such
code much easier to read and write.
Gforth has had variable-flavoured locals from the start, and
implemented VFX's local-buffer syntax some time ago without problems,
so Gforth's design decisions are obviously compatible with these
requirements.
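As a rough illustration of an addressable local, here is a minimal sketch using Gforth's W^ (variable-flavoured cell local). fake-os-query is a hypothetical stand-in for an operating-system call that returns its result through a pointer argument; it is not a real API, and the W^ notation is Gforth-specific, not standard Forth.

```forth
\ Hypothetical stand-in for an OS call that returns its result
\ through the address passed in; always reports success (ior=0).
: fake-os-query ( addr -- ior )  42 SWAP !  0 ;

\ Gforth-specific: W^ declares a variable-flavoured cell local.
\ Using the name pushes the local's address, not its value.
: query-size ( -- n )
  0 {: W^ result :}          \ initialise the local's cell to 0
  result fake-os-query THROW \ pass the local's address to the callee
  result @ ;                 \ fetch what the callee stored
```

Without an addressable local, the same word would need a global VARIABLE or a scratch buffer for the result.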
Now Gforth's numbers above are the worst of all Forth systems, so why
would Gforth be relevant? The native code for locals by iForth seems
to be very much in the same spirit: A separate locals stack, and
locals are accessed relative to the locals-stack pointer; and iForth
has the best locals code size of all (but looking at the VFX code, my
guess is that this happens to be in the present case mainly because
iForth uses RSP for the data stack and some other stack for the return
stack). Actually, even with your approach of keeping the locals on
the return stack, and having a separate locals-frame pointer, I don't
see why the locals code should be worse. But looking at the start of
the VFX64 code for VICHECK1, there is a bit of superfluous work:
: VICHECK1 {: pindex paddr -- pindex' paddr :} \ Checks for valid index
\ paddr is the address of the data, the first cell of which contains
\ the array size
pindex 0 paddr @ WITHIN IF \ Index is valid
VICHECK1
( 0050A460 488BD4 ) MOV RDX, RSP
( 0050A463 48FF7500 ) PUSH QWORD [RBP]
( 0050A467 53 ) PUSH RBX
( 0050A468 52 ) PUSH RDX
( 0050A469 57 ) PUSH RDI
( 0050A46A 488BFC ) MOV RDI, RSP
( 0050A46D 4881EC00000000 ) SUB RSP, # 00000000
( 0050A474 488B5D08 ) MOV RBX, [RBP+08]
( 0050A478 488D6D10 ) LEA RBP, [RBP+10]
( 0050A47C 488B5710 ) MOV RDX, [RDI+10]
( 0050A480 488B12 ) MOV RDX, 0 [RDX]
( 0050A483 B900000000 ) MOV ECX, # 00000000
( 0050A488 482BD1 ) SUB RDX, RCX
( 0050A48B 488B4718 ) MOV RAX, [RDI+18]
( 0050A48F 482BC1 ) SUB RAX, RCX
( 0050A492 483BC2 ) CMP RAX, RDX
( 0050A495 0F8319000000 ) JNB/AE 0050A4B4
It's not clear to me why you push so much on the return stack at the
start, instead of just the two values pindex and paddr (which you do
in 0050A463 and 0050A467). Ok, you also push the old locals-frame pointer
RDI in 0050A469, which is a result of having the locals on the return
stack instead of in a separate stack, but why push the old return
stack pointer? You know the size of your locals, just adjust RSP by
that much in the end.
The instruction at 0050A46D seems superfluous. My guess is that it's
there for the possible | part in the locals definition.
The next two instructions refill the TOS register RBX and adjust the
data stack pointer RBP. That completes the code for the locals
definition. From then on locals are loaded from memory, as
in iforth. Let's also inspect the end:
0 paddr \ Use zeroth index
THEN ;
( 0050A535 488D6DF0 ) LEA RBP, [RBP+-10]
( 0050A539 48C7450000000000 ) MOV QWord [RBP], # 00000000
( 0050A541 48895D08 ) MOV [RBP+08], RBX
( 0050A545 488B5F10 ) MOV RBX, [RDI+10]
( 0050A549 488B6708 ) MOV RSP, [RDI+08]
( 0050A54D 488B3F ) MOV RDI, 0 [RDI]
( 0050A550 C3 ) RET/NEXT
The THEN is right before 0050A549. The code before THEN pushes 0 and paddr
on the data stack, and stores the former TOS in memory before loading
the new TOS. The three instructions after the THEN restore the return
stack and locals-frame pointer and return.
So there is a little bit that can be done without much effort, but not
much.
I always thought that a separate locals stack is a thing I did in
Gforth out of laziness, and pay for it by having to maintain a
separate stack pointer, but it turns out that with locals on the
return stack, you still need an extra register for locals in memory,
and you spend additional overhead.
Post by Stephen Pelc
In the last decade or so there has been very little customer demand
for faster code.
See below.
Post by Stephen Pelc
However, higher level source code has been much
in demand. An example is Nick Nelson's value flavoured structures,
which are of particular merit when converting code from 32 bit to
64 bit host Forths.
Gforth has worked on 64-bit hosts since early 1996, and I found that
Forth code tends to have fewer portability problems between 32-bit and
64-bit platforms than C code, and that's not just my code, the
applications in appbench and many others are also quite portable.
A major merit for value-flavoured structures is that you can change
the field size (e.g., from 1 byte to 2 bytes or vice versa) without
changing all the code accessing those fields. That's independent of
cell size.
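The field-size point can be sketched in standard Forth 2012 by emulating a value-flavoured field by hand. This is not Nick Nelson's implementation or any system's built-in syntax; the structure and all word names below are hypothetical, chosen only for illustration.

```forth
\ Hand-written emulation; all names here are hypothetical.
BEGIN-STRUCTURE sensor
  CFIELD: sensor-level-field   \ currently a 1-byte field
END-STRUCTURE

\ Only these two words know the field size. Widening the field
\ means changing CFIELD: above and C@ / C! here, nothing else.
: sensor-level  ( addr -- n )  sensor-level-field C@ ;
: !sensor-level ( n addr -- )  sensor-level-field C! ;

\ Client code never mentions C@ or C!, so it survives a size change.
: bump-level ( addr -- )  DUP sensor-level 1+ SWAP !sensor-level ;
```

A value-flavoured field gives the same insulation without the hand-written accessor pair, since the field name itself does the fetch.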
Post by Stephen Pelc
Just because many of the Forth applications visible to the Forth
community now run on CPUs with 16 or 32 address registers
does not mean that all systems can implement the compiler
techniques required for high-performance locals.
It's obvious that hardly any Forth system implements register
allocation of locals, with the exception being lxf, which uses an
architecture with 8 general-purpose registers (address registers
recall bad memories from the 68000 days); and for lxf, register
allocation is limited to basic blocks or less.
Post by Stephen Pelc
I can buy a lot of CPU cycles for the cost of one day of programmer
time.
Some guy called Stephen Pelc (must be a different one) recently posted
<vbkdu0$1v8lq$***@dont-email.me>:
|We (MPE) converted much of our TCP/IP stack not to use locals. This
|was mostly on ARM7 devices, but the figures for other 32 bit CPUs of
|the period (say 15 years ago) were similar. Code density improved by
|about 25% and performance by about 50%.
How much time did that conversion cost? And this Stephen Pelc
suggested that Buzz McCool (and probably everyone else) should also
spend their time on avoiding and eliminating locals from their code.
I am with you here, not with the other Stephen Pelc: Programmers
should use locals liberally if it saves them time, even in the face of
slow locals implementations, because you can buy a lot of CPU cycles
for the additional programming cost of avoiding locals.
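To make the trade-off concrete, here is a sketch of the word discussed earlier, once with locals and once without. Only the first lines and the final "0 paddr THEN ;" of VICHECK1 appear in this post, so the valid-index branch below is an assumed completion, and VICHECK2 is a hypothetical rewrite, not code from any of the systems measured.

```forth
\ Sketch only: the middle of VICHECK1 is not shown in the post;
\ the "pindex paddr ELSE" part is an assumed completion.
: VICHECK1 {: pindex paddr -- pindex' paddr :}
  pindex 0 paddr @ WITHIN IF   \ Index is valid
    pindex paddr               \ assumed: pass the index through
  ELSE
    0 paddr                    \ Use zeroth index
  THEN ;

\ Hypothetical locals-free equivalent using stack shuffling.
: VICHECK2 ( pindex paddr -- pindex' paddr )
  2DUP @ 0 SWAP WITHIN 0= IF   \ index outside [0, size)?
    NIP 0 SWAP                 \ replace the index with zero
  THEN ;
```

The second version is shorter, but the reader has to track the stack through 2DUP, SWAP and NIP, which is the kind of programmer time the locals version saves.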
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2024: https://euro.theforth.net