Discussion:
The performance cost of extended precision
Anton Ertl
2021-01-08 18:49:39 UTC
On IA-32 and AMD64, iForth uses 80-bit floats (extended precision), at
least in the default configuration, and VFX's Ndp387.fth also uses
extended precision; on VFX you can change that by changing a single
line of code in Ndp387.fth, but it offers no way to tell it that you want
something else (changing code that other people update (and not
through a version control system) is a no-no).

Anyway, some time ago I found that VFX runs my matrix multiplication
code almost twice as fast with 64-bit floats (double precision) than
with 80-bit floats. Today I investigated this further
<***@mips.complang.tuwien.ac.at>
<***@mips.complang.tuwien.ac.at>, and here I repeat the
most interesting findings. Basically, I translated this loop

?do
fdup over f@ f* dup f+! float+ swap float+ swap
loop

into C (to avoid the slow counted loops of present-day Forth
compilers) and compiled it with gcc -O -mfpmath=387, resulting in the
following code:

long double double
fldt (%rdi) fld %st(0)
fmul %st(1),%st fmull (%rdi)
fldt (%rsi) faddl (%rsi)
faddp %st,%st(1) fstpl (%rsi)
fstpt (%rsi) add %rdx,%rdi
add %rdx,%rdi add %rdx,%rsi
add %rdx,%rsi sub $0x1,%rcx
sub $0x1,%rcx jne f <axpy+0xf>
jne 9 <axpy+0x9>

7 cycles/iteration 2 cycles/iteration

The cycle results are from a Skylake microarchitecture (most Intel
CPUs since 2016). So you can see that extended precision can cost a
lot of performance. If you need the precision, go for it, but if you
don't, and need performance, using 64-bit floats can get you that.
It's probably a good default to go for precision, but going for
performance should not require doing a no-no.
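The C source itself is not shown in the posting; under the assumption that the loop computes y[i] += a*x[i] over byte-strided arrays (the name axpy and the stride register come from the disassembly above, the rest is my reconstruction), it would look roughly like this:

```c
/* Sketch of the translated loop (my reconstruction, not the actual
   source): axpy with an explicit byte stride, so the same code shape
   serves 10-byte (long double) and 8-byte (double) elements.
   Swap the typedef to switch precisions. */
typedef double FLOAT;                       /* or: long double */

void axpy(FLOAT a, const char *x, char *y, long stride, long n)
{
    for (long i = 0; i < n; i++) {
        *(FLOAT *)y += a * *(const FLOAT *)x;   /* over f@ f* dup f+! */
        x += stride;                            /* float+ */
        y += stride;
    }
}
```

Compiling this with gcc -O -mfpmath=387, once per typedef, should produce code along the lines of the two columns above.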

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
minf...@arcor.de
2021-01-08 19:45:23 UTC
Post by Anton Ertl
On IA-32 and AMD64, iForth uses 80-bit floats (extended precision), at
least in the default configuration, and VFX's Ndp387.fth also uses
extended precision; on VFX you can change that by changing a single
line of code in Ndp387.fth, but it offers no way to tell it that you want
something else (changing code that other people update (and not
through a version control system) is a no-no).
Anyway, some time ago I found that VFX runs my matrix multiplication
code almost twice as fast with 64-bit floats (double precision) than
with 80-bit floats. Today I investigated this further
most interesting findings. Basically, I translated this loop
?do
loop
into C (to avoid the slow counted loops of present-day Forth
compilers) and compiled it with gcc -O -mfpmath=387, resulting in the
long double double
fldt (%rdi) fld %st(0)
fmul %st(1),%st fmull (%rdi)
fldt (%rsi) faddl (%rsi)
faddp %st,%st(1) fstpl (%rsi)
fstpt (%rsi) add %rdx,%rdi
add %rdx,%rdi add %rdx,%rsi
add %rdx,%rsi sub $0x1,%rcx
sub $0x1,%rcx jne f <axpy+0xf>
jne 9 <axpy+0x9>
7 cycles/iteration 2 cycles/iteration
The cycle results are from a Skylake microarchitecture (most Intel
CPUs since 2016). So you can see that extended precision can cost a
lot of performance. If you need the precision, go for it, but if you
don't, and need performance, using 64-bit floats can get you that.
It's probably a good default to go for precision, but going for
performance should not require doing a no-no.
Isn't that an old story? Penalties for inefficient 80-bit memory load/store
operations can outweigh CPU calculation advantages.

For instance:
https://retrocomputing.stackexchange.com/questions/9751/did-any-compiler-fully-use-intel-x87-80-bit-floating-point
Anton Ertl
2021-01-09 06:48:38 UTC
Post by ***@arcor.de
Isn't that an old story?
Not for me:-)
Post by ***@arcor.de
Penalties for inefficient 80-bit memory load/store
operations can outweigh CPU calculation advantages.
80-bit FLOPs are not faster (nor slower when using the 387
instructions), so it's unclear what advantages you mean. I don't
think you can weigh precision against speed: You either need the
precision or you don't. But yes, it's the memory accesses, not the
FLOPs that seem to make the big cost difference. So if you use DF@
DF! instead of F@ F! on VFX, you get the same (performance and
precision) as when you change the source code of Ndp387.fth.

- anton
dxforth
2021-01-09 00:38:12 UTC
Post by Anton Ertl
On IA-32 and AMD64, iForth uses 80-bit floats (extended precision), at
least in the default configuration, and VFX's Ndp387.fth also uses
extended precision; on VFX you can change that by changing a single
line of code in Ndp387.fth, but it offers no way to tell it that you want
something else (changing code that other people update (and not
through a version control system) is a no-no).
Anyway, some time ago I found that VFX runs my matrix multiplication
code almost twice as fast with 64-bit floats (double precision) than
with 80-bit floats. Today I investigated this further
most interesting findings. Basically, I translated this loop
?do
loop
into C (to avoid the slow counted loops of present-day Forth
compilers) and compiled it with gcc -O -mfpmath=387, resulting in the
long double double
fldt (%rdi) fld %st(0)
fmul %st(1),%st fmull (%rdi)
fldt (%rsi) faddl (%rsi)
faddp %st,%st(1) fstpl (%rsi)
fstpt (%rsi) add %rdx,%rdi
add %rdx,%rdi add %rdx,%rsi
add %rdx,%rsi sub $0x1,%rcx
sub $0x1,%rcx jne f <axpy+0xf>
jne 9 <axpy+0x9>
7 cycles/iteration 2 cycles/iteration
The cycle results are from a Skylake microarchitecture (most Intel
CPUs since 2016). So you can see that extended precision can cost a
lot of performance. If you need the precision, go for it, but if you
don't, and need performance, using 64-bit floats can get you that.
It's probably a good default to go for precision, but going for
performance should not require doing a no-no.
Ndp387.fth was written using 80-bit FPU instructions and consequently
lacks the 64-bit optimizations even when FPCELL is changed. I notice
FALIGN etc. is typically a no-op in x87 implementations. Has anyone
investigated float alignment and what difference, if any, it makes to
80/64 performance?
Anton Ertl
2021-01-09 06:56:57 UTC
Post by dxforth
Post by Anton Ertl
long double double
fldt (%rdi) fld %st(0)
fmul %st(1),%st fmull (%rdi)
fldt (%rsi) faddl (%rsi)
faddp %st,%st(1) fstpl (%rsi)
fstpt (%rsi) add %rdx,%rdi
add %rdx,%rdi add %rdx,%rsi
add %rdx,%rsi sub $0x1,%rcx
sub $0x1,%rcx jne f <axpy+0xf>
jne 9 <axpy+0x9>
7 cycles/iteration 2 cycles/iteration
...
Post by dxforth
Ndp387.fth was written using 80-bit FPU instructions and consequently
lacks the 64-bit optimizations even when FPCELL is changed.
Not sure what 64-bit optimizations you mean. If you mean using SSE2
instead of 387 instructions, I have now compiled the double-precision
benchmark without -mfpmath=387, resulting in the following loop:

400580: 66 0f 28 c8 movapd %xmm0,%xmm1
400584: f2 0f 59 0f mulsd (%rdi),%xmm1
400588: f2 0f 58 0e addsd (%rsi),%xmm1
40058c: f2 0f 11 0e movsd %xmm1,(%rsi)
400590: 48 01 d7 add %rdx,%rdi
400593: 48 01 d6 add %rdx,%rsi
400596: 48 83 e9 01 sub $0x1,%rcx
40059a: 75 e4 jne 400580 <axpy+0x5>

It uses 2 cycles/iteration, like the 387 double code.
Post by dxforth
I notice
FALIGN etc. is typically a no-op in x87 implementations. Has anyone
investigated float alignment and what difference, if any, it makes to
80/64 performance?
I have run the 387 code for both precisions with stride (in %rdx) 10
(i.e., 75% misalignments) and with stride 16 (all accesses maximally
aligned). The results were the same either way.

- anton
dxforth
2021-01-10 00:06:28 UTC
Post by Anton Ertl
Post by dxforth
Post by Anton Ertl
long double double
fldt (%rdi) fld %st(0)
fmul %st(1),%st fmull (%rdi)
fldt (%rsi) faddl (%rsi)
faddp %st,%st(1) fstpl (%rsi)
fstpt (%rsi) add %rdx,%rdi
add %rdx,%rdi add %rdx,%rsi
add %rdx,%rsi sub $0x1,%rcx
sub $0x1,%rcx jne f <axpy+0xf>
jne 9 <axpy+0x9>
7 cycles/iteration 2 cycles/iteration
...
Post by dxforth
Ndp387.fth was written using 80-bit FPU instructions and consequently
lacks the 64-bit optimizations even when FPCELL is changed.
Not sure what 64-bit optimizations you mean.
...
64-bit F+ F- F* F/ can avoid an FLD
Post by Anton Ertl
Post by dxforth
I notice
FALIGN etc. is typically a no-op in x87 implementations. Has anyone
investigated float alignment and what difference, if any, it makes to
80/64 performance?
I have run the 387 code for both precisions with stride (in %rdx) 10
(i.e., 75% misalignments) and with stride 16 (all accesses maximally
aligned). The results were the same either way.
- anton
Anton Ertl
2021-01-10 10:05:00 UTC
Post by dxforth
Post by Anton Ertl
Post by dxforth
Post by Anton Ertl
long double double
fldt (%rdi) fld %st(0)
fmul %st(1),%st fmull (%rdi)
fldt (%rsi) faddl (%rsi)
faddp %st,%st(1) fstpl (%rsi)
fstpt (%rsi) add %rdx,%rdi
add %rdx,%rdi add %rdx,%rsi
add %rdx,%rsi sub $0x1,%rcx
sub $0x1,%rcx jne f <axpy+0xf>
jne 9 <axpy+0x9>
7 cycles/iteration 2 cycles/iteration
...
Post by dxforth
Ndp387.fth was written using 80-bit FPU instructions and consequently
lacks the 64-bit optimizations even when FPCELL is changed.
Not sure what 64-bit optimizations you mean.
...
64-bit F+ F- F* F/ can avoid an FLD
Interestingly, the code produced by VFX with Ndp387.fth for the loop
body of

: foo ?do fdup over f@ f* dup f+! float+ swap float+ swap loop ;

is

fpcell=10 fpcell=8
FLD ST FLD ST
MOV EDX, [EBP] MOV EDX, [EBP]
FLD TBYTE 0 [EDX] FLD DOUBLE 0 [EDX]
FMULP ST(1), ST FMULP ST(1), ST
FLD TBYTE 0 [EBX] FADD DOUBLE 0 [EBX]
FADDP ST(1), ST FSTP DOUBLE 0 [EBX]
FSTP TBYTE 0 [EBX] ADD EBX, 08
ADD EBX, 0A MOV EDX, [EBP]
MOV EDX, [EBP] ADD EDX, 08
ADD EDX, 0A ADD [ESP], 01
ADD [ESP], 01 ADD [ESP+04], 01
ADD [ESP+04], 01 MOV [EBP], EDX
MOV [EBP], EDX JNO 080C6CB0
JNO 080C6CA0

So while F@ F* is not optimized in the way you indicate, F+! is,
because with fpcell=8 it is an alias for DF+!, which contains assembly
optimized for 64-bit floats. And on Skylake (but not on Zen or Zen2)
it really is an optimization: If you change the "long double" loop at
the top of this posting to use doubles, but otherwise the same
instructions, it takes 4 cycles/iteration on Skylake (still 2.7
cycles/iteration on Zen and Zen2).

- anton
Marcel Hendrix
2021-01-09 00:38:46 UTC
On Friday, January 8, 2021 at 8:09:14 PM UTC+1, Anton Ertl wrote:
[..]
Post by Anton Ertl
Anyway, some time ago I found that VFX runs my matrix multiplication
code almost twice as fast with 64-bit floats (double precision) than
with 80-bit floats. Today I investigated this further
most interesting findings. Basically, I translated this loop
Yes, unfortunately extended precision is slow. However,
e.g. a very popular SPICE simulator that I have to use
every day performs lightyears better when its extended
precision solver is used. What is the use of getting
the wrong results very fast?

The choice for a default is arbitrary, but it is no accident
that in iForth it is extended precision. I've made it
very easy to write FP code that can execute with any
arbitrary precision, just by setting a single constant
at the beginning of the source file.
Post by Anton Ertl
?do
loop
into C (to avoid the slow counted loops of present-day Forth
compilers) and compiled it with gcc -O -mfpmath=387, resulting in the
[..]
You have mentioned the counted loop problem before, but it was,
and is, difficult for me to reproduce your results. For that to
succeed it looks like the compiler needs to generate completely
stackless code for the exit condition of the loop.
[..]
Post by Anton Ertl
7 cycles/iteration 2 cycles/iteration
At least I can get those numbers in plain Forth, although
I need to unroll the loop for that.
Post by Anton Ertl
The cycle results are from a Skylake microarchitecture (most Intel
CPUs since 2016). So you can see that extended precision can cost a
lot of performance. If you need the precision, go for it, but if you
don't, and need performance, using 64-bit floats can get you that.
It's probably a good default to go for precision, but going for
performance should not require doing a no-no.
I'm not completely sure that I get what you are saying in the
last part of the last sentence.

-marcel

-- --------------
ANEW -test

#100000 =: #iters ( making this 10..100x larger results in slower code )

: init ( addr -- ) ( F: r -- ) #iters 0 ?DO FDUP F!+ LOOP FDROP DROP ;
: dinit ( addr -- ) ( F: r -- ) #iters 0 ?DO FDUP DF!+ LOOP FDROP DROP ;

CREATE x #iters FLOATS ALLOT
CREATE y #iters FLOATS ALLOT

-- Y <- a*X+Y
: test1 ( x y -- ) ( F: a -- )
#iters #10 /
0 ?do FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
FDUP swap f@+ f* swap f+!+
loop FDROP 2DROP ;

: test2 ( x y -- ) ( F: a -- )
#iters -ROT
BEGIN FDUP swap f@+ f* swap f+!+
ROT 1- DUP 2SWAP ROT 0=
UNTIL FDROP 3DROP ;


: test3 ( x y -- ) ( F: a -- )
#iters #10 /
0 ?do FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
FDUP swap df@+ f* swap df+!+
loop FDROP 2DROP ;

: test4 ( x y -- ) ( F: a -- )
#iters -ROT
BEGIN FDUP swap df@+ f* swap df+!+
ROT 1- DUP 2SWAP ROT 0=
UNTIL FDROP 3DROP ;

: BENCH ( -- )
CR ." extended precision test1 " #iters S>D (n,3) ." times: "
x 1e init y 2e init
TICKS-RESET x y 3e TEST1 TICKS? #iters UM/MOD NIP U>D (n,3) ." cycles/iteration."
CR ." extended precision test2 " #iters S>D (n,3) ." times: "
x 1e init y 2e init
TICKS-RESET x y 3e TEST2 TICKS? #iters UM/MOD NIP U>D (n,3) ." cycles/iteration."
CR ." double precision test3 " #iters S>D (n,3) ." times: "
x 1e dinit y 2e init
TICKS-RESET x y 3e TEST3 TICKS? #iters UM/MOD NIP U>D (n,3) ." cycles/iteration."
CR ." double precision test4 " #iters S>D (n,3) ." times: "
x 1e dinit y 2e dinit
TICKS-RESET x y 3e TEST4 TICKS? #iters UM/MOD NIP U>D (n,3) ." cycles/iteration." ;

( Try it. )

FORTH> cr BENCH cr BENCH cr BENCH

extended precision test1 100,000 times: 7 cycles/iteration.
extended precision test2 100,000 times: 9 cycles/iteration.
double precision test3 100,000 times: 2 cycles/iteration.
double precision test4 100,000 times: 5 cycles/iteration.

extended precision test1 100,000 times: 7 cycles/iteration.
extended precision test2 100,000 times: 9 cycles/iteration.
double precision test3 100,000 times: 2 cycles/iteration.
double precision test4 100,000 times: 5 cycles/iteration.

extended precision test1 100,000 times: 7 cycles/iteration.
extended precision test2 100,000 times: 9 cycles/iteration.
double precision test3 100,000 times: 2 cycles/iteration.
double precision test4 100,000 times: 5 cycles/iteration. ok

FORTH> ' test4 idis
$0164D780 : test4
$0164D78A pop rbx
$0164D78B pop rdi
$0164D78C push $000186A0 d#
$0164D791 mov rcx, rdi
$0164D794 lea rax, [rax 0 +] qword
$0164D798 fld ST(0)
$0164D79A fld [rcx] qword
$0164D79C fmulp ST(1), ST
$0164D79E fld [rbx] qword
$0164D7A0 faddp ST(1), ST
$0164D7A2 fstp [rbx] qword
$0164D7A4 pop rdi \ <====== not good?
$0164D7A5 lea rax, [rdi -1 +] qword
$0164D7A9 lea rdi, [rcx 8 +] qword
$0164D7AD lea rdx, [rbx 8 +] qword
$0164D7B1 cmp rax, 0 b#
$0164D7B5 push rax \ <====== not good?
$0164D7B6 push rdi \ <====== not good?
$0164D7B7 mov rcx, rdx
$0164D7BA mov rbx, rcx
$0164D7BD pop rcx \ <====== not good?
$0164D7BE jne $0164D798 offset NEAR
$0164D7C4 fpop,
$0164D7CE ffreep ST(0)
$0164D7D0 pop rdi
$0164D7D1 ;
Paul Rubin
2021-01-09 00:49:39 UTC
Yes, unfortunately extended precision is slow. However, e.g. a very
popular SPICE simulator that I have to use every day performs
lightyears better when its extended precision solver is used. What is
the use of getting the wrong results very fast?
Does SPICE really do seriously numerically unstable things? I thought
Bill Gates told us that 64 bits should be enough for anyone, or
something like that.

There is starting to be some hardware with IEEE 128 bit float arithmetic
now, I think.
dxforth
2021-01-09 05:09:15 UTC
Post by Paul Rubin
Yes, unfortunately extended precision is slow. However, e.g. a very
popular SPICE simulator that I have to use every day performs
lightyears better when its extended precision solver is used. What is
the use of getting the wrong results very fast?
Does SPICE really do seriously numerically unstable things? I thought
Bill Gates told us that 64 bits should be enough for anyone, or
something like that.
There is starting to be some hardware with IEEE 128 bit float arithmetic
now, I think.
So more memory and ever faster clocks just to run apps that never needed
this precision :)
none) (albert
2021-01-09 10:47:42 UTC
Post by Paul Rubin
Yes, unfortunately extended precision is slow. However, e.g. a very
popular SPICE simulator that I have to use every day performs
lightyears better when its extended precision solver is used. What is
the use of getting the wrong results very fast?
Does SPICE really do seriously numerically unstable things? I thought
Bill Gates told us that 64 bits should be enough for anyone, or
something like that.
If someone told me that when you are doing numerically unstable things,
64 bits will not save you, I would tend to believe him.
If you don't know how stable your calculations are, you
shouldn't be in the driver's seat for 32/64 decisions.
Post by Paul Rubin
There is starting to be some hardware with IEEE 128 bit float arithmetic
now, I think.
Groetjes Albert
--
This is the first day of the end of your life.
It may not kill you, but it does make you weaker.
If you can't beat them, too bad.
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Paul Rubin
2021-01-10 01:24:06 UTC
Post by none) (albert
If someone told me that when you are doing numerically unstable things,
64 bits will not save you, I would tend to believe him.
Right, but how wide is the region where 64 bits leaves you in trouble,
while 80 bits gets you out of it?
Post by none) (albert
If you don't know how stable your calculations are, you
shouldn't be in the driver's seat for 32/64 decisions.
If things are really badly unstable you have to examine your algorithms.
But Prof. Kahan used to say that in routine situations, using 64 bits
instead of 32 was enough to paper over a lot of unawareness (not his
words but that's my sense of what he was saying).
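That point is easy to demonstrate with a toy accumulation (my example, not Kahan's): naively summing 0.1 ten million times drifts tens of thousands away from the exact 1e6 in single precision, while double precision stays within a tiny fraction.

```c
/* Toy demonstration (my example, not from the thread): naive summation
   of 0.1.  In single precision, once the sum is large each added 0.1
   is rounded to the grid of representable floats, and the error is
   systematic and large; in double the same sum stays very close to
   the exact 1e6. */
#include <math.h>

double sum_double(int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += 0.1;
    return s;
}

double sum_float(int n) {
    float s = 0.0f;                 /* single-precision accumulator */
    for (int i = 0; i < n; i++) s += 0.1f;
    return (double)s;
}
```

With n = 10,000,000 the double sum is within a rounding error of 1e6, while the float sum is off by tens of thousands.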
Marcel Hendrix
2021-01-10 11:43:44 UTC
Post by none) (albert
Yes, unfortunately extended precision is slow. However, e.g. a very
popular SPICE simulator that I have to use every day performs
lightyears better when its extended precision solver is used. What is
the use of getting the wrong results very fast?
Does SPICE really do seriously numerically unstable things? I thought
Bill Gates told us that 64 bits should be enough for anyone, or
something like that.
[..]
Post by none) (albert
If someone told me that when you are doing numerically unstable things,
64 bits will not save you, I would tend to believe him.
When you are given a matrix with a certain condition number, and the
matrix can't be changed, what do you do? By definition, more bits allow
you to solve matrices with worse and worse condition numbers, at
least in the engineering sense of 'solve'.
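A one-line illustration of what the extra bits buy (my example; it assumes an x86 long double with a 64-bit significand):

```c
/* Toy illustration (mine, not from the thread): at magnitude 1e16,
   adjacent doubles are 2 apart, so adding 1 is rounded away; the x87
   80-bit format, with its 64-bit significand, keeps it.  volatile
   forces each intermediate to be rounded to its declared type, so
   x87 excess precision cannot mask the effect. */
#include <float.h>

double gap_double(void) {
    volatile double x = 1e16, s;
    s = x + 1.0;                 /* rounds back to 1e16 */
    return s - x;                /* 0.0 */
}

long double gap_extended(void) {
    volatile long double x = 1e16L, s;
    s = x + 1.0L;                /* exact with a 64-bit significand */
    return s - x;                /* 1.0 where long double is 80-bit */
}
```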
Post by none) (albert
If you don't know how stable your calculations are, you
shouldn't be in the driver's seat for 32/64 decisions.
In my view, in that case you should decide for 64 bits.

-marcel
Hugh Aguilar
2021-01-12 07:49:54 UTC
Post by none) (albert
If someone told me that when you are doing numerically unstable things,
64 bits will not save you, I would tend to believe him.
If you don't know how stable your calculations are, you
shouldn't be in the driver's seat for 32/64 decisions.
That is hilarious!
Albert van der Horst wants to be in the driver's seat!
Am I the only one who immediately thinks of a clown car at the circus?
Post by none) (albert
\ You only need D- and D<=
: .cf2 BEGIN 100000 0 DO 2OVER D- 2DUP 0. D<= IF I . LEAVE THEN
LOOP 2DUP OR WHILE 2OVER D+
.S KEY DROP \ Can leave this out
2SWAP REPEAT ;
Make that correct by
2SWAP REPEAT 1 . ;
"
3.141592653589793238
1.000000000000000000
2DUP D. 2OVER D. KEY DROP .cf2
You don't know what continued fractions are
Said the cab-driver to the mathematician
--- your code is nonsense ---
It gives the same results as your code, except for the correction
I gave.
His code does not give the same results as my CF.4TH code
(actually written by Nathaniel Grossman in a 1984 FD article).
CF.4TH produces rational approximations that can be used by */ etc..
He doesn't seem to know that this is the purpose of continued fractions.
Jon Nicoll
2021-01-09 22:23:58 UTC
Yes, unfortunately extended precision is slow. However, e.g. a very
popular SPICE simulator that I have to use every day performs
lightyears better when its extended precision solver is used. What is
the use of getting the wrong results very fast?
Does SPICE really do seriously numerically unstable things? I thought
Bill Gates told us that 64 bits should be enough for anyone, or
something like that.
I am not sure that this is exactly what you are asking, but I understand that
there are some (useful, real-life) circuits that cannot be simulated with
SPICE. I am not sure of the reasons for this; it may be that
their operation depends on the exact details of their instability.
There is starting to be some hardware with IEEE 128 bit float arithmetic
now, I think.
Paul Rubin
2021-01-10 01:19:33 UTC
Post by Jon Nicoll
I am not sure that this is exactly what you are asking, but I
understand that there are some (useful, real-life) circuits that
cannot be simulated with SPICE. I am not sure of the reasons for
this; it may be that their operation depends on the exact
details of their instability.
There are certainly unstable circuits that are hard to simulate, but for
the same reason their real-world behaviour is also not predictable or
consistent.
a***@math.uni.wroc.pl
2021-01-10 13:17:59 UTC
Post by Paul Rubin
Post by Jon Nicoll
I am not sure that this is exactly what you are asking, but I
understand that there are some (useful, real-life) circuits that
cannot be simulated with SPICE. I am not sure of the reasons for
this; it may be that their operation depends on the exact
details of their instability.
There are certainly unstable circuits that are hard to simulate, but for
the same reason their real-world behaviour is also not predictable or
consistent.
No. Basically, you are trying to solve a large system of differential
equations. It is well known that solution methods for differential
equations may behave badly even if the equation itself behaves well.
Most methods replace the differential equation by a finite-difference
equation. The finite-difference version typically has more
solutions, and some of the added solutions may be unstable.
There is a large literature about "stable" methods. The catch
is that innocent-looking equations which describe a _very_
stable physical system cause trouble for most methods
of solving differential equations. There has been progress
since the time when the original SPICE was created, and I do
not know if SPICE is up to date with newer developments.
Anyway, the current paradigm is to use special methods
which may require a large number of operations but are
"stable". However, "unstable" methods can sometimes
deliver results using a much smaller number of arithmetic
operations. If multiple precision were cheaper, an
"unstable" method with enough arithmetic accuracy to
overcome the instability could be faster than a "stable"
method.

Concerning SPICE, I do not know whether it is a bad method
or a deliberate design decision to depend on higher
numerical accuracy.
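A concrete toy instance of that catch (my example, not from the post): the equation y' = -1000y describes an extremely stable system, yet the simplest finite-difference method diverges unless the step is tiny, while the implicit variant is stable for any step size.

```c
/* Toy stiff ODE y' = -K*y with K = 1000 (my example): the exact
   solution decays, but forward (explicit) Euler multiplies y by
   (1 - K*h) each step and blows up for h > 2/K, while backward
   (implicit) Euler multiplies by 1/(1 + K*h) and stays bounded
   for every h > 0. */
#include <math.h>

#define K 1000.0

double forward_euler(double y0, double h, int steps) {
    double y = y0;
    for (int i = 0; i < steps; i++)
        y += h * (-K * y);       /* y <- (1 - K*h) * y */
    return y;
}

double backward_euler(double y0, double h, int steps) {
    double y = y0;
    for (int i = 0; i < steps; i++)
        y /= 1.0 + K * h;        /* solves y_new = y + h*(-K*y_new) */
    return y;
}
```

With h = 0.01 the explicit factor per step is -9, so the iterates explode, even though the system being simulated is perfectly stable; with h = 0.0001 both methods decay as they should.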
--
Waldek Hebisch
Jan Coombs
2021-01-10 15:17:02 UTC
On Sun, 10 Jan 2021 13:17:59 +0000 (UTC)
Post by a***@math.uni.wroc.pl
Anyway, the current paradigm is to use special methods
which may require a large number of operations but are
"stable". However, "unstable" methods can sometimes
deliver results using a much smaller number of arithmetic
operations. If multiple precision were cheaper, an
"unstable" method with enough arithmetic accuracy to
overcome the instability could be faster than a "stable"
method.
So would a processor that supported unbounded floats make this more
efficient? Would keeping track of the error bound also help?

Jan Coombs
a***@math.uni.wroc.pl
2021-01-10 21:16:47 UTC
Post by Jan Coombs
On Sun, 10 Jan 2021 13:17:59 +0000 (UTC)
Post by a***@math.uni.wroc.pl
Anyway, the current paradigm is to use special methods
which may require a large number of operations but are
"stable". However, "unstable" methods can sometimes
deliver results using a much smaller number of arithmetic
operations. If multiple precision were cheaper, an
"unstable" method with enough arithmetic accuracy to
overcome the instability could be faster than a "stable"
method.
So would a processor that supported unbounded floats make this more
efficient? Would keeping track of the error bound also help?
For really large precisions software floats are probably as
good as one can get. The problem is that there can be a significant
drop in performance when switching from hardware to software
versions. Software versions typically are "unbounded precision"
and frequently need a memory allocation for each result, so
there may be a 100-fold slowdown when switching from hardware
to software floating point. Some systems implement higher
but fixed-precision software floating point. For example,
some Lisp compilers offer "double-double" and "quad-double"
types with twice (respectively 4 times) the precision of double.
Each fixed-precision operation is realized by multiple
operations on doubles, so it is significantly slower than
double but much cheaper than arbitrary precision. It
would be nice to have good hardware and software support
for such types, in particular hardware operations designed
in such a way that one can easily combine them to do
higher precision (current methods are smart hacks, but
some of the operations needed seem to be pure overhead).
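The double-double representation he describes rests on error-free transformations; a minimal sketch (the standard textbook algorithm, not code from any of the Lisp systems mentioned). Note that it requires strict double rounding, i.e. SSE2 math rather than x87 excess precision:

```c
/* Minimal sketch of the double-double idea: a value is an unevaluated
   sum hi + lo of two doubles.  two_sum is Knuth's error-free addition:
   s + err equals a + b exactly.  Full double-double arithmetic layers
   a few of these per operation.  Requires results rounded to double
   at each step (SSE2, the default on AMD64); x87 excess precision
   breaks the algorithm. */
typedef struct { double hi, lo; } dd;

static dd two_sum(double a, double b) {
    double s   = a + b;
    double bb  = s - a;               /* the part of b that made it into s */
    double err = (a - (s - bb)) + (b - bb);
    return (dd){ s, err };
}

/* Add a plain double to a double-double (simplified; ignores overflow). */
static dd dd_add(dd x, double y) {
    dd s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo);       /* renormalize hi/lo */
}
```

For example, two_sum(1.0, 1e-20) returns hi = 1.0 and lo = 1e-20, preserving a low part that plain double addition would round away.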
--
Waldek Hebisch
minf...@arcor.de
2021-01-10 21:59:29 UTC
Post by a***@math.uni.wroc.pl
Post by Jan Coombs
On Sun, 10 Jan 2021 13:17:59 +0000 (UTC)
Post by a***@math.uni.wroc.pl
Anyway, the current paradigm is to use special methods
which may require a large number of operations but are
"stable". However, "unstable" methods can sometimes
deliver results using a much smaller number of arithmetic
operations. If multiple precision were cheaper, an
"unstable" method with enough arithmetic accuracy to
overcome the instability could be faster than a "stable"
method.
So would a processor that supported unbounded floats make this more
efficient? Would keeping track of the error bound also help?
For really large precisions software floats are probably as
good as one can get. The problem is that there can be a significant
drop in performance when switching from hardware to software
versions. Software versions typically are "unbounded precision"
and frequently need a memory allocation for each result, so
there may be a 100-fold slowdown when switching from hardware
to software floating point. Some systems implement higher
but fixed-precision software floating point. For example,
some Lisp compilers offer "double-double" and "quad-double"
types with twice (respectively 4 times) the precision of double.
Each fixed-precision operation is realized by multiple
operations on doubles, so it is significantly slower than
double but much cheaper than arbitrary precision. It
would be nice to have good hardware and software support
for such types, in particular hardware operations designed
in such a way that one can easily combine them to do
higher precision (current methods are smart hacks, but
some of the operations needed seem to be pure overhead).
Sure. An old slide presentation already said it:
https://www-zeuthen.desy.de/acat05/talks/de_Dinechin.Florent.2/QuadAndMore.pdf

High precision math is a special field, used e.g. in particle physics.
There is a multitude of Fortran and C HPR math libraries available.
Python and Julia are gaining ground.

I fail to see any benefit in using Forth within this field, unless
a Forth system offers direct fine-tunable access to the fp-math
operators provided by the CPU.
Marcel Hendrix
2021-01-10 11:35:59 UTC
Post by Paul Rubin
Yes, unfortunately extended precision is slow. However, e.g. a very
popular SPICE simulator that I have to use every day performs
lightyears better when its extended precision solver is used. What is
the use of getting the wrong results very fast?
Does SPICE really do seriously numerically unstable things?
The users of SPICE, and worse, many authors of device models, do
(the latter even on purpose).

Support for SPICE is 99% educating people on simple numerical
principles that can't be suspected to be important without in-depth
involvement with the source code (which is unavailable, with only
very few exceptions).
Post by Paul Rubin
I thought
Bill Gates told us that 64 bits should be enough for anyone, or
something like that.
He said 640K, and people suspect he said that because 16-bit
hardware was considered good enough.
Post by Paul Rubin
There is starting to be some hardware with IEEE 128 bit float arithmetic
now, I think.
That is really interesting. Do you have a reference?

-marcel
Paul Rubin
2021-01-10 17:04:04 UTC
Post by Marcel Hendrix
Support for SPICE is 99% educating people on simple numerical
principles that can't be suspected to be important without in-depth
involvement with the source code (which is unavailable with only
very few exceptions).
Hmm, interesting. I thought SPICE was freely available, but maybe more
importantly, also didn't realize that numerical stability issues came up
with it that often. I don't really know what it does, but have thought
of it as a tool that lets you set up a circuit model, and then runs what
amounts to a primitive ODE solver over the circuit. Is that reasonable?
Post by Marcel Hendrix
[Bill Gates] said 640K
Yeah, I was making a semi-comical reference to that.
Post by Marcel Hendrix
Post by Paul Rubin
There is starting to be some hardware with IEEE 128 bit float
arithmetic now, I think.
That is really interesting. Do you have a reference?
I guess I was thinking of this, which turns out to list mostly obsolete
hardware:

https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Hardware_support

POWER9 is mentioned though, suggesting that maybe POWER10 has it too.
Marcel Hendrix
2021-01-10 11:58:44 UTC
Permalink
Post by Paul Rubin
Yes, unfortunately extended precision is slow. However, e.g. a very
popular SPICE simulator that I have to use every day performs
lightyears better when its extended precision solver is used. What is
the use of getting the wrong results very fast?
Does SPICE really do seriously numerically unstable things?
The users of SPICE, and worse, many authors of device models, do
(the latter even on purpose).

Support for SPICE is 99% educating people on simple numerical
principles that can't be suspected to be important without in-depth
involvement with the source code (which is unavailable with only
very few exceptions).
Post by Paul Rubin
I thought
Bill Gates told us that 64 bits should be enough for anyone, or
something like that.
He said 640K, and people suspect he said that because 16-bit
hardware was good enough.
Post by Paul Rubin
There is starting to be some hardware with IEEE 128 bit float arithmetic
now, I think.
That is really interesting. Do you have a reference?

-marcel
Anton Ertl
2021-01-09 07:15:35 UTC
Permalink
[...]
Post by Marcel Hendrix
You have mentioned the counted loop problem before, but it was,
and is, difficult for me to reproduce your results. For that to
succeed it looks like the compiler needs to generate completely
stackless code for the exit condition of the loop.
I am not sure what you mean by stackless. Anyway, if you update one
value, you need to keep that in a register; if you update two values
(as iForth and VFX DO...LOOP are doing), you need to keep both in
registers. You can keep the limit in memory: the out-of-order
hardware tends to perform the load of the limit early, so its latency
does not play a role; actually, the branch is predicted, so even if
the load was not performed early, its latency would only play a role
on a misprediction.

I have outlined how to code counted loops in
<***@mips.complang.tuwien.ac.at>. Here are some excerpts
from that posting:

|For an empty DO LOOP this results in the following code:
|
| 0: 48 83 c7 01 add $0x1,%rdi
| 4: 48 3b 3e cmp (%rsi),%rdi
| 7: 75 f7 jne 0 <loop>
|
|(with the index in rdi and the return stack pointer in rsi).
|
|This takes 1 cycle per iteration on a Skylake.

|For +LOOP, it's also important to keep the loop counter in a register.
|But the termination check is harder. One approach is to adjust the
|counter (in %rdi) such that you can use
|
|add incr, %rdi
|jno start
|
|which catches the termination. Then you need to implement I as the sum
|of an offset (kept on the return stack) and the counter.
|
|An alternative is to have a more complex loop termination check. The
|latter tends to cost additional temporary registers at +LOOP time
|(where existing Forth systems do not keep many stack values in
|registers), the former at I time (where registers may be more
|precious). So let's see how it might look:
|
| 0: 48 8b 0e mov (%rsi),%rcx
| 3: 48 01 f9 add %rdi,%rcx
| 6: 48 01 d7 add %rdx,%rdi
| 9: 48 01 d1 add %rdx,%rcx
| c: 71 f2 jno 0 <ploop>
|
|%rdx is the increment, %rcx is the temporary register, %rsi is the
|return stack pointer, which points to the offset for the loop
|(minint-limit), %rdi is the counter.
|
|This loop takes 1.37 cycles per iteration on a Skylake, and 2 cycles
|per iteration on a Zen and Zen2. The earlier version probably takes
|one cycle per iteration. The question is whether the better register
|behaviour of the latter version is worth the additional cycles.

If you use counter adjustment, you probably want to do it for LOOP,
too, so you can use the same I implementation with both LOOP and
+LOOP. So LOOP could look as follows:

add $1, %rdi
jno start

The I implementation for counter adjustment could look as follows:

mov (%rsi),%rax
add %rdi, %rax #value of I is now in %rax

whereas without counter adjustment you could just use %rdi as the
value of I directly. I lean towards not using counter adjustment. If
+LOOP is too slow, the programmer can use /LOOP and -LOOP with simpler
(and hopefully faster) loop termination checks.
Post by Marcel Hendrix
Post by Anton Ertl
The cycle results are from a Skylake microarchitecture (most Intel
CPUs since 2016). So you can see that extended precision can cost a
Post by Anton Ertl
lot of performance. If you need the precision, go for it, but if you
don't, and need performance, using 64-bit floats can get you that.
It's probably a good default to go for precision, but going for
performance should not require doing a no-no.
I'm not completely sure that I get what you are saying in the
last part of the last sentence.
I mentioned earlier that having to change source code coming from
MPE is a no-no. A good solution for Ndp387.fth would have been to
check if FPCELL is already defined, and only define it as 10 if it is
not, allowing the user to define FPCELL with value 8 or 4 before
loading Ndp387.fth, thus allowing the precision to be configured
without changing the source file.

Your solution of defining a constant at the start looks good. Does
that mean that the first FP word encountered looks in the dictionary
for that constant and then configures FP accordingly?

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
Brian Fox
2021-01-09 15:17:34 UTC
Permalink
Post by Anton Ertl
I am not sure what you mean with stackless. Anyway, if you update one
value, you need to keep that in a register; if you update two values
(as iForth and VFX DO...LOOP are doing), you need to keep both in
registers.
<SNIP>

Does using a machine with a larger available register set allow one to
improve the general performance of DO/LOOP? (Perhaps you are referring
specifically to Intel)

This makes me wonder should we just standardize some form of Chuck's
FOR/NEXT loop?
minf...@arcor.de
2021-01-09 15:36:04 UTC
Permalink
Post by Brian Fox
I am not sure what you mean with stackless. Anyway, if you update one
value, you need to keep that in a register; if you update two values
(as iForth and VFX DO...LOOP are doing), you need to keep both in
registers.
<SNIP>
Does using a machine with a larger available register set allow one to
improve the general performance of DO/LOOP? (Perhaps you are referring
specifically to intel)
This makes me wonder should we just standardize some form of Chuck's
FOR/NEXT loop?
Write an RfD and carry it to a majority. Nothing could be easier. See
http://www.forth200x.org/

BTW I don't care about Chuck's, I like my own:

: T1 5 FOR N . NEXT ;
T1 0 1 2 3 4 ok
: T2 100 5 2 FOR> N . NEXT ;
T2 100 102 104 106 108 ok
: T3 100 5 2 <FOR N . NEXT ;
T3 108 106 104 102 100 ok

Handy when you walk over arrays.
dxforth
2021-01-10 01:44:25 UTC
Permalink
Post by ***@arcor.de
Post by Brian Fox
I am not sure what you mean with stackless. Anyway, if you update one
value, you need to keep that in a register; if you update two values
(as iForth and VFX DO...LOOP are doing), you need to keep both in
registers.
<SNIP>
Does using a machine with a larger available register set allow one to
improve the general performance of DO/LOOP? (Perhaps you are referring
specifically to intel)
This makes me wonder should we just standardize some form of Chuck's
FOR/NEXT loop?
Write an RfD and lead it to majority. Nothing easier. See
http://www.forth200x.org/
Beginning to look like another itch that won't go away until 200x
scratches it.
Anton Ertl
2021-01-09 22:48:12 UTC
Permalink
Post by Brian Fox
Post by Anton Ertl
I am not sure what you mean with stackless. Anyway, if you update one
value, you need to keep that in a register; if you update two values
(as iForth and VFX DO...LOOP are doing), you need to keep both in
registers.
<SNIP>
Does using a machine with a larger available register set allow one to
improve the general performance of DO/LOOP?
DO...LOOP requires one register (in addition to the return stack
pointer) to be implemented efficiently. All general-purpose
architectures these days have way more registers.
Post by Brian Fox
(Perhaps you are referring
specifically to intel)
No. However, there are some architectures without overflow or carry
flag. This rules out some implementation options, but there are still
plenty left.
Post by Brian Fox
This makes me wonder should we just standardize some form of Chuck's
FOR/NEXT loop?
FOR...NEXT also requires one register to be implemented efficiently.
With an efficient implementation, both need 1 cycle/iteration (for an
empty or not much filled loop). Let's see how existing Forth systems
fare (on Skylake):

for...next do...loop
cyc inst cyc inst
4.9 3.0 5.1 2.0 lxf
5.2 2.3 5.6 3.2 iforth
6.4 11.7 6.0 9.1 gforth-fast
- - 5.2 2.0 sf
- - 5.5 3.0 vfxlin

Every one of these implementations is far from the practical optimum. The
differences between for...next and do...loop are small, and I don't
expect them to be larger with a more efficient implementation.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
Marcel Hendrix
2021-01-10 11:56:04 UTC
Permalink
On Sunday, January 10, 2021 at 12:29:39 AM UTC+1, Anton Ertl wrote:
[..]
Post by Anton Ertl
for...next do...loop
cyc inst cyc inst
4.9 3.0 5.1 2.0 lxf
5.2 2.3 5.6 3.2 iforth
6.4 11.7 6.0 9.1 gforth-fast
- - 5.2 2.0 sf
- - 5.5 3.0 vfxlin
Any of these implementations is far from the practical optimum. The
differences between for...next and do...loop are small, and I don't
expect them to be larger with a more efficient implementation.
- anton
The decisions for iForth's DO LOOP are, in addition to ignorance, based
on the desire to have a fast I, >R, R>, R@ etc. Also, there is the problem
of what to do with nested loops.

-marcel

PS: Thanks a lot for showing decisively that the register-based FP
instruction set is not faster than the stack-based FPU instruction set.
I have never seen that demonstrated anywhere.
Anton Ertl
2021-01-10 14:10:56 UTC
Permalink
Post by Marcel Hendrix
The decisions for iForth's DO LOOP are, in addition to ignorance, based
what to do with nested loops.
Ok, what to do about the index/index1 register (see
<***@mips.complang.tuwien.ac.at> for nomenclature)?

1) Simpler compiler: You dedicate a register to it permanently. On a
DO, you save the old contents on the return stack, along with the new
limit/offset/offset1. You set the register to the new index/index1.
When leaving the loop, restore the register from the return stack.
CATCH also needs to save it and THROW needs to restore it.

2) More compiler complexity: You use any currently free register for
it. On DO or on a call, you can save registers on the return stack to
make room for stuff inside the loop, or to satisfy your calling
convention and restore them behind the LOOP or call. When compiling
J, you know there is no user-generated stuff on the return stack since
the outer DO, and can access the return stack accordingly. In this
way you have more registers available in words without DO LOOP.

Saving the index before calls and restoring it after calls probably
does not slow down things much: The called word probably introduces
enough latency that saving and restoring the index is no longer in the
critical path. CATCH and THROW are handled automatically, if the
compiler sees them as calls.

Restoring can also be done lazily: After a call, only restore the
index when encountering a return stack access, I, or a control flow
word like IF. So if you have back-to-back calls, you don't restore and
save all the time.
Post by Marcel Hendrix
PS: Thanks a lot for showing decisively that the register-based FP
instruction set is not faster than the stack-based of the FPU. I have
never seen it demonstrated anywhere.
Don't jump to conclusions: It's one case where it is not. It's
throughput-limited code with 8 instructions every two cycles (pretty
much the limit of Skylake), one of which is an FP add and one an FP
mul. Throughput-limited code with a higher proportion of FLOPs might
show tighter capacity limits for the 387 than for SSE2. And when your
problem can be vectorized, 387 really cannot compete against SSE2,
AVX, and AVX512.

Conversely, I have also measured the FP multiplication latency:

SSE2 387
4 5 Skylake
4 5 Zen
3 5 Zen2

So on latency-limited code you can also see (small) advantages for
SSE2.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
Anton Ertl
2021-01-10 10:23:40 UTC
Permalink
Post by Anton Ertl
Post by Brian Fox
This makes me wonder should we just standardize some form of Chuck's
FOR/NEXT loop?
FOR...NEXT also requires one register to be implemented efficiently.
With an efficient implementation, both need 1 cycle/iteration (for an
empty or not much filled loop). Let's see how existing Forth systems
for...next do...loop
cyc inst cyc inst
4.9 3.0 5.1 2.0 lxf
5.2 2.3 5.6 3.2 iforth
6.4 11.7 6.0 9.1 gforth-fast
- - 5.2 2.0 sf
- - 5.5 3.0 vfxlin
Any of these implementations is far from the practical optimum. The
differences between for...next and do...loop are small, and I don't
expect them to be larger with a more efficient implementation.
Just to give an idea, here are FOR NEXT and DO LOOP on lxf:

FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"

It's interesting that the FOR NEXT loop runs slightly faster, but
anyway, as you can see, the loop itself can be reduced to one inc/dec
and one branch, whether you use DO..LOOP or FOR..NEXT. So FOR..NEXT
does not appear to be particularly useful.

I have discussed the slowness of keeping the loop counter(s) in memory
rather than in registers elsewhere, and won't discuss this here;
instead, I will use symbolic names for the various data involved. For
efficiency, you should keep index/index1 in a register; you can keep
the other permanent data (limit) in memory.

Instead I discuss ways to implement DO, LOOP, I, and +LOOP:

1) The straightforward way to implement LOOP is to use the index as the
counter, and compare it to the limit.

loop:
inc index
cmp index,limit
jne start

This means that DO is simple:

do:
make room for or save index and limit
mov tos->index
mov sec->limit
pop tos and sec

And I is also simple: you can use the index directly.

+LOOP is more complex, reflecting the complexity of the specification.
E.g., it could look as follows:

+loop:
olddiff = index-limit
index = index+tos
newdiff = olddiff+tos
tmp1 = olddiff xor newdiff
tmp2 = olddiff xor tos
pop tos
test tmp1,tmp2 # test is like and, but changes only the flags
jns start

This code emulates the overflow computation for which +LOOP was
designed. You can simplify it a little by using the overflow feature
of architectures that have it:

+loop:
offset = minint-limit
tmp = offset+index
index = index+tos
tmp=tmp+tos
pop tos without changing flags
jno start


2) You notice that the latter +LOOP implementation computes offset in
every loop iteration. You can pull this out of the loop by putting
it on the return stack in DO, resulting in:

do:
make room for or save index, limit and offset
mov tos->index
mov sec->limit
offset = minint-limit
pop tos and sec

+loop:
tmp = offset+index
index = index+tos
tmp=tmp+tos
pop tos without changing flags
jno start

LOOP and I are unchanged; limit is needed for LOOP, offset for +LOOP.
If DO knows how the loop ends, it can eliminate the unused value and
its computation.


3) Instead of using the index as the loop control variable, use
offset+index (called index1 in the following), whose overflow
indicates loop termination; also, for alignment with the common
practice, I use offset1=-offset instead of offset. This approach
simplifies +LOOP quite a bit, LOOP a little, but you now have to
recreate the index when you perform I.

do:
make room for or save index1 and offset
offset1 = sec-minint #or use + or XOR
index1 = tos-offset1
pop tos and sec

loop:
inc index1
jno start

i:
index = index1+offset1

+loop:
index1 = index1+tos
pop tos without changing flags
jno start


4) Instead of recomputing index every time you use I, you just keep
both index and index1 up-to-date throughout the loop. This eliminates
the need for offset1, but it means that you should keep two values
(index and index1) instead of one in registers if you want your loop
to perform well.

do:
make room for or save index and index1
mov tos->index
tmp = sec-minint #or use + or XOR
index1 = index-tmp
pop tos and sec

loop:
inc index
inc index1
jno start

I uses the value of index directly.

+loop:
index = index+tos
index1 = index1+tos
pop tos without changing flags
jno start


5) The motivation for going for approach 2, 3, and 4 comes from +LOOP
and the desire to have a common DO for LOOP and +LOOP. While most
compilers don't look ahead from DO to the end of the loop, many see
the increment when they see +LOOP; this offers another approach for
reducing the +LOOP overhead while staying otherwise with approach 1:
for a positive constant increment +LOOP can be implemented much
simpler:

/LOOP#:
index = index + increment
tmp = index-limit
cmp increment,tmp #Intel argument ordering
jnbe start

Negative constant increments require a slight variation of this
approach (but that part is not implemented in Gforth and is left as an
exercise to the reader). In the rare cases where the increment is not
known at compile time, fall back to the +LOOP of approach 1.


Gforth uses approach 5 (for positive increments, approach 1 for
all others), SwiftForth and lxf use approach 3, iForth and VFX use
approach 4. In particular with approach 3, the only advantage of
FOR...NEXT is that it saves a few instructions in DO (4 instructions
in lxf). But even approach 1 and approach 4 can perform 1
cycle/iteration for an empty DO LOOP. And I have measured 1.37
cycles/iteration on Skylake for DO +LOOP with approach 2; that all
assumes that index/index1 is in a register (or in two registers for
approach 4).

Could a compiler decide which approach to use depending on the code?
If it sees the whole loop before deciding to generate code, it could
select the approach depending on whether and how +LOOP is used and how
many occurrences of I there are.

But even a compiler that generates the code immediately could start
out with approach 3; when seeing I, compute the index, and put it in a
register. If that register is not needed for something else, further
occurrences of I use that value. But if the register was used for
something else in between, just recompute the index.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
Brian Fox
2021-01-10 15:07:56 UTC
Permalink
Post by Anton Ertl
FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"
Thanks Anton. Gives me lots of ideas to play with.
dxforth
2021-01-11 00:40:51 UTC
Permalink
Post by Brian Fox
Post by Anton Ertl
FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"
Thanks Anton. Gives me lots of ideas to play with.
It took me a moment to realize the DEC JS is the FOR part,
while JMP is the NEXT. A neat strategy for handling the cnt=0
scenario that I'd not seen before.
Anton Ertl
2021-01-11 07:20:57 UTC
Permalink
Post by dxforth
Post by Brian Fox
Post by Anton Ertl
FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"
Thanks Anton. Gives me lots of ideas to play with.
It took me a moment to realize the DEC JS is the FOR part,
while JMP is the NEXT.
I wondered about that, and thought it was a code generation snafu.
But you are right:

: foo for bar next ;

results in the following loop code:

804FBF7 FF0C24 dec dword [esp]
804FBFA 0F8807000000 js "0804FC07"
804FC00 E8DFFFFFFF call BAR
804FC05 EBF0 jmp "0804FBF7"
Post by dxforth
A neat strategy for handling cnt=0
scenario I'd not seen before.
I would generate the code as follows:

dec <index>
js end
start:
... loop body ...
dec <index>
jns start
end:

or, for slightly smaller size:

jmp entry
start:
... loop body ...
entry:
dec <index>
jns start

I would expect these variants to perform better in some situations on
some CPUs (but with a smaller effect than putting the index into a
register).

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
P Falth
2021-01-11 18:24:09 UTC
Permalink
Post by Anton Ertl
Post by dxforth
FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"
Thanks Anton. Gives me lots of ideas to play with.
It took me a moment to realize the DEC JS is the FOR part,
while JMP is the NEXT.
I wondered about that, and thought it was a code generation snafu.
: foo for bar next ;
804FBF7 FF0C24 dec dword [esp]
804FBFA 0F8807000000 js "0804FC07"
804FC00 E8DFFFFFFF call BAR
804FC05 EBF0 jmp "0804FBF7"
Post by dxforth
A neat strategy for handling cnt=0
scenario I'd not seen before.
dec <index>
js end
... loop body ...
dec <index>
jns start
jmp entry
... loop body ...
dec <index>
jns start
I would expect these variants to perform better in some situations on
some CPUs (but a smaller effect than putting the index into a
register).
- anton
It is so long ago that I made that implementation that I have no
memory of why that solution was chosen.

Just for comparison I also implemented Anton's second proposal above.
The results for an empty loop are

timer-reset 1000000000 t4 .elapsed elapsed time: 2360 milli-seconds ok
timer-reset 1000000000 t5 .elapsed elapsed time: 2500 milli-seconds ok
timer-reset 1000000000 t6 .elapsed elapsed time: 2141 milli-seconds ok
timer-reset 1000000000 t4 .elapsed elapsed time: 2328 milli-seconds ok
timer-reset 1000000000 t5 .elapsed elapsed time: 2469 milli-seconds ok
timer-reset 1000000000 t6 .elapsed elapsed time: 2172 milli-seconds ok

T4 is my original for loop
T5 is Anton's suggestion
T6 is : T6 0 DO LOOP ;

This is running on a Xeon E5-4657L, max turbo 2.9 GHz.

For another comparison I also ran the empty DO LOOP on my lxf64.
It is a token-threaded system.

timer-reset 1000000000 t1 .elapsed 2480 ms elapsed ok
timer-reset 1000000000 t2 .elapsed 376 ms elapsed ok

T2 is T1 run thru the code generator in development.
On this system the index is kept in a register (R15)
The inner loop is
$A0F006 49FFC7 inc r15
$A0F009 71FB jno 0a0f006h

looks like a speed-up of almost 7 times for the empty loop!

Best Regards
Peter
minf...@arcor.de
2021-01-11 18:58:08 UTC
Permalink
Post by P Falth
Post by Anton Ertl
Post by dxforth
FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"
Thanks Anton. Gives me lots of ideas to play with.
It took me a moment to realize the DEC JS is the FOR part,
while JMP is the NEXT.
I wondered about that, and thought it was a code generation snafu.
: foo for bar next ;
804FBF7 FF0C24 dec dword [esp]
804FBFA 0F8807000000 js "0804FC07"
804FC00 E8DFFFFFFF call BAR
804FC05 EBF0 jmp "0804FBF7"
Post by dxforth
A neat strategy for handling cnt=0
scenario I'd not seen before.
dec <index>
js end
... loop body ...
dec <index>
jns start
jmp entry
... loop body ...
dec <index>
jns start
I would expect these variants to perform better in some situations on
some CPUs (but a smaller effect than putting the index into a
register).
- anton
It is so long time ago that I made that implementation so I have no
memory of why that solution was chosen.
Just for comparison I also implemented Anton's second proposal above.
The results for an empty loop are
timer-reset 1000000000 t4 .elapsed elapsed time: 2360 milli-seconds ok
timer-reset 1000000000 t5 .elapsed elapsed time: 2500 milli-seconds ok
timer-reset 1000000000 t6 .elapsed elapsed time: 2141 milli-seconds ok
timer-reset 1000000000 t4 .elapsed elapsed time: 2328 milli-seconds ok
timer-reset 1000000000 t5 .elapsed elapsed time: 2469 milli-seconds ok
timer-reset 1000000000 t6 .elapsed elapsed time: 2172 milli-seconds ok
T4 is my original for loop
T5 is Anton's suggestion
T6 is : T6 0 DO LOOP ;
This is running on an Xeon E5-4657L, max turbo 2.9
For another comparison I did run also the empty do loop on my lxf64.
It is a token threaded system
timer-reset 1000000000 t1 .elapsed 2480 ms elapsed ok
timer-reset 1000000000 t2 .elapsed 376 ms elapsed ok
T2 is T1 run thru the code generator in development.
On this system the index is kept in a register (R15)
The inner loop is
$A0F006 49FFC7 inc r15
$A0F009 71FB jno 0a0f006h
looks like a speed up of almost 7 times for the empty loop!
That's a definite achievement!

Modern C compilers would optimize an empty loop completely away.
This does not belittle the achievement; it only indicates that
micro-benchmarks have to be selected carefully.
Anton Ertl
2021-01-11 19:11:36 UTC
Permalink
Post by ***@arcor.de
Post by P Falth
On this system the index is kept in a register (R15)
The inner loop is
$A0F006 49FFC7 inc r15
$A0F009 71FB jno 0a0f006h
looks like a speed up of almost 7 times for the empty loop!
That's a definitive achievement!
Modern C compilers would optimize an empty loop completely away.
This does not belittle the achievement, it only indicates that
micro-benchmarks have to be seleced carefully.
In the present case the microbenchmark measures the performance of
different ways to implement a counted loop.

Given that this and busy-waiting loops* are pretty much the only
reasons to write an empty loop, such "modern C compilers" frustrate the
programmer's intent in yet another case; shame on their maintainers.

* For busy-waiting on IA-32 and AMD64, you probably want to insert the
PAUSE instruction in the loop; it allows the other thread to work
faster and reduces power consumption.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
P Falth
2021-01-11 22:20:27 UTC
Permalink
Post by ***@arcor.de
Post by P Falth
Post by Anton Ertl
Post by dxforth
FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"
Thanks Anton. Gives me lots of ideas to play with.
It took me a moment to realize the DEC JS is the FOR part,
while JMP is the NEXT.
I wondered about that, and thought it was a code generation snafu.
: foo for bar next ;
804FBF7 FF0C24 dec dword [esp]
804FBFA 0F8807000000 js "0804FC07"
804FC00 E8DFFFFFFF call BAR
804FC05 EBF0 jmp "0804FBF7"
Post by dxforth
A neat strategy for handling cnt=0
scenario I'd not seen before.
dec <index>
js end
... loop body ...
dec <index>
jns start
jmp entry
... loop body ...
dec <index>
jns start
I would expect these variants to perform better in some situations on
some CPUs (but a smaller effect than putting the index into a
register).
- anton
It is so long time ago that I made that implementation so I have no
memory of why that solution was chosen.
Just for comparison I also implemented Anton's second proposal above.
The results for an empty loop are
timer-reset 1000000000 t4 .elapsed elapsed time: 2360 milli-seconds ok
timer-reset 1000000000 t5 .elapsed elapsed time: 2500 milli-seconds ok
timer-reset 1000000000 t6 .elapsed elapsed time: 2141 milli-seconds ok
timer-reset 1000000000 t4 .elapsed elapsed time: 2328 milli-seconds ok
timer-reset 1000000000 t5 .elapsed elapsed time: 2469 milli-seconds ok
timer-reset 1000000000 t6 .elapsed elapsed time: 2172 milli-seconds ok
T4 is my original for loop
T5 is Anton's suggestion
T6 is : T6 0 DO LOOP ;
This is running on an Xeon E5-4657L, max turbo 2.9
For another comparison I did run also the empty do loop on my lxf64.
It is a token threaded system
timer-reset 1000000000 t1 .elapsed 2480 ms elapsed ok
timer-reset 1000000000 t2 .elapsed 376 ms elapsed ok
T2 is T1 run thru the code generator in development.
On this system the index is kept in a register (R15)
The inner loop is
$A0F006 49FFC7 inc r15
$A0F009 71FB jno 0a0f006h
looks like a speed up of almost 7 times for the empty loop!
That's a definitive achievement!
Modern C compilers would optimize an empty loop completely away.
This does not belittle the achievement, it only indicates that
micro-benchmarks have to be seleced carefully.
Sorry, but I was not clear in my statement. The 7 times speedup refers to
using a register for the index instead of placing it on the return stack in memory.
That is also why I chose an empty loop: to see the impact of a design decision.

My code generator does not do many optimizations, and I would certainly not
add complexity to eliminate empty loops. If someone writes an empty loop,
they will get one, as I assume that is what was wanted.

Peter
dxforth
2021-01-12 00:39:24 UTC
Permalink
Post by Anton Ertl
Post by dxforth
Post by Brian Fox
Post by Anton Ertl
FOR NEXT DO LOOP
dec dword [esp] inc dword [esp]
js "0804FBFD" jno "0804FC1E"
jmp "0804FBF2"
Thanks Anton. Gives me lots of ideas to play with.
It took me a moment to realize the DEC JS is the FOR part,
while JMP is the NEXT.
I wondered about that, and thought it was a code generation snafu.
: foo for bar next ;
804FBF7 FF0C24 dec dword [esp]
804FBFA 0F8807000000 js "0804FC07"
804FC00 E8DFFFFFFF call BAR
804FC05 EBF0 jmp "0804FBF7"
Post by dxforth
A neat strategy for handling cnt=0
scenario I'd not seen before.
dec <index>
js end
... loop body ...
dec <index>
jns start
jmp entry
... loop body ...
dec <index>
jns start
I would expect these variants to perform better in some situations on
some CPUs (but a smaller effect than putting the index into a
register).
Here's an implementation of the latter in DX-Forth:

application
-? code n 1 # 0 [bp] sub 1 $ jc bran ) jmp 1 $:
2 # bp add 2 # si add next end-code

system
: FOR ( u ) postpone >r postpone ahead postpone begin
1 cs-roll ; immediate

: NEXT ( ) postpone then postpone n <resolve ; immediate
application behead n n

At this point the code looks very similar to eForth's, which went on to
factor out AFT in an effort to extract more flexibility.

Picking up on what Albert said, it makes little sense today to be
looking for a replacement for DO LOOP that's of the same ilk. DO LOOP
was a product of the times when forth had limited resources and no
optimizer. Modern forth can afford a generic looping scheme such as
Minforth suggested and leave it to the compiler to optimize away any
clumsiness. If that excludes those with vintage compilers such as
myself, so be it.
Hugh Aguilar
2021-01-12 07:38:26 UTC
Permalink
...it makes little sense today to be
looking for a replacement for DO LOOP that's of the same ilk. DO LOOP
was a product of the times when forth had limited resources and no
optimizer.
This is true.
Charles Moore's first (and arguably only) Forth that got used for
practical applications was on the PDP-11.
The PDP-11 had a severe shortage of registers. This is why he had to use
registers for multiple unrelated purposes, and leave it up to the user to make sure
that these unrelated purposes didn't clash with each other.
The most obvious example was that the return-stack was used for:
1.) holding return-addresses (obviously; that is why it is called the return-stack).
2.) holding DO LOOP parameters.
3.) holding >R data
This is extremely confusing! Novices wonder why their >R data is not available
inside of a DO LOOP. They wonder why the index for a DO LOOP is called I inside of
the DO LOOP but is called J inside of a nested DO LOOP --- it is the same index,
but with a different name!

ANS-Forth reeks of the PDP-11 limitations and design decisions!
ANS-Forth was over a decade obsolete when it was released in 1994.
The idea of a Forth "Standard" in 1994 that only made sense for a 1970s vintage
processor was absurd --- everybody knew this except Elizabeth Rather --- she was just
a maintenance programmer for Charles Moore's legacy Forth, trying to keep that
corpse warm forever, but not really understanding anything about programming.

Quite humorously, Ilya Tarasov described Forth-200x as the galvanized corpse
of ANS-Forth --- jolted with electricity to make it twitch and grimace, but still quite dead.

Less humorously, I have described the modern-day Forth community as being similar to
the Donner Party --- no longer able to hunt down prey in the forest, they kill and eat their own.
The Forth-200x committee strive to defeat me by telling a lot of lies about me,
saying that my code doesn't work although it obviously does work:
https://groups.google.com/g/comp.lang.forth/c/VneJx1NnLu8
Defeating me with lies isn't a long-term plan. What else do they have in mind for the future?
They have no plan for the future, except an endless stream of idiotic jibber-jabber about
recognizers and other nonsense that has nothing to do with Forth programming.
...leave it to the compiler to optimize away any
clumsiness. If that excludes those with vintage compilers such as
myself, so be it.
You don't need a complicated optimizer.
You just need efficient local variables.
I and J were essentially local variables, but with a lot of weird restrictions
(such as the fact that there were only two of them, they were always called I and J,
and they were clashing with the >R data).
If you have local variables, you don't need I and J --- you can use local variables
for your index and limit --- all you have to do is optimize the WHILE so it does
the increment, comparison and branch efficiently.
This is not a complicated optimizer --- this is pretty straight-forward.
Anton Ertl
2021-01-12 10:32:15 UTC
Permalink
Post by dxforth
Picking up on what Albert said, it makes little sense today to be
looking for a replacement for DO LOOP that's of the same ilk. DO LOOP
was a product of the times when forth had limited resources and no
optimizer. Modern forth can afford a generic looping scheme such as
Minforth suggested and leave it to the compiler to optimize away any
clumsiness.
Which suggestion by Minforth are you referring to?

The suggestions that I am aware of are to use variations on DO..LOOP
(?DO +DO U+DO -DO U-DO /LOOP -LOOP), or FOR..NEXT, or
BEGIN..WHILE..REPEAT.

I find that FOR..NEXT and BEGIN..WHILE..REPEAT have one shortcoming compared
to DO..LOOP variants: DO..LOOP and friends give me one value (I) that
I don't have to juggle on the data or return stack (remember that the
Swap Dragon hates jugglers:-), which often allows me to write neater
code. If the programmer has to write clumsy code to replace DO..LOOP
and friends, sophisticated compilers may be able to avoid the
execution cost of that clumsiness, but the clumsiness still costs:
when writing the program, when reading the program, when modifying the
program.

FOR..NEXT may put a value in I, but the value is usually not the most
convenient one for the loop's contents, making the code inside the
loop more clumsy.

A stack purist might see DO..LOOP as a better alternative to locals,
or as the first step on a slippery slope. A Forth traditionalist
should certainly welcome DO..LOOP, because Chuck Moore himself brought
it down from the mountain on stone tablets (OTOH, in the meantime he
has abandoned it in favour of general loops implemented through tail
recursion).

What I still find missing in that regard is a good counted loop for
going backwards through an array. If we have boundary addresses start
end, with, say cell stride, we want I to produce

end-cell end-2*cell end-3*cell ... start+cell start

Neither standard +LOOP nor Gforth's -LOOP offer this at the moment.
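As a sketch of the index sequence being asked for, assuming boundary addresses start and end and a constant stride (the numbers below are made up for illustration):

```python
def backward_walk(start, end, stride):
    """Yield end-stride, end-2*stride, ..., down to start inclusive,
    for the half-open address range [start, end)."""
    i = end - stride
    while i >= start:
        yield i
        i -= stride

# e.g. 8-byte cells, start=1000, end=1040:
print(list(backward_walk(1000, 1040, 8)))  # [1032, 1024, 1016, 1008, 1000]
```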

However, walking backwards through an array is pretty rare, so maybe
it's ok to be clumsy in those cases. But OTOH, the behaviour of +LOOP
with negative increments is unintuitive (and probably requires looking
up when used rarely), -LOOP is non-standard and will probably also be
looked up, so one might just as well go for a backwards-stepping loop
for arrays.

There are two common ways for describing arrays: start count and start
end; my impression is that start count is more frequent. So one would
pass start, count, and the element size to the looping construct. A
disadvantage of this approach is that it is only good for walking the
array backwards one element at a time. If you want to walk a part of
an array and/or skip elements, it's not so great; words for
manipulating such array descriptions (e.g., a generalized /STRING)
might help.

One other thing I have noticed when looking at the 44 occurrences
of +LOOP in the current Gforth image: I have seen 5 occurrences of

I - +LOOP

[4 like this, one with a little distance between I - and LOOP]

I.e., we already have the next index on the stack, and then do I -
+LOOP to satisfy the requirements of +LOOP. One may wonder if it's
not better to just use BEGIN. Let's look at one example:

: u8width ( xcaddr u -- n )
0 rot rot bounds ?DO
I xc@+ swap >r
dup #tab = IF drop 1+ dfaligned ELSE xc-width + THEN
r> I - +LOOP ;

This computes the screen width of a string. Here's my untested
rewrite to use BEGIN:

: u8width ( xcaddr u -- n )
over + >r 0 begin ( xcaddr1 n2 r:end )
dup #tab = if
drop r> 1+ 8 naligned ( xc-addr2 n3 r:end )
else
xc-width r> + then ( xc-addr2 n3 r:end )
repeat
nip r> drop ;

Apart from the misuse of DFALIGNED, the original looks better to me.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
dxforth
2021-01-13 00:43:37 UTC
Permalink
Post by Anton Ertl
Post by dxforth
Picking up on what Albert said, it makes little sense today to be
looking for a replacement for DO LOOP that's of the same ilk. DO LOOP
was a product of the times when forth had limited resources and no
optimizer. Modern forth can afford a generic looping scheme such as
Minforth suggested and leave it to the compiler to optimize away any
clumsiness.
Which suggestion by Minforth are you referring to?
: T1 5 FOR N . NEXT ;
T1 0 1 2 3 4 ok
: T2 100 5 2 FOR> N . NEXT ;
T2 100 102 104 106 108 ok
: T3 100 5 2 <FOR N . NEXT ;
T3 108 106 104 102 100 ok
Handy when you walk over arrays.
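For reference, the index sequences in the transcript above can be modeled in Python (the words themselves are Minforth-specific; their semantics here are inferred from the printed output):

```python
def for_up(cnt):                 # cnt FOR N . NEXT
    return list(range(cnt))

def for_fwd(start, cnt, step):   # start cnt step FOR> N . NEXT
    return [start + k * step for k in range(cnt)]

def for_back(start, cnt, step):  # start cnt step <FOR N . NEXT
    return for_fwd(start, cnt, step)[::-1]

print(for_up(5))            # 0 1 2 3 4
print(for_fwd(100, 5, 2))   # 100 102 104 106 108
print(for_back(100, 5, 2))  # 108 106 104 102 100
```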
Which appears similar to the Tachyon forth scheme. If it does everything
DO LOOP and FOR NEXT currently do (which needs investigating) why not
replace the former if all it takes is compile-time resources? At some
point Forth has to break with the past to go forward, otherwise everything
looks like a band-aid fix.
Paul Rubin
2021-01-13 02:26:05 UTC
Permalink
Post by ***@arcor.de
: T2 100 5 2 FOR> N . NEXT ;
T2 100 102 104 106 108 ok
... If it does everything DO LOOP and FOR NEXT currently do (which
needs investigating) why not replace the former if all it takes is
compile-time resources?
How does it handle getting at the loop indexes of nested loops, like I J
K do for DO loops?
At some point Forth has to break with the past to go forward,
otherwise everything looks like a band-aid fix.
It is a clever hack, but I think not exactly a step forward. But there
is not likely to be much agreement on what "forward" means. "Forward"
might point in distinct directions for different types of Forths
(minimal, hardware-implemented, full-featured, etc.)
Marcel Hendrix
2021-01-13 08:19:58 UTC
Permalink
On Wednesday, January 13, 2021 at 3:26:07 AM UTC+1, Paul Rubin wrote:
[..]
It is a clever hack, but I think not exactly a step forward. But there
is not likely to be much agreement on what "forward" means. "Forward"
might point in distinct directions for different types of Forths
(minimal, hardware-implemented, full-featured, etc.)
The difference with say 30 years ago is that in those days there
were many lofty ideas on designing Forth, all talk, no testing at
all. Nowadays it is almost the reverse. The direction of Forth
as a language, both internal and external, is what its actual
users try to do with it. We may not like it, but it means that
the vendors drive what happens next.

-marcel
Hugh Aguilar
2021-01-13 16:30:05 UTC
Permalink
Post by Marcel Hendrix
[..]
It is a clever hack, but I think not exactly a step forward. But there
is not likely to be much agreement on what "forward" means. "Forward"
might point in distinct directions for different types of Forths
(minimal, hardware-implemented, full-featured, etc.)
The difference with say 30 years ago is that in those days there
were many lofty ideas on designing Forth, all talk, no testing at
all. Nowadays it is almost the reverse. The direction of Forth
as a language, both internal and external, is what its actual
users try to do with it. We may not like it, but it means that
the vendors drive what happens next.
This is 180 degrees from the truth!
In 1994 I wrote MFX for the MiniForth processor at Testra, and MFX was used
to write their motion-control program for the laser etcher.
This was tested thoroughly, and there were no bugs.
I saw the laser etcher in operation and it was moving fast. The lines etched were
of a consistent width, indicating that there were no slow-downs that would cause
a blotch, no matter how many twists and turns the line made.
The laser-etcher was able to etch very readable text into wood.

At the same time, the mighty Forth vendors (not LMI who quit) were pushing
ANS-Forth through ANSI without any testing at all. There was no reference compiler.
The ANS-Forth document was badly infected with ambiguity and general nonsense.
Years later Forth Inc. came out with SwiftForth that was so bug-ridden as to be useless.
As late as version-2, SwiftForth would crash if (LOCAL) was used --- this indicates that
(LOCAL) was never tested even once for several years, or they would have certainly
noticed that the whole system crashed --- these were not subtle bugs!

The vendors are in the driver's seat, but it is a clown-car at the circus.
The Forth-200x mailing-list is full of "many lofty ideas on designing Forth,
all talk, no testing at all." By far, their most lofty (idiotic) idea is recognizers,
and they have been talking about this nonsense for years.
The Forth-200x committee will never succeed.
The Forth-200x committee may not like it,
but it means that the actual Forth programmers drive what happens next,
and the sales clowns representing vendors have no authority whatsoever.
dxforth
2021-01-13 08:20:53 UTC
Permalink
Post by Paul Rubin
Post by ***@arcor.de
: T2 100 5 2 FOR> N . NEXT ;
T2 100 102 104 106 108 ok
... If it does everything DO LOOP and FOR NEXT currently do (which
needs investigating) why not replace the former if all it takes is
compile-time resources?
How does it handle getting at the loop indexes of nested loops, like I J
K do for DO loops?
A question better answered by the implementers. N above returns the
current index. What forth implements K?
Post by Paul Rubin
At some point Forth has to break with the past to go forward,
otherwise everything looks like a band-aid fix.
It is a clever hack, but I think not exactly a step forward.
What would be a step forward?
Post by Paul Rubin
But there
is not likely to be much agreement on what "forward" means. "Forward"
might point in distinct directions for different types of Forths
(minimal, hardware-implemented, full-featured, etc.)
Not a problem. Folks satisfied with minimal threaded-code systems
have moved on. What are the rest waiting for?
Paul Rubin
2021-01-13 09:46:01 UTC
Permalink
Post by dxforth
A question better answered by the implementers. N above returns the
current index. What forth implements K ?
Gforth has K. I thought it was standard.
Post by dxforth
Post by Paul Rubin
It is a clever hack, but I think not exactly a step forward.
What would be a step forward?
For loop control? Shrug. In general? Improving support for local
variables and moving towards a style that uses them freely, is one thing
that comes to mind.
Post by dxforth
Not a problem. Folks satisfied with minimal threaded-code systems
have moved on. What are the rest waiting for?
The loop words you described seemed suited for minimal systems. Don't
people still use those? eForth is still popular, as are Camelforth
variants etc.
minf...@arcor.de
2021-01-13 10:36:15 UTC
Permalink
Post by dxforth
What would be a step forward?
For loop control? Shrug. In general? Improving support for local
variables and moving towards a style that uses them freely, is one thing
that comes to mind.
That's not difficult, as long as one does not waste time on futile discussions
about standardization.

"I did it my way." (Frank Sinatra):

The standard locals syntax, simplified, goes:
: T1 {: L1 L2 | L3 -- L4 :} <code> ;
with -- L4 discarded as comment. What a waste.
But necessary since standard Forth knows nothing of named stack elements.

Now and then I use my own 'amendment'
: T2 { L1 L2 | L3 -> L4 } <code> ; \ -- replaced by ->
with L4 being a local synonymous with the top of stack (former L1 position).
Stack operations within <code> appear above L4. The stack pointer is
automatically adjusted afterwards.

It is also possible to do
: T3 { L1 L2 | L3 -> L2 } <code> ;
minf...@arcor.de
2021-01-13 12:36:59 UTC
Permalink
Post by ***@arcor.de
Post by dxforth
What would be a step forward?
For loop control? Shrug. In general? Improving support for local
variables and moving towards a style that uses them freely, is one thing
that comes to mind.
That's not difficult, as long as one does not waste time on futile discussions
about standardization.
: T1 {: L1 L2 | L3 -- L4 :} <code> ;
with -- L4 discarded as comment. What a waste.
But necessary since standard Forth knows nothing of named stack elements.
Now and then I use my own 'amendment'
: T2 { L1 L2 | L3 -> L4 } <code> ; \ -- replaced by ->
with L4 being a local synonymous with the top of stack (former L1 position).
Stack operations within <code> appear above L4. The stack pointer is
automatically adjusted afterwards.
It is also possible to do
: T3 { L1 L2 | L3 -> L2 } <code> ;
p.s. a VERY readable definition of ROT goes
: ROT { a b c -> b c a } ;
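The "->" output list above amounts to a named stack permutation. A hypothetical Python model (the helper and its name are invented for illustration, not part of any Forth) makes the idea concrete:

```python
def stack_effect(spec):
    """Turn an 'a b c -> b c a' spec into a stack transformer:
    the input names bind the top cells (bottom-to-top), and the
    output list says what to push back in what order."""
    ins, outs = (side.split() for side in spec.split("->"))
    def apply(stack):
        env = dict(zip(ins, stack[-len(ins):]))  # name the top cells
        return stack[:-len(ins)] + [env[name] for name in outs]
    return apply

rot = stack_effect("a b c -> b c a")
print(rot([1, 2, 3]))  # [2, 3, 1] -- same effect as Forth ROT
```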
dxforth
2021-01-14 01:49:32 UTC
Permalink
Post by Paul Rubin
Post by dxforth
A question better answered by the implementers. N above returns the
current index. What forth implements K ?
Gforth has K. I thought it was standard.
Post by dxforth
Post by Paul Rubin
It is a clever hack, but I think not exactly a step forward.
What would be a step forward?
For loop control? Shrug. In general? Improving support for local
variables and moving towards a style that uses them freely, is one thing
that comes to mind.
I meant without abandoning forth precepts altogether :)
Post by Paul Rubin
Post by dxforth
Not a problem. Folks satisfied with minimal threaded-code systems
have moved on. What are the rest waiting for?
The loop words you described seemed suited for minimal systems. Don't
people still use those? eForth is still popular, as are Camelforth
variants etc.
IIRC Camelforth uses DO LOOP. IMO a minimal FOR NEXT is false economy.
More application bytes will be spent working around its shortcomings than
had one simply implemented DO LOOP.
Anton Ertl
2021-01-13 12:25:44 UTC
Permalink
Post by dxforth
Post by Anton Ertl
Post by dxforth
Picking up on what Albert said, it makes little sense today to be
looking for a replacement for DO LOOP that's of the same ilk. DO LOOP
was a product of the times when forth had limited resources and no
optimizer. Modern forth can afford a generic looping scheme such as
Minforth suggested and leave it to the compiler to optimize away any
clumsiness.
Which suggestion by Minforth are you referring to?
: T1 5 FOR N . NEXT ;
T1 0 1 2 3 4 ok
Compared to "0 ?DO I . LOOP", very little benefit.
Compared to "?DO ... LOOP", less flexible.
Post by dxforth
Post by Anton Ertl
: T2 100 5 2 FOR> N . NEXT ;
T2 100 102 104 106 108 ok
This lets us work directly with addr count arrays. The corresponding
?DO..+LOOP is a bit more cumbersome:

: t2 100 5 2* bounds ?do i . 2 +loop ;
Post by dxforth
Post by Anton Ertl
: T3 100 5 2 <FOR N . NEXT ;
T3 108 106 104 102 100 ok
This is the case for which I find the existing DO..+LOOP and also
Gforth's U-DO..-LOOP lacking. With DO..+LOOP I have to write

: t3 100 5 dup if 2* bounds swap 2 - do i . -2 +loop else 2drop then ;

With U-DO..-LOOP, I can write

: t3 100 5 2* swap 2 - tuck + -do i . 2 -loop ;

Minforth's variant is significantly shorter.
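The index arithmetic of the DO..+LOOP formulations above can be checked with a small Python model (+LOOP termination is modeled as the index crossing the boundary between limit-1 and limit; cell wraparound is ignored):

```python
def bounds(addr, u):
    """Forth BOUNDS ( addr u -- addr+u addr )."""
    return addr + u, addr

def do_plus_loop(limit, start, step):
    """Indices produced by DO .. step +LOOP."""
    out, i = [], start
    while True:
        out.append(i)
        if (i < limit) != (i + step < limit):  # boundary crossed: done
            return out
        i += step

# t2: 100 5 2* bounds ?do i . 2 +loop
limit, start = bounds(100, 5 * 2)
print(do_plus_loop(limit, start, 2))  # [100, 102, 104, 106, 108]

# t3 (DO..+LOOP version): first index 110-2=108, limit 100, step -2
print(do_plus_loop(100, 108, -2))     # [108, 106, 104, 102, 100]
```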
Post by dxforth
Which appears similar to the Tachyon forth scheme. If it does everything
DO LOOP and FOR NEXT currently do (which needs investigating)
If you have the start end representation of arrays or ranges, these
words don't fit; you would have to do the inverse of 2* BOUNDS, i.e.,
OVER - 2/, if you want to use FOR>.
Post by dxforth
why not
replace the former if all it takes is compile-time resources?
Actually implementing these words does not appear to be particularly
problematic in any of the implementation technologies that come to my
mind, including simple threaded-code. You may need three cells for
the loop parameters (rather than two for DO..LOOP and DO..+LOOP), so
for a simple implementation of J, you must use three cells (or keep
the extra data elsewhere).
Post by dxforth
At some
point Forth has to break with the past to go forward, otherwise everything
looks like a band-aid fix.
Nice bumper sticker, but I fail to see its relevance here. You can
have both DO..LOOP and friends, and Minforth's FOR..NEXT variants. N
is in conflict with SwiftForth's (and now Gforth's) word N (part of
the IDE), but I fail to see the advantage of renaming I to N anyway.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
dxforth
2021-01-14 05:51:17 UTC
Permalink
Post by Anton Ertl
Post by dxforth
At some
point Forth has to break with the past to go forward, otherwise everything
looks like a band-aid fix.
Nice bumper sticker, but I fail to see its relevance here. You can
have both DO..LOOP and friends, and Minforth's FOR..NEXT variants. N
is in conflict with SwiftForth's (and now Gforth's) word N (part of
the IDE), but I fail to see the advantage of renaming I to N anyway.
Presumably to maintain ANS compliance. As for relevance, when Moore
decided DO LOOP was problematic for him he got rid of it. How many
counted loop constructs does a language need? More than one looks
like indecision.
Paul Rubin
2021-01-14 08:49:46 UTC
Permalink
As for relevance, when Moore decided DO LOOP was problematic for him
he got rid of it.
He was developing for his nano-cpus (GA144 etc.) with 64 memory cells to
hold the program and data, tiny even by minimalist Forther standards.
So he really had to stay with the tiniest possible construct.
How many counted loop constructs does a language need? More than one
looks like indecision.
DO or ?DO has been ok for me. I never felt that I needed FOR/NEXT
if something like DO is present.
minf...@arcor.de
2021-01-14 15:51:07 UTC
Permalink
Post by Paul Rubin
As for relevance, when Moore decided DO LOOP was problematic for him
he got rid of it.
He was developing for his nano-cpus (GA144 etc.) with 64 memory cells to
hold the program and data, tiny even by minimalist Forther standards.
So he really had to stay with the tiniest possible construct.
How many counted loop constructs does a language need? More than one
looks like indecision.
DO or ?DO has been ok for me. I never felt that I needed FOR/NEXT
if something like DO is present.
I don't use standard DO's anymore (let alone UNLOOP) because FOR/FOR>/<FOR
are more practical in my applications. FOR> sees the most use, for traversing long
time-series data, where a sample point is a structured compound data element.

IMO such code is easier to read and maintain than with ?DO..+LOOP constructs.
DO..+LOOPs break the loop parameters apart: they require the loop-parameter
calculation in front of the loop, and lay down the step width only at the end
of the loop. This is "unnatural".

Coding and code maintenance time is more precious than squeezing nanoseconds
out of programs (thereby making them more complex and more error-prone)
in about 99.9% of all cases.

It's the famous KISS principle again.
Howerd Oakford
2021-01-14 17:07:09 UTC
Permalink
Post by Paul Rubin
As for relevance, when Moore decided DO LOOP was problematic for him
he got rid of it.
He was developing for his nano-cpus (GA144 etc.) with 64 memory cells to
hold the program and data, tiny even by minimalist Forther standards.
So he really had to stay with the tiniest possible construct.
How many counted loop constructs does a language need? More than one
looks like indecision.
DO or ?DO has been ok for me. I never felt that I needed FOR/NEXT
if something like DO is present.
Hi Paul,

Agreed - if you have DO LOOP you don't need FOR NEXT.

The code for FOR and NEXT in colorForth looks like this :
: push ( a -- ) $50 1, drop ;
: for ( n -- ) push begin ;
: next ( -- ) $75240CFF , ;

Much simpler than DO LOOP .

Push just assembles the 1-byte x86 opcode for "PUSH eax" .
"begin" puts HERE on the stack.
"next" assembles 4 bytes that decrement, compare and jmp back to "begin"
if not zero.
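The four bytes laid down by "next" can be unpicked in Python. The 32-bit literal is stored little-endian; reading the result as standard x86 encodings, FF 0C 24 is dec dword [esp] and 75 is the jnz opcode (the backward displacement to "begin" would be laid down separately):

```python
import struct

code = struct.pack("<I", 0x75240CFF)  # 32-bit literal, little-endian
print(code.hex(" "))                  # ff 0c 24 75

# ff 0c 24 -> dec dword [esp]  (decrement the count pushed by "for")
# 75       -> jnz rel8         (branch back to "begin" if nonzero)
```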

"begin" is actually a primitive in the NASM source :
begin_:
mov [ list ], esp
here:
_DUP_
mov _TOS_, [v_H]
ret

It is using the loop opcodes that are built into the x86 architecture.

In the GA144, each F18 processor also has a loop construct built into
its instruction set (about 3.6 opcodes fit in each 18-bit word), so the
F18 assembler uses that.

I don't think DO LOOP is any more problematic for Chuck than any other
bit or byte that is not necessary ;-)

Cheers,
Howerd
dxforth
2021-01-15 01:32:43 UTC
Permalink
Post by Paul Rubin
As for relevance, when Moore decided DO LOOP was problematic for him
he got rid of it.
He was developing for his nano-cpus (GA144 etc.) with 64 memory cells to
hold the program and data, tiny even by minimalist Forther standards.
So he really had to stay with the tiniest possible construct.
FOR NEXT has a longer history than that. Suffice to say Moore didn't
care for DO LOOP for his reasons, and others for theirs.
Post by Paul Rubin
How many counted loop constructs does a language need? More than one
looks like indecision.
DO or ?DO has been ok for me. I never felt that I needed FOR/NEXT
if something like DO is present.
If 200x replaced DO LOOP with something that did away with the former's
quirks and restrictions you wouldn't use it?
Paul Rubin
2021-01-15 01:43:14 UTC
Permalink
Post by dxforth
If 200x replaced DO LOOP with something that did away with the former's
quirks and restrictions you wouldn't use it?
I don't think FOR NEXT is the right replacement. If they did something
else of course I'd look at it.

As for the question of what "moving forward" with Forth looks like, I
think we should look towards Oforth and 8th for ideas, rather than
tweaking these low level counted loop things. That said, I haven't
actually used Oforth or 8th.
dxforth
2021-01-15 06:26:55 UTC
Permalink
Post by Paul Rubin
Post by dxforth
If 200x replaced DO LOOP with something that did away with the former's
quirks and restrictions you wouldn't use it?
I don't think FOR NEXT is the right replacement. If they did something
else of course I'd look at it.
If the gold standard is the classic FOR loops of other languages then
ISTM Minforth's is heading in that direction.
Post by Paul Rubin
As for the question of what "moving forward" with Forth looks like, I
think we should look towards Oforth and 8th for ideas, rather than
tweaking these low level counted loop things. That said, I haven't
actually used Oforth or 8th.
So you're more attracted to Forth than the others. Could be forth
is rubbing off on you :)
Paul Rubin
2021-01-15 06:37:39 UTC
Permalink
Post by dxforth
If the gold standard is the classic FOR loops of other languages then
ISTM Minforth's is heading in that direction.
Other languages are moving away from counted loops in favor of loops
over collections, it seems to me.
minf...@arcor.de
2021-01-15 07:01:32 UTC
Permalink
Post by Paul Rubin
Post by dxforth
If the gold standard is the classic FOR loops of other languages then
ISTM Minforth's is heading in that direction.
Other languages are moving away from counted loops in favor of loops
over collections, it seems to me.
While slicing could be implemented by simple loops over equidistant array or
list elements, iterators often can't be. In the end you'll need pointers to data in memory,
pointers between data elements in memory, heap management, and after a while
garbage collection integrated in Forth.

But with that Forth leaves its main domain and becomes just an awkward Lua.
Nobody needs such a beast.
Paul Rubin
2021-01-15 07:33:31 UTC
Permalink
Post by ***@arcor.de
While slicing could be implemented by simple loops over equidistant
array or list elements, iterators often can't. In the end you'll need
pointers to data in memory, pointers between data elements in memory,
heap management, and after a while garbage collection integrated in
Forth.
C++ avoids that stuff and iterates over containers by calling methods
defined in the container classes.
Post by ***@arcor.de
But with that Forth leaves its main domain and becomes just an awkward Lua.
Nobody needs such a beast.
Oforth and 8th both have garbage collection, heap management, and all
that. No idea whether they have left Forth's main domain, are awkward
Luas, etc. They both have user bases by now, who presumably think they
are getting something worthwhile.

Is Forth's main visible characteristic, the exposed 2-stack VM,
something more than an archaic implementation hack? Not for me to
judge, I guess.
minf...@arcor.de
2021-01-15 15:37:35 UTC
Permalink
Post by Paul Rubin
Post by ***@arcor.de
While slicing could be implemented by simple loops over equidistant
array or list elements, iterators often can't. In the end you'll need
pointers to data in memory, pointers between data elements in memory,
heap management, and after a while garbage collection integrated in
Forth.
C++ avoids that stuff and iterates over containers by calling methods
defined in the container classes.
Post by ***@arcor.de
But with that Forth leaves its main domain and becomes just an awkward Lua.
Nobody needs such a beast.
Oforth and 8th both have garbage collection, heap management, and all
that. No idea whether they have left Forth's main domain, are awkward
Luas, etc. They both have user bases by now, who presumably think they
are getting something worthwhile.
IMO 8th and Oforth are desktop languages with their own well-deserved merits.

Still the main domain of classic Forth is programming for resource restricted
devices, often with direct hardware control and sometimes with exotic CPUs.
That's where Forth still can outshine Micropython and Lua, and even C/C++,
which are the dominant programming languages there but lack interactivity.

Old folklore. But making Forth fatter than it is yields little benefit for e.g.
embedded programming.
Paul Rubin
2021-01-24 05:02:39 UTC
Permalink
Post by ***@arcor.de
Still the main domain of classic Forth is programming for resource
restricted devices, often with direct hardware control and sometimes
with exotic CPUs. That's where Forth still can outshine Micropython
and Lua. Even C/C++ that are the dominant programming languages there,
but lack interactivity.
This is reasonable. GC exists for classic Forth (Anton's library)
though it is unable to scan the return stack using pure ANS primitives
iirc. I'd personally welcome some kind of extension to support that.

As for resource restricted devices, there are a few levels of them.

0) those that can't reasonably run Forth programs, e.g. the smallest
PIC and Padauk processors.
1) Those that can accommodate a Forth runtime and support tethered Forth
development, but can't really run a resident text interpreter.
E.g.: 8 bitters with a few K of program space and a few hundred bytes
of ram. Since the other end of the tethered target is a desktop, the
host-side text interpreter can be as fancy as you like.
2) those that can handle a classic-style resident text interpreter but
not a desktop-flavored one. E.g.: classic Arduino, which has 32k of
program space and 2.5k of ram. The whole Forth tradition including
the classic compact text interpreter came from historical environments
like this, classic 8- and 16-bit micro and minicomputers.
3) those that can run stuff like Micropython (small ARM MCU's and the
like, BBC micro:bit, and the new Raspberry Pi Pico). 256k of program
flash and 32k of ram supports a nice minimal micropython. 128k of
flash might be enough for Lua. Note that the complete RPi Pico board
including the dual core Cortex M0+ CPU with 264K of RAM and 2MB of
SPI flash, costs $4.00 US retail and will be made by the gazillion.
It is aimed at the educational market to run Micropython from the start.
4) Anything larger, like embedded Linux, basically counts as a desktop.

"Classic Forth" really seems only relevant to the 2nd level of these.
For the rest, you either want a tethered Forth, or you can afford let's
say a "luxury" Forth with GC, or you can't use Forth at all.

Since the 2nd level is becoming narrower and narrower, while "tiny" will
always be with us, maybe we should think of the future of Forth in terms
of tethered systems where the target system is resource constrained but
the host is powerful.
minf...@arcor.de
2021-01-24 08:10:43 UTC
Permalink
Post by ***@arcor.de
Still the main domain of classic Forth is programming for resource
restricted devices, often with direct hardware control and sometimes
with exotic CPUs. That's where Forth still can outshine Micropython
and Lua. Even C/C++ that are the dominant programming languages there,
but lack interactivity.
This is reasonable. GC exists for classic Forth (Anton's library)
though it is unable to scan the return stack using pure ANS primitives
iirc. I'd personally welcome some kind of extension to support that.
IIUC return stack scanning would only be required for object systems
with dynamically allocated code fragments (methods in heap memory).

I am no expert but that sounds rather exotic to me.

Having dynamically allocated (not allotted) data is really helpful for many
programming tasks. But more on desktop systems.
Anton Ertl
2021-01-24 10:39:41 UTC
Permalink
Post by ***@arcor.de
This is reasonable. GC exists for classic Forth (Anton's library)
though it is unable to scan the return stack using pure ANS primitives
iirc. I'd personally welcome some kind of extension to support that.
You can write it. Or maybe I will.
Post by ***@arcor.de
IIUC return stack scanning would only be required for object systems
with dynamically allocated code fragments (methods in heap memory).
Return stack scanning is required whenever the only reference to an
object can be found on the return stack, e.g.:

: foo
3 alloc throw s" foo" 2 pick swap move
>r collect-garbage 3 alloc throw s" bar" 2 pick swap move
r> 3 type ;
foo

FOO first allocates 3 bytes from garbage-collected memory, stores
string "foo" there, and puts the address of that memory on the return
stack. COLLECT-GARBAGE does not find a reference to that memory, so
it reclaims that memory, and the next ALLOC allocates the same memory
and stores "bar" there. So when we get the address returned by the
first alloc from the return stack, its contents have been overwritten,
and FOO prints "bar" (tested with a freshly included gc.fs; if you
have other allocations before running FOO, results may vary).
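The failure mode is easy to model outside Forth. Below is a minimal C sketch of conservative root scanning (the names `scan_roots` and `points_into` are mine, not gc.fs's API): a cell keeps an object alive only if it lies in a region the collector actually scans, so a reference that exists only on an unscanned return stack does not protect its object.

```c
#include <stddef.h>
#include <stdint.h>

/* Conservative scanning: any value inside [heap, heap+heap_size)
   is treated as a reference that keeps its object alive. */
static int points_into(uintptr_t v, const char *heap, size_t heap_size)
{
    return v >= (uintptr_t)heap && v < (uintptr_t)(heap + heap_size);
}

/* Returns 1 if some cell in roots[0..n) references the heap.
   If the only reference lives in a region that is never passed to
   this function (e.g. the Forth return stack), the collector sees
   no reference and reclaims the object, as in FOO above. */
int scan_roots(const uintptr_t *roots, size_t n,
               const char *heap, size_t heap_size)
{
    for (size_t i = 0; i < n; i++)
        if (points_into(roots[i], heap, heap_size))
            return 1;
    return 0;
}
```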

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
none) (albert
2021-01-24 13:13:39 UTC
Permalink
Post by ***@arcor.de
Post by ***@arcor.de
Still the main domain of classic Forth is programming for resource
restricted devices, often with direct hardware control and sometimes
with exotic CPUs. That's where Forth still can outshine Micropython
and Lua. Even C/C++ that are the dominant programming languages there,
but lack interactivity.
This is reasonable. GC exists for classic Forth (Anton's library)
though it is unable to scan the return stack using pure ANS primitives
iirc. I'd personally welcome some kind of extension to support that.
IIUC return stack scanning would only be required for object systems
with dynamically allocated code fragments (methods in heap memory).
I am no expert but that sounds rather exotic to me.
Having dynamically allocated (not allotted) data is really helpful for many
programming tasks. But more on desktop systems.
Dynamically allocated data is quite helpful even with explicit free
and no garbage collection. Garbage collection is useful at an
abstraction level that is not often used in Forth.
I'm quite happy with ALLOCATE FREE RESIZE and SIZE in my
glyph recognition and compiler optimisation attempts, without it.
Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
minf...@arcor.de
2021-01-24 18:42:30 UTC
Permalink
Post by none) (albert
Post by ***@arcor.de
Post by ***@arcor.de
Still the main domain of classic Forth is programming for resource
restricted devices, often with direct hardware control and sometimes
with exotic CPUs. That's where Forth still can outshine Micropython
and Lua. Even C/C++ that are the dominant programming languages there,
but lack interactivity.
This is reasonable. GC exists for classic Forth (Anton's library)
though it is unable to scan the return stack using pure ANS primitives
iirc. I'd personally welcome some kind of extension to support that.
IIUC return stack scanning would only be required for object systems
with dynamically allocated code fragments (methods in heap memory).
I am no expert but that sounds rather exotic to me.
Having dynamically allocated (not allotted) data is really helpful for many
programming tasks. But more on desktop systems.
Dynamically allocated data is quite helpful even with explicit free
and no garbage collection. Garbage collection is useful at an
abstraction level that is not often used in Forth.
Yes. Last year I thought about using arbitrary-precision floats, specifically
Fabrice Bellard's size- and speed-optimized arbitrary-precision library, libbf:
https://bellard.org/libbf/

I did some tests and they worked out quite well, but did not have the time
(nor feel the pressure) to pursue it further. BTW I was pleased
to read Ron Aaron's message when he included libbf into his 8th compiler.

Arbitrarily sized big floats need dynamic memory allocation. Garbage
collection is not really required, explicit freeing would be enough.
none) (albert
2021-01-26 13:48:52 UTC
Permalink
Post by ***@arcor.de
Post by none) (albert
Post by ***@arcor.de
Post by ***@arcor.de
Still the main domain of classic Forth is programming for resource
restricted devices, often with direct hardware control and sometimes
with exotic CPUs. That's where Forth still can outshine Micropython
and Lua. Even C/C++ that are the dominant programming languages there,
but lack interactivity.
This is reasonable. GC exists for classic Forth (Anton's library)
though it is unable to scan the return stack using pure ANS primitives
iirc. I'd personally welcome some kind of extension to support that.
IIUC return stack scanning would only be required for object systems
with dynamically allocated code fragments (methods in heap memory).
I am no expert but that sounds rather exotic to me.
Having dynamically allocated (not allotted) data is really helpful for many
programming tasks. But more on desktop systems.
Dynamically allocated data is quite helpful even with explicit free
and no garbage collection. Garbage collection is useful at an
abstraction level that is not often used in Forth.
Yes. Last year I thought about using arbitrary precision floats, specifically
https://bellard.org/libbf/
I did some tests and they worked out quite well, but did not have the time
(nor feel the pressure) to pursue it further. BTW I was pleased
to read Ron Aaron's message when he included libbf into his 8th compiler.
Arbitrarily sized big floats need dynamic memory allocation. Garbage
collection is not really required, explicit freeing would be enough.
Please note that the so-called string stacks are also an example of this.
Strings can be of arbitrary length, but the stack keeps track of what is and
what is not allocated or free.
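As a sketch of that bookkeeping (my own illustration in C, not any particular Forth's string stack), pushing copies a string into stack-owned storage and dropping frees it, so allocation and deallocation stay implicit for the user:

```c
#include <stdlib.h>
#include <string.h>

/* A string stack that owns its entries: pushing copies the string,
   dropping frees it.  The user never calls ALLOCATE/FREE directly;
   the stack does the bookkeeping.  Names s_push/s_top/s_drop are
   hypothetical. */
#define SSTACK_MAX 16

static char *sstack[SSTACK_MAX];
static int sdepth;                 /* number of entries */

int s_push(const char *s)
{
    if (sdepth == SSTACK_MAX) return -1;
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);      /* stack takes ownership */
    if (!copy) return -1;
    memcpy(copy, s, len);
    sstack[sdepth++] = copy;
    return 0;
}

const char *s_top(void) { return sdepth ? sstack[sdepth - 1] : NULL; }

void s_drop(void)
{
    if (sdepth) free(sstack[--sdepth]);  /* explicit free, no GC needed */
}
```

Explicit freeing happens exactly when an entry is dropped, which is the point Albert makes: the discipline of the stack replaces garbage collection.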

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Hugh Aguilar
2021-01-26 15:22:46 UTC
Permalink
Please note that the so called string stacks are also an example of [garbage collection].
Strings can be arbitrary length, but the stack keeps track of what is and
what is not allocated or free.
LOL
Albert van der Horst is lying when he says "string stacks" (plural) do this.
My STRING-STACK.4TH is the only string-stack that does this.
none) (albert
2021-01-24 13:09:37 UTC
Permalink
In article <***@nightsong.com>,
Paul Rubin <***@nospam.invalid> wrote:
<SNIP>
Post by Paul Rubin
"Classic Forth" really seems only relevant to the 2nd level of these.
For the rest, you either want a tethered Forth, or you can afford let's
say a "luxury" Forth with GC, or you can't use Forth at all.
Since the 2nd level is becoming narrower and narrower, while "tiny" will
always be with us, maybe we should think of the future of Forth in terms
of tethered systems where the target system is resource constrained but
the host is powerful.
The 2nd level is to this day both the most pleasant and the easiest
to use; see Mecrisp Forth and noForth. The Raspberry, Banana and
Orange Pi with a full-blown Linux come in second place (3rd level).
Then it becomes firmware with a single flash command.
Serial to parallel controls with 40 outputs are feasible on a
Launchpad, but only because Forth is much faster than Python.
You can bit bang a piece of music over a midi line, in Forth,
not in Python.

Micropython is *no* fun. One can only plumb together stuff that is
preprogrammed in C. Neither the plumbing nor the studying of those
C sources is fun or gives insight.

With a 10 euro serial card, or a motherboard with a serial port
you are seriously in business.
(Quite a few modern motherboards have true serial ports at TTL, not
RS-232, levels, but it is not advertised much, e.g. by MSI.)
Or an old laptop or desktop will do.

Even an Orange Pi (4 cores, 64 bit, gigabytes) has a classic serial
port (5V) somewhere in a corner.

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Paul Rubin
2021-02-28 23:22:33 UTC
Permalink
The 2nd level is to this day both the most pleasant and most easy to
use, see mecrisp forth and noforths. ... Then it becomes firmware
with a single flash command. Micropython is *no* fun.
Well, Micropython being fun or not is subjective, I suppose (it seems
like fun to me). Maybe you'd find that Lisp is fun even if MP isn't.

What I'm really wondering, though, is about the preference for a
resident Forth over a tethered one. I'm imagining the usual workflow
with a resident Forth over a serial port:

1. Run terminal emulation software on host PC to interactively
type things like "2 2 + .".
2. Write your application (foo.fs) in a host side text editor, and
send it to the target through the serial port, maybe through some
file transfer pulldown in your terminal emulator app. That breaks
the abstraction of emulating a simple terminal, but that is ok.
3. Now continue to interactively type to the target board.
4. Your terminal emulator might also have something like keyboard
macros, so hitting F1 sends some preset block of text, or whatever.

Is that what you generally do? Or do you use a target side editor,
maybe even with BLOCKs? Does Mecrisp support a file system on the
target?

Originally I'd intended to ask in this post about why you prefer a
resident target interpreter instead of a tethered one, but I'll save
that for later. Meanwhile, here is another tiny target board that just
showed up on Seeed, which I'd say is small enough to make tethering
attractive. It is 8051-based, has 10kB of program flash and 768 bytes
of ram, and costs $1.49 in single quantity:

https://www.seeedstudio.com/CH551G-Development-board-p-4764.html

There's an interesting series of articles about metacompilers by Brad
Rodriguez in Forth Dimensions and I've started reading the first of
them. They seem applicable to this. The FD issues are referenced in
http://bradrodriguez.com/papers/moving4.htm .
With a 10 euro serial card, or a motherboard with a serial port
you are seriously in business.
Most small microprocessor boards these days have USB ports, so
simulating a serial port over a normal USB cable is a matter of some
device side software. That is surely preferable to needing a real
serial port or add-on board. Mecrisp on the RPi Pico doesn't yet
support the Pico's USB, but I like to hope that is a temporary
limitation.

Ron AARON
2021-01-15 07:27:24 UTC
Permalink
Post by Paul Rubin
Post by dxforth
If the gold standard is the classic FOR loops of other languages then
ISTM Minforth's is heading in that direction.
Other languages are moving away from counted loops in favor of loops
over collections, it seems to me.
8th offers both kinds of iteration. 'loop' (and loop-) are counted
loops, and 'a:each' etc. for iterating arrays ('m:each' for maps, etc).
Anton Ertl
2021-01-15 08:25:23 UTC
Permalink
Post by Paul Rubin
Post by dxforth
If the gold standard is the classic FOR loops of other languages then
ISTM Minforth's is heading in that direction.
Which other languages are you referring to, and why would anyone
consider them the gold standard? Looking at Fortran, Pascal,
Modula-2, their use of stop instead of count is closer to DO..LOOP and
DO..+LOOP than to minforth's FOR..NEXT, with the main difference being
that Forth does not include the end value (for positive increments),
while other languages usually do.

Fortran has:

do var = start, stop [, step]
  ! statement(s)
end do

Pascal has

for <variable-name> := <initial_value> to|downto <final_value> do
  S;

Modula-2 has:
FOR Index := 5 TO -35 BY -7 DO
WriteInt(Index,5);
END;

Basic has:

FOR i = 10 TO 1 STEP -1
NEXT

These languages don't have address arithmetic (or at least discourage
it), so iterating over an array uses a for loop with the array index
as loop variable.

C has (and many have copied it):

for ( init; condition; increment )
statement;

which is very general (and actually is just syntactic sugar for a
while loop).
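That desugaring can be spelled out directly. The two C functions below (hypothetical names, my example) compute the same thing, with the for parts mapped onto the init/condition/increment of a while loop; the only semantic difference in general is that `continue` jumps to the increment in the for version:

```c
/* Counted loop written as a for loop. */
int sum_for(int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += i;
    return s;
}

/* The same loop desugared into a while loop. */
int sum_while(int n)
{
    int s = 0;
    int i = 0;            /* init */
    while (i < n) {       /* condition */
        s += i;           /* body */
        i++;              /* increment */
    }
    return s;
}
```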

Well, I have to admit that Pascal, Modula-2, Basic, and C use FOR/for,
and BASIC even uses NEXT, so if you mean that with "Gold standard",
yes, minforth is closer to Pascal, Modula-2, Basic, and C while the
traditional words are closer to Fortran.
Post by Paul Rubin
Other languages are moving away from counted loops in favor of loops
over collections, it seems to me.
Well, that's what minforth does: His words are good for looping over
arrays. But they don't fit some existing uses of DO +LOOP at all.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
none) (albert
2021-01-15 17:10:02 UTC
Permalink
Post by Anton Ertl
Post by Paul Rubin
Post by dxforth
If the gold standard is the classic FOR loops of other languages then
ISTM Minforth's is heading in that direction.
Which other languages are you referring to, and why would anyone
consider them the gold standard. Looking at Fortran, Pascal,
Modula-2, their use of stop instead of count is closer to DO..LOOP and
DO..+LOOP than to minforth's FOR..NEXT, with the main difference being
that Forth does not include the end value (for positive increments),
while other languages usually do.
do var = start, stop [,step]
! statement(s)


end do
Pascal has
for < variable-name > := < initial_value > to [down to] < final_value > do
S;
FOR Index := 5 TO -35 BY -7 DO
WriteInt(Index,5);
END;
FOR i = 10 TO 1 STEP -1
NEXT
These languages don't have address arithmetic (or at least discourage
it), so iterating over an array uses a for loop with the array index
as loop variable.
for ( init; condition; increment )
statement;
which is very general (and actually is just syntactic sugar for a
while loop).
Well, I have to admit that Pascal, Modula-2, Basic, and C use FOR/for,
and BASIC even uses NEXT, so if you mean that with "Gold standard",
yes, minforth is closer to Pascal, Modula-2, Basic, and C while the
traditional words are closer to Fortran.
Post by Paul Rubin
Other languages are moving away from counted loops in favor of loops
over collections, it seems to me.
Well, that's what minforth does: His words are good for looping over
arrays. But they don't fit some existing uses of DO +LOOP at all.
Time to repost my vision on this. The tests demonstrate the application.
The following code works on ciforth.
There is an assumption that the return stack is accessible via the
pointer RSP@ and grows down.
This should result in fast code in Forths that know how to optimise
(away) return stack access.
If you are unfamiliar with ciforth, look at the legend below that
serves for gforth.
The main point is doing away with loop words that mark
the start or end, and perform loop index manipulations at the same
time. Because we already did away with the difference between
[: and :NONAME , loops now just work in interpret mode.

\ --------------------------------------------------
\ "AUTOLOAD" WANTED AUTOLOAD

WANT [: :I INCLUDE REGRESS ALIAS

\ This makes sure it works in the official 5.3.x release.
'[: ALIAS {
';] ALIAS }
:I 3RDROP RDROP RDROP RDROP ;
:I R2 RSP@ CELL+ ;
:I R3 R2 CELL+ ;
:I R4 R3 CELL+ ;
:I R5 R4 CELL+ ;
'RDROP ALIAS R-
'R2 ALIAS ii \ The loop counter
'R3 ALIAS lim \ The inclusive loop limit
'R4 ALIAS xt \ What is to be executed in each loop
'R5 ALIAS inc \ The loop increment
:I ix R3 @ ; \ Assumed called from do-body (quotation).

:I do' BEGIN R2 @ R3 @ < WHILE R@ EXECUTE 1 R2 +! REPEAT 3RDROP ;

\ A macro, it will do all the looping you want.
: do[..]' ( R1..R4: ii lim xt inc )
BEGIN ii @ lim @ - inc @ XOR 0< WHILE xt @ EXECUTE inc @ ii +! REPEAT ;
\ A simpler macro, it will do almost all the looping you want.
\ The loop increment is one.
: do' BEGIN ii @ lim @ < WHILE xt @ EXECUTE 1 ii +! REPEAT ;
\ Auxiliary words end here.

\ The looping constructs
: do) >R >R 0 >R ( R: xt lim ix ) do' R- R- R- ;
: do] >R 1+ >R 1 >R ( R: xt lim ix ) do' R- R- R- ;
\ " HIGH 1+ LOW DO <BODY> LOOP " translates to "LOW HIGH { <BODY> } DO[] "
: do[] >R 1+ >R >R ( R: xt lim ix ) do' R- R- R- ;

: do[..] OVER >R >R 0< + 1+ >R >R do[..]' R- R- R- R- ;
\ : SPACES 'SPACE do] ;

: SPACES 'SPACE SWAP do] ;

REGRESS : test 3 { 12 } do) ; test S: 12 12 12
REGRESS : test 0 { 12 } do) ; test S:
REGRESS : test -1 { 12 } do) ; test S:
REGRESS : test 3 { 12 } do] ; test S: 12 12 12
REGRESS : test 3 { ix DUP . } do] ; test S: 1 2 3
REGRESS : test 3 { ix DUP . } do) ; test S: 0 1 2
REGRESS : test 2 5 { ix DUP . } do[] ; test S: 2 3 4 5
REGRESS : test 2 7 2 { ix DUP . } do[..] ; test S: 2 4 6

REGRESS 3 { 12 } do) S: 12 12 12
REGRESS 0 { 12 } do) S:
REGRESS 3 { 12 } do] S: 12 12 12
REGRESS 3 { ix DUP . } do] S: 1 2 3
REGRESS 3 { ix DUP . } do) S: 0 1 2
REGRESS 2 5 { ix DUP . } do[] S: 2 3 4 5
REGRESS 2 7 2 { ix DUP . } do[..] S: 2 4 6
REGRESS 7 2 -2 { ix DUP . } do[..] S: 7 5 3
REGRESS 7 1 -2 { ix DUP . } do[..] S: 7 5 3 1
REGRESS 1 7 -2 { IX DUP . } do[..] S:

:I jx RSP@ 8 CELLS + @ ;
REGRESS 3 { 2 { jx DUP . } do] } do] S: 1 1 2 2 3 3
REGRESS 1 3 { 1 1 { jx DUP . } do[] } do[] S: 1 2 3
\ Next higher loop index is different for explicit increments
:I jx' RSP@ 9 CELLS + @ ;
REGRESS 1 3 1 { 1 1 1 { jx' DUP . } do[..] } do[..] S: 1 2 3
\ Leaving
:I leave RSP@ 5 CELLS + RSP! ;
REGRESS : test 7 { ix 3 = IF ix leave THEN } do] ; test S: 3
:I leave' RSP@ 6 CELLS + RSP! ;
REGRESS : test 1 7 1 { ix 3 = IF ix leave' THEN } do[..] ; test S: 3
.( ALL TESTS PASSED )
EXIT
Summary of advantage:
1. end of headache where loops end
2. end of special measures for empty loops
3. loop increment is a compile time constant,
easier for humans and for the optimizer
4. decrementing loops behave the same
5. eminently optimizable
6. works in interpret mode
\ ---------------------------------------


\ ---------------------------------------
The code could be made to work on gforth with
{ --> [:
} --> ;]
RSP@ --> RP@
:I A b C ; --> : A POSTPONE b POSTPONE C ; IMMEDIATE
REGRESS --> t{
S: --> }t \
'DROP --> ' DROP / ['] DROP
\ --------------------------------------------------------
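The termination test at the heart of do[..]', `ii @ lim @ - inc @ XOR 0<`, continues while the sign of (index - limit) differs from the sign of the increment. A C model of it (my sketch; `do_range` is a hypothetical name, and it takes the already-adjusted exclusive limit that do[..] computes from the inclusive one) behaves the same for up- and down-counting loops:

```c
#include <stddef.h>

/* Continue while the sign of (i - lim) differs from the sign of inc:
   ((i - lim) ^ inc) < 0 is the C rendering of
   "ii @ lim @ - inc @ XOR 0<".  Visited indices go into out[];
   returns the number of iterations run. */
size_t do_range(long i, long lim, long inc, long *out, size_t max)
{
    size_t n = 0;
    while (((i - lim) ^ inc) < 0 && n < max) {
        out[n++] = i;
        i += inc;
    }
    return n;
}
```

One condition covers both directions, which is why decrementing loops behave the same as incrementing ones (advantage 4 above).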
Post by Anton Ertl
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
--
This is the first day of the end of your life.
It may not kill you, but it does make you weaker.
If you can't beat them, too bad.
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
dxforth
2021-01-16 02:01:20 UTC
Permalink
Post by Anton Ertl
...
for ( init; condition; increment )
statement;
which is very general (and actually is just syntactic sugar for a
while loop).
Something Forth hasn't managed to do despite today's resources and
optimizers. With rare exception, no one in Forth has thought beyond
DO LOOP et al. The only instance I'm aware of is the 'Curly Control
Structure Set' (FD 13/5, 14/1).
dxforth
2021-01-17 21:45:56 UTC
Permalink
Post by dxforth
Post by Anton Ertl
...
for ( init; condition; increment )
statement;
which is very general (and actually is just syntactic sugar for a
while loop).
Something Forth hasn't managed to do despite today's resources and
optimizers. With rare exception, no one in Forth has thought beyond
DO LOOP et al. The only instance I'm aware of is the 'Curly Control
Structure Set' (FD 13/5, 14/1).
Reproducing C's for-loop in Forth would look something like:

0 FOR 10 < WHILE I . STEP 1+ NEXT

would count 0..9

I expect it to be straightforward to code and would use a separate
loop stack. For optimizing compilers there'd be little reason to
use DO LOOP.
Anton Ertl
2021-01-22 08:21:16 UTC
Permalink
Post by dxforth
0 FOR 10 < WHILE I . STEP 1+ NEXT
would count 0..9
I expect it to be straightforward to code and would use a separate
loop stack.
Using the return stack as the loop stack (and assuming a matching I),
this could be implemented as:

: for ( entry: x -- x; loopback: -- x)
]] >r begin i [[ ; immediate

: step ( -- x )
]] r> [[ ; immediate

: next ( x -- )
]] >r repeat r> drop [[ ; immediate

: foo1 10 0 do i . loop ;
: foo2 0 FOR 10 < WHILE I . STEP 1+ NEXT ;

Tested with gforth.

It provides the DO...LOOP advantage of getting
the index out of the way; it allows the C "for" advantages of using an
arbitrary termination check (which also covers the signed/unsigned
problem) and an arbitrary index progression.

Limitations/drawbacks: the limit is in the way (the example uses a
constant, but where this is not possible, it can make things messy);
the index can only be one cell; you can use only one WHILE (maybe have
NEXT2 and NEXT3 for 2 or 3 WHILEs).

Let's see the generated code; I marked the start of the loop bodies
(where the LOOP or NEXT jumps to) with ">":

vfxlin 4.72:
10 0 do i . loop 0 FOR 10 < WHILE I . STEP 1+ NEXT
PUSH 080C0D6F PUSH 00
PUSH 7FFFFFF6 NOP
PUSH 00 NOP
NOP NOP
NOP NOP
NOP NOP
NOP NOP
MOV EDX, [ESP] >MOV EDX, [ESP]
LEA EBP, [EBP+-04] CMP EDX, 0A
MOV [EBP], EBX JNL/GE 080C0DB9
MOV EBX, EDX MOV EDX, [ESP]
CALL 080531C4 . LEA EBP, [EBP+-04]
ADD [ESP], 01 MOV [EBP], EBX
ADD [ESP+04], 01 MOV EBX, EDX
JNO 080C0D50 CALL 080531C4 .
LEA ESP, [ESP+0C] POP EDX
NEXT, INC EDX
( 48 bytes, 17 instructions ) PUSH EDX
JMP 080C0D98
LEA ESP, [ESP+04]
NEXT,
( 46 bytes, 21 instructions )

The loop body starts after the sequence of nops.

iforth-5.1:
10 0 do i . loop 0 FOR 10 < WHILE I . STEP 1+ NEXT
: foo1 : foo2
mov rcx, #10 d# lea rbp, [rbp -8 +] qword
xor rbx, rbx mov [rbp 0 +] qword, 0 d#
call (DO) offset NEAR pop rbx
nop nop
nop >mov rdi, [rbp 0 +] qword
mov rdi, [rbp 0 +] qword cmp rdi, #10 b#
push rbx jge $10226216 offset NEAR
push rdi mov rdi, [rbp 0 +] qword
lea rbp, [rbp -8 +] qword push rbx
mov [rbp 0 +] qword, $10226177 d# push rdi
jmp .+10 ( $1013888A ) offset NEAR lea rbp, [rbp -8 +] qword
pop rbx mov [rbp 0 +] qword, $102261FD d#
add [rbp 0 +] qword, 1 b# jmp .+10 ( $1013888A ) offset NEAR
add [rbp 8 +] qword, 1 b# mov rbx, [rbp 0 +] qword
jno $10226160 offset NEAR lea rbp, [rbp 8 +] qword
add rbp, #24 b# lea rbx, [rbx 1 +] qword
push rbx lea rbp, [rbp -8 +] qword
; mov [ebp 0 +] dword, rbx
pop rbx
jmp $102261D8 offset SHORT
push rbx
pop rbx
mov rdi, [rbp 0 +] qword
lea rbp, [rbp 8 +] qword
push rbx
;

Again, the nops precede the loop body.

lxf (I have to use R@ instead of I in two places):
10 0 do i . loop 0 FOR 10 < WHILE r@ . STEP 1+ NEXT
mov dword [esp-8h] , # 7FFFFFF6h mov dword [esp-4h] , # 0h
mov dword [esp-4h] , # 8000000Ah lea esp , [esp-4h]
lea esp , [esp-8h] >cmp dword [esp] , # Ah
mov eax , [esp] jge "0804FD52"
add eax , [esp+4h] mov [ebp-4h] , ebx
mov [ebp-4h] , ebx mov ebx , [esp]
mov ebx , eax lea ebp , [ebp-4h]
lea ebp , [ebp-4h] call .
call . mov eax , [esp]
inc dword [esp] inc eax
jno "0804FD07" mov [esp] , eax
lea esp , [esp+8h] jmp "0804FD31"
ret near lea esp , [esp+4h]
ret near
SwiftForth (again, I have to use r@ in two places):
10 0 do i . loop 0 FOR 10 < WHILE r@ . STEP 1+ NEXT
-7FFFFFF6 # PUSH 0 # PUSH
7FFFFFF6 # PUSH >A # 0 [ESP] CMP
4 # EBP SUB 8087AFD JNL
EBX 0 [EBP] MOV 4 # EBP SUB
0 [ESP] EBX MOV EBX 0 [EBP] MOV
4 [ESP] EBX ADD 0 [ESP] EBX MOV
8050FAF ( . ) CALL 8050FAF ( . ) CALL
0 [ESP] INC 4 # EBP SUB
8087A99 JNO EBX 0 [EBP] MOV
8 # ESP ADD EBX POP
RET EBX INC
EBX PUSH
0 [EBP] EBX MOV
4 # EBP ADD
8087AD1 JMP
EAX POP
RET
Post by dxforth
For optimizing compilers there'd be little reason to use DO LOOP.
Yes, but it seems that existing Forth compilers don't optimize enough
yet.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
minf...@arcor.de
2021-01-22 11:08:13 UTC
Permalink
Post by Anton Ertl
Post by dxforth
0 FOR 10 < WHILE I . STEP 1+ NEXT
would count 0..9
I expect it to be straightforward to code and would use a separate
loop stack.
Using the return stack as the loop stack (and assuming a matching I),
: for ( entry: x -- x; loopback: -- x)
]] >r begin i [[ ; immediate
: step ( -- x )
]] r> [[ ; immediate
: next ( x -- )
]] >r repeat r> drop [[ ; immediate
: foo1 10 0 do i . loop ;
: foo2 0 FOR 10 < WHILE I . STEP 1+ NEXT ;
<snip>
Post by dxforth
For optimizing compilers there'd be little reason to use DO LOOP.
Yes, but it seems that existing Forth compilers don't optimize enough
yet.
I would expect loop indices to reside in CPU registers for speed.

Another idea could be to introduce stack frames for housing locals and loop parameters.
Then loop parameters don't get in the way of legal return stack operations, and UNLOOP
becomes unnecessary.

Of course the compiler would have to lay down two different patterns for ; and EXIT:
- the classical and fast single RET for words w/o locals and counted loops
- a RET plus stack frame adjustment (like EBP as in many C compilers) for
words with locals or counted loops (Forth calling convention?)

Systems with local stacks could of course implement loop parameters as 'hidden'
locals.
Anton Ertl
2021-01-22 19:06:54 UTC
Permalink
Post by ***@arcor.de
I would expect loop indices to reside in CPU registers for speed.
Existing Forth compilers are not there yet (apart apparently from the
unreleased 64-bit version of lxf/ntf).
Post by ***@arcor.de
Another idea could be to introduce stack frames for housing locals and loop parameters.
So loop parameters don't get into the way of legal return stack operations and UNLOOP
becomes unnecessary.
Gforth uses a locals stack for locals, so return stack operations
don't interfere with locals. Putting the loop parameters on the
locals stack would be possible, but Gforth puts them on the return
stack, following Forth tradition (and to support code that relies on
that tradition).
Post by ***@arcor.de
- the classical and fast single RET for words w/o locals and counted loops
- a RET plus stack frame adjustment (like EBP as in many C compilers) for
words with locals or counted loops (Forth calling convention?)
Gforth's EXIT is a compiling word that compiles a locals-stack
adjustment if necessary (the programmer still has to clean up the
return stack, including UNLOOP). A while ago we tried to change EXIT
into a word that can be ticked and EXECUTEd, and if so be equivalent
to writing EXIT there. This turned out to be a game of whack-a-mole:
one special case after another turned up, and they required increasing
complexity to deal with; some of that is described in our paper
[ertl15]; after that paper was finished, another special case turned
up, and that was the straw that broke the camel's back: we reverted to
the simple compiling EXIT; we did keep the change to DOES>, which is
an improvement.

@InProceedings{ertl15,
author = {M. Anton Ertl and Bernd Paysan},
title = {From \texttt{exit} to \texttt{set-does>} --- A Story of {Gforth} Re-Implementation},
crossref = {euroforth15},
pages = {41--47},
url = {http://www.euroforth.org/ef15/papers/ertl.pdf},
url-slides = {http://www.euroforth.org/ef15/papers/ertl-slides.pdf},
OPTnote = {not refereed},
abstract = {We changed \code{exit} from an immediate to a
non-immediate word; this requires changes in the
de-allocation of locals, which leads to changes in
the implementation of colon definitions, and to
generalizing \code{does>} into \code{set-does>}
which allows the defined word to call arbitrary
execution tokens. The new implementation of locals
cleanup can usually be optimized to similar
performance as the old implementation. The new
implementation of \code{does>} has performance
similar to the old implementation, while using
\code{set-does>} results in speedups in certain
cases.}
}

@Proceedings{euroforth15,
title = {31st EuroForth Conference},
booktitle = {31st EuroForth Conference},
year = {2015},
key = {EuroForth'15},
url = {http://www.complang.tuwien.ac.at/anton/euroforth/ef15/papers/},
url-pdf = {http://www.complang.tuwien.ac.at/anton/euroforth/ef15/papers/proceedings.pdf}
}

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
none) (albert
2021-01-22 21:44:39 UTC
Permalink
Post by Anton Ertl
Post by ***@arcor.de
I would expect loop indices to reside in CPU registers for speed.
Existing Forth compilers are not there yet (apart apparently from the
unreleased 64-bit version of lxf/ntf).
I've done some experiments with ciforth, and had some success.
However, the loop indices land in CPU registers as a result of
optimising away items residing on the return stack.

: test10a 10 0 DO I + LOOP ;
is compiled as

: test10a
$0A 0
DO I + 1 (+LOOP)
0BRANCH [ -48 , ] ( between ? I ) UNLOOP ;

And then (redacted); the lower-case names are the equivalent R9..R16 registers.
MOVI, X| R| cx| 10 IL,
MOVI, X| R| dx| 0 IL,
b1:
POP|X, BX|
MOV, X| F| dx'| R| AX|
ADD, X| F| BX'| R| AX|
PUSH|X, AX|

ADDI, X| R| dx| 1 IL,
CMP, X| T| dx'| R| CX|
J|X, S| Y| b1 RL,
JMP, 0 (RL,)

There are about a dozen steps involved.
Post by Anton Ertl
- anton
Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Stephen Pelc
2021-01-23 18:09:03 UTC
Permalink
Post by Anton Ertl
Post by ***@arcor.de
I would expect loop indices to reside in CPU registers for speed.
Existing Forth compilers are not there yet (apart apparently from the
unreleased 64-bit version of lxf/ntf).
VFX Forth 64 keeps the loop indices in registers.

Stephen
--
Stephen Pelc, ***@vfxforth.com <<< NEW
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, +44 (0)78 0390 3612
web: http://www.mpeforth.com - free VFX Forth downloads
Anton Ertl
2021-01-24 11:44:55 UTC
Permalink
Post by Stephen Pelc
Post by Anton Ertl
Existing Forth compilers are not there yet (apart apparently from the
unreleased 64-bit version of lxf/ntf).
VFX Forth 64 keeps the loop indices in registers.
Great! So of the 4 systems I tested, two have descendants that should
be fast, and two have yet to go there.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020
dxforth
2021-01-23 02:51:50 UTC
Permalink
Post by Anton Ertl
Post by dxforth
0 FOR 10 < WHILE I . STEP 1+ NEXT
would count 0..9
I expect it to be straightforward to code and would use a separate
loop stack.
Using the return stack as the loop stack (and assuming a matching I),
: for ( entry: x -- x; loopback: -- x)
]] >r begin i [[ ; immediate
: step ( -- x )
]] r> [[ ; immediate
: next ( x -- )
]] >r repeat r> drop [[ ; immediate
: foo1 10 0 do i . loop ;
: foo2 0 FOR 10 < WHILE I . STEP 1+ NEXT ;
Tested with gforth.
It provides the DO...LOOP advantage of getting
the index out of the way; it allows the C "for" advantages of using an
arbitrary termination check (which also covers the signed/unsigned
problem) and an arbitrary index progression.
Limitations/drawbacks: the limit in the way (the example uses a
constant, but where this is not possible, it can make things messy);
the index can only be one cell; you can use only one WHILE (maybe have
NEXT2 and NEXT3 for 2 or 3 WHILEs).
Correct - wasn't thinking. Forth prefers the limit be passed as a parameter.
The syntax becomes:

10 0 FOR > WHILE ... STEP 1+ NEXT

An implementation (using return stack for demo purposes):

: LDROP \ drop loop parameters
postpone 2r> postpone 2drop ; immediate

: FOR
postpone begin postpone 2dup postpone 2>r ; immediate

: NEXT
postpone repeat postpone ldrop ; immediate

synonym STEP 2r>

: t1 9 0 for >= while r@ . step 1+ next ;
: t2 0 9 for <= while r@ . step 1- next ;
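For comparison, t1 above can be modelled in C (`count_up` is a hypothetical name of mine; the point is that the limit, the test, and the step are all ordinary code rather than baked into the loop word):

```c
#include <stddef.h>

/* C model of "9 0 for >= while r@ . step 1+ next":
   limit and index are both kept; the test (>=) and the step (1+)
   are arbitrary.  Visited indices go into out[]; returns the count. */
size_t count_up(long limit, long i, long *out, size_t max)
{
    size_t n = 0;
    while (limit >= i && n < max) {  /* >= WHILE */
        out[n++] = i;                /* R@ . */
        i += 1;                      /* 1+ */
    }
    return n;
}
```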
Hugh Aguilar
2021-01-18 01:51:16 UTC
Permalink
Post by Paul Rubin
Post by dxforth
If the gold standard is the classic FOR loops of other languages then
ISTM Minforth's is heading in that direction.
Other languages are moving away from counted loops in favor of loops
over collections, it seems to me.
Loops over collections??? Gosh! How is this possible?

Let's try writing code to iterate over a linked list using rquotations:
-------------------------------------------------------------------------------------------
VFX? SwiftForth? or [if] \ these don't work with HumptyDumpty's code because these HOFs have locals of their own.

\ These are for use with REX and rquotations. The | prefix is the naming convention for anything that uses quotations.
\ I am getting rid of that "toucher" term and replacing it with "quotation."

\ In VFX it is okay to use REX on an xt or use EXECUTE on an rq --- so these can be used for everything --- but this doesn't work on SwiftForth.
\ On SwiftForth you can only use REX on an rq and EXECUTE on an xt --- so you have to use both kinds of HOF as appropriate.

: |each ( i*x head rq -- j*x ) \ quotation: i*x node -- j*x
{ rq | next -- }
begin ?dup while \ -- node
dup .fore @ to next
rq rex
next repeat ;

: |find-node ( i*x head rq -- j*x node|false ) \ quotation: i*x node -- j*x flag
{ node rq | next -- node|false }
begin node while
node .fore @ to next
node rq rex if node exit then
next to node repeat
false ;

: |find-prior ( i*x head rq -- j*x -1|node|false ) \ quotation: i*x node -- j*x flag
-1 { node rq prior | next -- prior|false } \ prior is -1, meaning found node was the head
begin node while
node .fore @ to next
node rq rex if prior exit then
node to prior next to node repeat
false ;

[then]
-------------------------------------------------------------------------------------------

I'm using BEGIN WHILE REPEAT in these words.
This doesn't actually matter to the user though, because |EACH |FIND-NODE and |FIND-PRIOR
are black-box functions from a code library, so the user doesn't need to
worry about how they work internally.
Here are some examples of using |EACH to iterate over a list and collect data:
-------------------------------------------------------------------------------------------
seq
w field .gender \ 'M' or 'F' \ if someone isn't one or the other, then he/she isn't a person
constant person

: init-person ( str node -- node )
init-seq >r
[char] M r@ .gender ! \ default is male
r> ;

: new-person ( str -- node )
person alloc
init-person ;

person
w field .hotness \ scale of 1 to 10
constant chick

: init-chick ( hotness str node -- node )
init-person >r
[char] F r@ .gender !
r@ .hotness !
r> ;

: new-chick ( hotness str -- node )
chick alloc
init-chick ;

: chick-counter ( head -- chicks )
0
swap r[ .gender @ [char] F = if 1+ then ]r |each ;

: macro-chick-counter ( head -- chicks )
0
swap each[ .gender @ [char] F = if 1+ then ]each ;

: segregate-genders { head | chicks dudes -- chicks dudes }
head r[
dup .gender @ [char] F = if chicks one-link to chicks
else dudes one-link to dudes then
]r |each
chicks dudes ;

: test { head | people chicks tens nines eights -- }
head r[
1 +to people
dup .gender @ [char] F = if 1 +to chicks
.hotness @
dup 8 = if 1 +to eights then
dup 9 = if 1 +to nines then
10 = if 1 +to tens then \ Yay! We found at least one!
else drop then
]r |each
cr ." total people: " people .
cr ." total chicks: " chicks .
cr ." level 10 chicks: " tens .
cr ." level 9 chicks: " nines .
cr ." level 8 chicks: " eights .
;
-------------------------------------------------------------------------------------------
none) (albert
2021-01-11 12:30:29 UTC
Permalink
In article <***@mips.complang.tuwien.ac.at>,
Anton Ertl <***@mips.complang.tuwien.ac.at> wrote:
<SNIP>
Post by Anton Ertl
Could a compiler decide which approach to use depending on the code?
If it sees the whole loop before deciding to generate code, it could
select the approach depending on whether and how +LOOP is used and how
many occurrences of I there are.
I'd say that if you want to consider the whole code, you are effectively
contemplating an optimizer. Going that way you are no longer restricted
in any way, and FOR .. NEXT , DO .. LOOP, and DO .. 1 +LOOP all converge
to an optimum for the processor at hand.
Post by Anton Ertl
But even a compiler that generates the code immediately could start
out with approach 3; when seeing I, compute the index, and put it in a
register. If that register is not needed for something else, further
occurences of I use that value. But if the register was used for
something else in between, just recompute the index.
You end up with something that may be more complicated than a
general-purpose optimiser. I admit that loop optimising is
low-hanging fruit, given the large benefit for the implementation effort.

I'm currently experimenting with an optimiser that analyses
machine code resulting from a Forth compilation.
A DO LOOP results in return stack manipulations, and
the end of the loop involves a conditional jump. Both result also
from other sequences like >R .. R> or DUP < UNTIL .
They are optimised at the same time.
Post by Anton Ertl
- anton
Groetjes Albert
--
This is the first day of the end of your life.
It may not kill you, but it does make you weaker.
If you can't beat them, too bad.
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
minf...@arcor.de
2021-01-11 14:37:07 UTC
Permalink
Post by none) (albert
I'm currently experimenting with an optimiser that analyses
machine code resulting from a Forth compilation.
A DO LOOP results in return stack manipulations, and
the end of the loop involves a conditional jump. Both result also
from other sequences like >R .. R> or DUP < UNTIL .
They are optimised at the same time.
Interesting experiments. In my younger years I did code optimization
by translating Forth definitions into Prolog clauses and using the Prolog
engine for repeated pattern matching and elimination/replacement.
It worked quite nicely but became too complex for everyday use.
But that was still at the high level; assembly-level optimization was
considered, but never realized in practice.

Time has moved on, and code optimization has become a specialized AI field.
I found a nice overview here:
https://arxiv.org/pdf/1805.03441.pdf
Anton Ertl
2021-01-09 12:25:40 UTC
Permalink
Post by Anton Ertl
long double double
fldt (%rdi) fld %st(0)
fmul %st(1),%st fmull (%rdi)
fldt (%rsi) faddl (%rsi)
faddp %st,%st(1) fstpl (%rsi)
fstpt (%rsi) add %rdx,%rdi
add %rdx,%rdi add %rdx,%rsi
add %rdx,%rsi sub $0x1,%rcx
sub $0x1,%rcx jne f <axpy+0xf>
jne 9 <axpy+0x9>
 7 cycles/iteration 2 cycles/iteration (Skylake)
13 cycles/iteration 2.7 cycles/iteration (Zen/Zen2)

I have now also measured on Zen2 and Zen, with the result in the last
line (little difference between Zen and Zen2). If you are doing a lot
of 80-bit FP stuff, buy Intel. If you have a Zen or Zen2, it's even
more important than on Intel to set the precision appropriately.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020