Discussion:
Floating point implementations on AMD64
Anton Ertl
2024-04-13 17:55:18 UTC
Permalink
I just looked at the floating-point implementations of recent
SwiftForth and VFX (where FP support is finally present in the system
from the start), and at iForth-5.1-mini (for comparison):

1 FLOATS .

reports:

16 iforth
10 sf64
10 vfx64

For

: foo f+ f* ;

the resulting code is:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok


VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )


iForth:
$10226000 : foo                            488BC04883ED088F4500  ***@H.m..E.
$1022600A fld        [r13 0 +] tbyte       41DB6D00              A[m.
$1022600E fld        [r13 #16 +] tbyte     41DB6D10              A[m.
$10226012 fxch       ST(2)                 D9CA                  YJ
$10226014 lea        r13, [r13 #32 +] qword  4D8D6D20            M.m
$10226018 faddp      ST(1), ST             DEC1                  ^A
$1022601A fxch       ST(1)                 D9C9                  YI
$1022601C fpopswap,                        41DB6D00D9CA4D8D6D10  A[m.YJM.m.
$10226026 fmulp      ST(1), ST             DEC9                  ^I
$10226028 fpush,                           4D8D6DF0D9C941DB7D00  M.mpYIA[}.
$10226032 ;                                488B45004883C508FFE0  H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

Gforth sticks out by using 8-byte FP values; most of those are stored
in memory (supporting deep FP stacks), with the top of the FP stack in an
xmm register on AMD64:

: foo f+ f* ; ok
see-code foo
$7FF2CE8034E0 f+ 1->1
7FF2CE4A6E43: mov rax,r12
7FF2CE4A6E46: lea r12,$08[r12]
7FF2CE4A6E4B: addsd xmm15,$08[rax]
$7FF2CE8034E8 f* 1->1
7FF2CE4A6E51: mov rax,r12
7FF2CE4A6E54: lea r12,$08[r12]
7FF2CE4A6E59: mulsd xmm15,$08[rax]
$7FF2CE8034F0 ;s 1->1
7FF2CE4A6E5F: mov rbx,[r14]
7FF2CE4A6E62: add r14,$08
7FF2CE4A6E66: mov rax,[rbx]
7FF2CE4A6E69: jmp eax
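
For concreteness, here is a rough C sketch (illustrative only, not
Gforth's actual source) of the scheme the code above suggests: the FP
stack lives in memory and grows downwards, popping increments the
stack pointer (r12 above), and the top element is cached in a variable
that a compiler can keep in an XMM register (xmm15 above):

/* rough sketch, not Gforth's source */
typedef struct {
    double  tos;     /* cached top of FP stack, kept in an XMM register */
    double *fp_sp;   /* rest of the FP stack in memory, grows downwards */
} fpstate;

static void f_plus(fpstate *s)   /* F+ ( F: r1 r2 -- r1+r2 ) */
{
    s->tos += *s->fp_sp++;       /* add the second item and pop it */
}

static void f_star(fpstate *s)   /* F* ( F: r1 r2 -- r1*r2 ) */
{
    s->tos *= *s->fp_sp++;
}

With the top of stack cached like this, F+ and F* each need only one
memory access, which matches the addsd/mulsd with a memory operand in
the code above.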

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Krishna Myneni
2024-04-13 23:47:20 UTC
Permalink
Post by Anton Ertl
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
1 FLOATS .
16 iforth
10 sf64
10 vfx64
For
: foo f+ f* ;
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )
$1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
$10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
$10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
...

For me, an 8-item hardware fp stack is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit oneself to an 8-item fp stack, but I'd hate to have to code with
such a limit.

--
Krishna
Stephen Pelc
2024-04-14 20:02:01 UTC
Permalink
Post by Krishna Myneni
Post by Anton Ertl
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
1 FLOATS .
16 iforth
10 sf64
10 vfx64
For
: foo f+ f* ;
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )
$1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
$10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
$10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
...
For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code wit
such a limit.
The manual (gasp) documents how to change the default FP package.

Changing the default pack also changes the system call interfaces to
match.

Stephen
--
Stephen Pelc, ***@vfxforth.com
MicroProcessor Engineering, Ltd. - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)78 0390 3612, +34 649 662 974
http://www.mpeforth.com - MPE website
http://www.vfxforth.com/downloads/VfxCommunity/ - downloads
Krishna Myneni
2024-04-14 23:34:45 UTC
Permalink
Post by Stephen Pelc
Post by Krishna Myneni
Post by Anton Ertl
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
...
Post by Stephen Pelc
Post by Krishna Myneni
Post by Anton Ertl
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
...
For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code wit
such a limit.
The manual (gasp) documents how to change the default FP package.
Good to know.

--
Krishna
dxf
2024-04-15 01:37:39 UTC
Permalink
Post by Stephen Pelc
Post by Krishna Myneni
Post by Anton Ertl
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
1 FLOATS .
16 iforth
10 sf64
10 vfx64
For
: foo f+ f* ;
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )
$1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
$10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
$10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
...
For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code wit
such a limit.
The manual (gasp) documents how to change the default FP package.
Changing the default pack also changes the system call interfaces to
match.
Specifically chapter 14 in the PDF doc.

integers
remove-FP-pack
include Lib/x64/Hfpx64

swaps in the 80-bit external stack model.

The HTML doc appears to lack this information (or makes it hard to find),
should a user pick that version of the manual by mistake.
Anton Ertl
2024-04-20 15:23:22 UTC
Permalink
Post by Stephen Pelc
The manual (gasp) documents how to change the default FP package.
Changing the default pack also changes the system call interfaces to
match.
It is great that an FP package is now available by default, rather
than having to load some obscurely-named file from an
installation-specific path. E.g. appbench-1.4 contains a file
setup/vfx.fth which contains the lines:

1 cells 4 = [if]
include /usr/share/doc/VfxForth/Lib/Ndp387.fth
[then]

The result is that when I just tested various systems on the
appbench-1.4 suite, vfx64 worked (for four of the benchmarks), and I
won't spoil my positive message by reporting on what did not work.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
dxf
2024-04-22 11:05:10 UTC
Permalink
Post by Anton Ertl
Post by Stephen Pelc
The manual (gasp) documents how to change the default FP package.
Changing the default pack also changes the system call interfaces to
match.
It is great that now an FP package is available by default, rather
than having to load some obscurely-named file from an
installation-specific path. E.g. appbench-1.4 contains a file
1 cells 4 = [if]
include /usr/share/doc/VfxForth/Lib/Ndp387.fth
[then]
The result is that when I just tested various systems on the
appbench-1.4 suite, vfx64 worked (for four of the benchmarks), and I
won't spoil my positive message by reporting on what did not work.
What didn't work?

When loading an alternate f/p package I found it removes parts of the
compiler unrelated to fp. Hopefully this can be rectified. IIRC this
problem didn't exist with the previous setup where fp had to be explicitly
loaded.
Stephen Pelc
2024-04-22 19:56:55 UTC
Permalink
Post by dxf
When loading an alternate f/p package I found it removes parts of the
compiler unrelated to fp. Hopefully this can be rectified. IIRC this
problem didn't exist with the previous setup where fp had to be explicitly
loaded.
Please send me more details by email.

Stephen
--
Stephen Pelc, ***@vfxforth.com
MicroProcessor Engineering, Ltd. - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)78 0390 3612, +34 649 662 974
http://www.mpeforth.com - MPE website
http://www.vfxforth.com/downloads/VfxCommunity/ - downloads
dxf
2024-04-14 03:29:01 UTC
Permalink
Post by Anton Ertl
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
...
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
All the more reason to make fp loadable so users can choose the model
they want instead of having it built in. IIRC VFX and SWF previously did this.
mhx
2024-04-14 07:03:15 UTC
Permalink
Anton Ertl wrote:
[..]
Post by Anton Ertl
$1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
$10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
$10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
immediately that it is unusable for serious work.

The scheme used is rather complicated. iForth uses the internal stack
when it can prove that there will be no under- or overflow. Non-inlined
calls (F. below) always use the memory stack.

FORTH> pi fvalue val1 pi/2 fvalue val2 ok
FORTH> : test val1 fdup val2 foo val1 f+ val2 f* f. ; ok
FORTH> see test
Flags: ANSI
$015FDA80 : test
$015FDA8A fld $015FD650 tbyte-offset
$015FDA90 fld ST(0)
$015FDA92 fld $015FD670 tbyte-offset
$015FDA98 faddp ST(1), ST
$015FDA9A fmulp ST(1), ST
$015FDA9C fld $015FD650 tbyte-offset
$015FDAA2 faddp ST(1), ST
$015FDAA4 fld $015FD670 tbyte-offset
$015FDAAA fmulp ST(1), ST
$015FDAAC fpush,
$015FDAB6 jmp F.+10 ( $0124ED42 ) offset NEAR
$015FDABB ;
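
In case it helps to see the idea, here is a minimal sketch (an
illustration only, not iForth's code) of the kind of static check such
an analysis can perform on a straight-line sequence of FP words: track
the depth change of each word and allow the register stack only if the
depth never leaves the 0..8 window.

#include <stdbool.h>

struct fp_effect { int pops, pushes; };  /* e.g. F+ is {2,1}, FDUP is {1,2} */

/* true if the sequence provably stays within the 8 x87 registers */
static bool fits_in_x87(const struct fp_effect *seq, int n, int depth_in)
{
    int depth = depth_in;
    for (int i = 0; i < n; i++) {
        depth -= seq[i].pops;
        if (depth < 0) return false;   /* would underflow the register stack */
        depth += seq[i].pushes;
        if (depth > 8) return false;   /* would overflow the 8 slots */
    }
    return true;
}

A real system also has to decide what depth to assume at a word's entry
and what to do about calls it cannot see into, which is presumably where
the memory stack and words like fpush, and fpopswap, come in.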

Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory),
but I never got them to work reliably. The software
analysis works fine, but can be fooled in rather
contrived circumstances. I have not encountered a bug in the
past two decades.

-marcel
dxf
2024-04-14 08:02:10 UTC
Permalink
Post by mhx
[..]
$1022600A  fld           [r13 0 +] tbyte41DB6D00                  A[m.
$1022600E  fld           [r13 #16 +] tbyte
                                        41DB6D10                  A[m.
$10226012  fxch          ST(2)          D9CA                      YJ
$10226014  lea           r13, [r13 #32 +] qword
                                        4D8D6D20                  M.m $10226018  faddp         ST(1), ST      DEC1                      ^A
$1022601A  fxch          ST(1)          D9C9                      YI
$1022601C  fpopswap,                    41DB6D00D9CA4D8D6D10      A[m.YJM.m.
$10226026  fmulp         ST(1), ST      DEC9                      ^I
$10226028  fpush,                       4D8D6DF0D9C941DB7D00      M.mpYIA[}.
$10226032  ;                            488B45004883C508FFE0      H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
immediately that that is unusable for serious work.
Were that the case, Intel had plenty of opportunity to change it. They had
an academic advising them.
Krishna Myneni
2024-04-14 12:50:42 UTC
Permalink
Post by dxf
Post by mhx
[..]
$1022600A  fld           [r13 0 +] tbyte41DB6D00                  A[m.
$1022600E  fld           [r13 #16 +] tbyte
                                        41DB6D10                  A[m.
$10226012  fxch          ST(2)          D9CA                      YJ
$10226014  lea           r13, [r13 #32 +] qword
                                        4D8D6D20                  M.m $10226018  faddp         ST(1), ST      DEC1                      ^A
$1022601A  fxch          ST(1)          D9C9                      YI
$1022601C  fpopswap,                    41DB6D00D9CA4D8D6D10      A[m.YJM.m.
$10226026  fmulp         ST(1), ST      DEC9                      ^I
$10226028  fpush,                       4D8D6DF0D9C941DB7D00      M.mpYIA[}.
$10226032  ;                            488B45004883C508FFE0      H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
immediately that that is unusable for serious work.
Were that the case Intel had plenty opportunity to change it. They had
an academic advising them.
Let's take a non-trivial example to illustrate why the 8-deep fp stack
may not be that useful for numerical computation. This example is
actually from the FSL demo. The word computes the Lorenz equations,
which give rise to the famous butterfly attractor. This is a system of
three nonlinear first-order differential equations in three variables,
x, y, z, which are time-dependent. The Lorenz equations define the
instantaneous derivatives of these variables:

dx/dt = sigma*(y - x)
dy/dt = x*(rho - z) - y
dz/dt = x*y - beta*z

where sigma, rho, and beta are constant parameters.

Let's say we want to write a word DERIVS which computes and stores the
derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.

DERIVS ( F: x y z -- )

Hence, we want to place some values x, y, and z onto the fp stack and
compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they
don't overflow the stack. This should be doable with the 8-element
hardware fp stack.

Below I give Forth code which computes the derivatives. This code is
usable only on systems with a separate FP stack. It will be interesting
to see the compiled code given by Forth systems using the hardware fpu
stack to compute the results. While this example may behave properly, if
we go to a fourth-order system or higher, it gets less likely that the
hardware stack remains usable.

--
Krishna

== begin fpstack-test.4th ==
\ fpstack-test.4th
\
\ Compute the Lorenz equations, a set of three coupled
\ nonlinear differential equations.
\
\ dx/dt = sigma*(y - x)
\ dy/dt = x*(rho - z) - y
\ dz/dt = x*y - beta*z
\
\ sigma, rho, and beta are constant parameters.
\
\ The following code requires a separate fp stack

include ans-words \ only for kForth64
include fsl/fsl-util


[UNDEFINED] FPICK [IF]
cr .( Your system may not use a separate floating point stack!)
ABORT
[THEN]

[UNDEFINED] F2OVER [IF]
: f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
3 fpick 3 fpick ;
[THEN]

[UNDEFINED] F+! [IF]
: f+! ( a -- ) ( F: r -- ) dup f@ f+ f! ;
[THEN]

16.0e0 fconstant sigma
45.92e0 fconstant rho
4.0e0 fconstant beta

\ Compute the derivatives given the instantaneous values
\ x, y, z for a given time t.

\ xdot{ is an array consisting of dx/dt, dy/dt, dz/dt
3 float array xdot{

: derivs ( F: x y z -- )
fdup f2over \ F: x y z z x y
f- sigma f* fnegate
xdot{ 0 } f! \ F: x y z z
rho fover f- \ F: x y z z rho-z
4 fpick f* \ F: x y z z x*(rho - z)
3 fpick f-
xdot{ 1 } f! \ F: x y z z
fdrop
beta f* fnegate
xdot{ 2 } f!
f* xdot{ 2 } f+!
;

0 [IF]
include ttester
\ Test DERIVS
1e-15 set-near
t{ 0.1e 0.6e 4.0e derivs -> }t
t{ xdot{ 0 } f@ -> 8.0e0 }t
t{ xdot{ 1 } f@ -> 3.592e0 }t
t{ xdot{ 2 } f@ -> -15.94e0 }t
[THEN]

== end fpstack-test.4th ==
dxf
2024-04-14 14:27:41 UTC
Permalink
...
Below I give Forth code which computes the derivatives. This code is usable only on systems with a separate FP stack. It will be interesting to see the compiled code given by Forth systems using the hardware fpu stack to compute the results. While this example may behave properly, if we go to a fourth order system or higher, it gets less likely that the hardware stack remains usable.
Systems that default to hardware fpu stack may well offer a software stack option.
Anton Ertl
2024-04-14 15:19:41 UTC
Permalink
Post by Krishna Myneni
dx/dt = sigma*(y - x)
dy/dt = x*(rho -z) - y
dz/dt = x*y - beta*z
where sigma, rho, and beta are constant parameters.
Let's say we want to write a word DERIVS which computes and stores the
derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.
DERIVS ( F: x y z -- )
Hence, we want to place some values x, y, and z onto the fp stack and
compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they
don't overflow the stack. This should be doable with the 8-element
hardware fp stack.
I have adapted your Forth code:

[UNDEFINED] F2OVER [IF]
: f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
3 fpick 3 fpick ;
[THEN]

16.0e0 fconstant sigma
45.92e0 fconstant rho
4.0e0 fconstant beta

fvariable dx/dt
fvariable dy/dt
fvariable dz/dt

: derivs ( F: x y z -- )
fdup f2over \ F: x y z z x y
f- sigma f* fnegate
dx/dt f! \ F: x y z z
rho fover f- \ F: x y z z rho-z
4 fpick f* \ F: x y z z x*(rho - z)
3 fpick f-
dy/dt f! \ F: x y z z
fdrop
beta f* fnegate
frot frot f* f+ dz/dt f!
;

0.1e 0.6e 4.0e derivs
dx/dt f@ f. cr \ 8.
dy/dt f@ f. cr \ 3.592
dz/dt f@ f. cr \ -15.94

In particular, I eliminated the additional memory accesses to DZ/DT.

SwiftForth, VFX and iforth produce the expected results for your test
case. The code is:

SwiftForth 4.0.0-RC87 VFX Forth 64 5.43 iforth-5.1-mini
ST(0) FLD FLD ST fld ST(0)
44E8BC ( f2over ) CALL CALL 0050A080 F2OVER fld [r13 0 +] tbyte
ST(0) ST(1) FSUBP FSUBP ST(1), ST fxch ST(1)
44E8FB ( sigma ) CALL CALL 0050A2BB SIGMA fld [r13 #16 +] tby
ST(0) ST(1) FMULP FMULP ST(1), ST lea r13, [r13 #32 +]
FCHS FCHS fxch ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFE8 [RIP] fxch ST(1)
RBX 0 [RBP] MOV CALL 0050A2FB RHO fld ST(3)
4C508 [RDI] RBX LEA FLD ST(1) fld ST(3)
0 [RBX] TBYTE FSTP FSUBP ST(1), ST fsubp ST(1), ST
0 [RBP] RBX MOV LEA RBP, [RBP+-08] fld $101BC720 tbyte
8 [RBP] RBP LEA MOV [RBP], RBX fmulp ST(1), ST
44E923 ( rho ) CALL MOV EBX, # 00000004 fchs
ST(1) FLD CALL 005030C0 FPICK fstp $10226470 tbyte
ST(0) ST(1) FSUBP FMULP ST(1), ST fld $101BC710 tbyte
-8 [RBP] RBP LEA LEA RBP, [RBP+-08] fld ST(1)
RBX 0 [RBP] MOV MOV [RBP], RBX fsubp ST(1), ST
4 # EBX MOV MOV EBX, # 00000003 fld ST(4)
43C901 ( FPICK ) CALL CALL 005030C0 FPICK fmulp ST(1), ST
ST(0) ST(1) FMULP FSUBP ST(1), ST fld ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFC1 [RIP] fsubp ST(1), ST
RBX 0 [RBP] MOV FSTP ST fstp $10226490 tbyte
3 # EBX MOV CALL 0050A33B BETA ffreep ST(0)
43C901 ( FPICK ) CALL FMULP ST(1), ST fld $101BC700 tbyte
ST(0) ST(1) FSUBP FCHS fmulp ST(1), ST
-8 [RBP] RBP LEA FXCH ST(1) fchs
RBX 0 [RBP] MOV FXCH ST(2) fxch ST(1)
4C530 [RDI] RBX LEA FXCH ST(1) fxch ST(2)
0 [RBX] TBYTE FSTP FXCH ST(2) fxch ST(1)
0 [RBP] RBX MOV FMULP ST(1), ST fxch ST(2)
8 [RBP] RBP LEA FADDP ST(1), ST fmulp ST(1), ST
ST(0) FSTP FSTP TBYTE FFF9CFB4 [RIP] fxch ST(1)
44E94B ( beta ) CALL RET/NEXT fpopswap,
ST(0) ST(1) FMULP faddp ST(1), ST
FCHS fstp $102264B0 tbyte
43C807 ( FROT ) CALL ;
43C807 ( FROT ) CALL
ST(0) ST(1) FMULP
ST(0) ST(1) FADDP
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
4C558 [RDI] RBX LEA
0 [RBX] TBYTE FSTP
0 [RBP] RBX MOV
8 [RBP] RBP LEA
RET

FPICK is apparently implemented on SwiftForth and VFX through an
indirect branch that branches to one of 8 variants of "FLD ST(...)",
while iForth manages to resolve this during compilation.

I have also looked at VFX 5.11 which uses XMM registers instead of the
FP stack, but it does not inline FP operations, so you mostly see a long
sequence of calls.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Krishna Myneni
2024-04-14 23:32:11 UTC
Permalink
Post by Anton Ertl
Post by Krishna Myneni
dx/dt = sigma*(y - x)
dy/dt = x*(rho -z) - y
dz/dt = x*y - beta*z
where sigma, rho, and beta are constant parameters.
Let's say we want to write a word DERIVS which computes and stores the
derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.
DERIVS ( F: x y z -- )
Hence, we want to place some values x, y, and z onto the fp stack and
compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they
don't overflow the stack. This should be doable with the 8-element
hardware fp stack.
[UNDEFINED] F2OVER [IF]
: f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
3 fpick 3 fpick ;
[THEN]
16.0e0 fconstant sigma
45.92e0 fconstant rho
4.0e0 fconstant beta
fvariable dx/dt
fvariable dy/dt
fvariable dz/dt
: derivs ( F: x y z -- )
fdup f2over \ F: x y z z x y
f- sigma f* fnegate
dx/dt f! \ F: x y z z
rho fover f- \ F: x y z z rho-z
4 fpick f* \ F: x y z z x*(rho - z)
3 fpick f-
dy/dt f! \ F: x y z z
fdrop
beta f* fnegate
frot frot f* f+ dz/dt f!
;
0.1e 0.6e 4.0e derivs
In particular, I eliminated the additional memory accesses to DZ/DT.
Nice. FROT FROT is expensive on a memory-based FP stack, unless it is
optimized by the compiler, but for fpu stack use it's probably very
fast. I see that VFX Forth and iforth use a series of FXCH instructions
to implement FROT FROT.
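
To make the cost comparison concrete, here is a rough sketch
(illustrative only, not any particular system's implementation) of FROT
on a purely memory-resident FP stack, with sp[0] as the top item; the
three loads and three stores are what a memory FROT costs, while on the
x87 register stack a couple of FXCH (which current CPUs typically handle
in the register renamer) do the job:

static void frot(double *sp)   /* FROT ( F: r1 r2 r3 -- r2 r3 r1 ) */
{
    double r3 = sp[0], r2 = sp[1], r1 = sp[2];
    sp[0] = r1;
    sp[1] = r3;
    sp[2] = r2;
}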
Post by Anton Ertl
SwiftForth, VFX and iforth produce the expected results for your test
SwiftForth 4.0.0-RC87 VFX Forth 64 5.43 iforth-5.1-mini
ST(0) FLD FLD ST fld ST(0)
44E8BC ( f2over ) CALL CALL 0050A080 F2OVER fld [r13 0 +] tbyte
ST(0) ST(1) FSUBP FSUBP ST(1), ST fxch ST(1)
44E8FB ( sigma ) CALL CALL 0050A2BB SIGMA fld [r13 #16 +] tby
ST(0) ST(1) FMULP FMULP ST(1), ST lea r13, [r13 #32 +]
FCHS FCHS fxch ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFE8 [RIP] fxch ST(1)
RBX 0 [RBP] MOV CALL 0050A2FB RHO fld ST(3)
4C508 [RDI] RBX LEA FLD ST(1) fld ST(3)
0 [RBX] TBYTE FSTP FSUBP ST(1), ST fsubp ST(1), ST
0 [RBP] RBX MOV LEA RBP, [RBP+-08] fld $101BC720 tbyte
8 [RBP] RBP LEA MOV [RBP], RBX fmulp ST(1), ST
44E923 ( rho ) CALL MOV EBX, # 00000004 fchs
ST(1) FLD CALL 005030C0 FPICK fstp $10226470 tbyte
ST(0) ST(1) FSUBP FMULP ST(1), ST fld $101BC710 tbyte
-8 [RBP] RBP LEA LEA RBP, [RBP+-08] fld ST(1)
RBX 0 [RBP] MOV MOV [RBP], RBX fsubp ST(1), ST
4 # EBX MOV MOV EBX, # 00000003 fld ST(4)
43C901 ( FPICK ) CALL CALL 005030C0 FPICK fmulp ST(1), ST
ST(0) ST(1) FMULP FSUBP ST(1), ST fld ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFC1 [RIP] fsubp ST(1), ST
RBX 0 [RBP] MOV FSTP ST fstp $10226490 tbyte
3 # EBX MOV CALL 0050A33B BETA ffreep ST(0)
43C901 ( FPICK ) CALL FMULP ST(1), ST fld $101BC700 tbyte
ST(0) ST(1) FSUBP FCHS fmulp ST(1), ST
-8 [RBP] RBP LEA FXCH ST(1) fchs
RBX 0 [RBP] MOV FXCH ST(2) fxch ST(1)
4C530 [RDI] RBX LEA FXCH ST(1) fxch ST(2)
0 [RBX] TBYTE FSTP FXCH ST(2) fxch ST(1)
0 [RBP] RBX MOV FMULP ST(1), ST fxch ST(2)
8 [RBP] RBP LEA FADDP ST(1), ST fmulp ST(1), ST
ST(0) FSTP FSTP TBYTE FFF9CFB4 [RIP] fxch ST(1)
44E94B ( beta ) CALL RET/NEXT fpopswap,
ST(0) ST(1) FMULP faddp ST(1), ST
FCHS fstp $102264B0 tbyte
43C807 ( FROT ) CALL ;
43C807 ( FROT ) CALL
ST(0) ST(1) FMULP
ST(0) ST(1) FADDP
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
4C558 [RDI] RBX LEA
0 [RBX] TBYTE FSTP
0 [RBP] RBX MOV
8 [RBP] RBP LEA
RET
FPICK is apparently implemented on SwiftForth and VFX through an
indirect branch that branches to one of 8 variants of "FLD ST(...)",
while iForth manages to resolve this during compilation.
Good to see that x, y, z are not repeatedly fetched from memory.

For this example, the hardware fpu stack is sufficient. But it's easy
to see that the benefits of a hardware-only stack would diminish quickly
if the size of the problem increased even a small amount, and then the
programmer (or compiler) would have to keep careful track of how many
fpu registers are used.
Post by Anton Ertl
I have also looked at VFX 5.11 which uses XMM registers instead of the
FP stack, but it does not inline FP operations, so you mostly see a long
sequence of calls.
--
Krishna
Anton Ertl
2024-04-14 08:34:35 UTC
Permalink
Post by mhx
Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably.
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows by storing some
elements to memory on overflow and reloading them on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. By the time they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.

And apparently this feature was not considered important enough
to add a new architectural feature that would allow implementing the FP
stack extension to memory. I guess that the implementations of
Fortran and Algol-family languages (e.g., C) in the 1980s only used
the stack within an expression, so avoiding FP stack overflows with
compiler analysis (like you do) is relatively easy.
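
For instance (an illustrative example, not taken from any 1980s
compiler), the expression below needs at most two x87 stack slots under
a simple left-to-right evaluation, so a compiler that keeps values on
the stack only within one expression can verify statically that no
overflow is possible:

/* (a+b)*(c+d): e.g. fld a; fadd b; fld c; fadd d; fmulp -- max depth 2 */
double f(double a, double b, double c, double d)
{
    return (a + b) * (c + d);
}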

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
dxf
2024-04-14 10:05:33 UTC
Permalink
Post by Anton Ertl
Post by mhx
Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably.
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
Do you have a reference for that? Below is a paper written by one of the
designers and it doesn't appear to be mentioned. It's of course possible
to maintain a stack in software and use the FPU to do the calculations.
There are instructions to load/store Temp Real format to memory and that
gets a mention.

https://dl.acm.org/doi/pdf/10.1145/800053.801923
Anton Ertl
2024-04-14 11:25:07 UTC
Permalink
Post by dxf
Post by Anton Ertl
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
Do you have a reference for that?
Kahan writes about the original intention in

http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

especially starting at the last paragraph of page 2.

And about the bug (or rather design mistake):

https://history.siam.org/pdfs2/Kahan_final.pdf

Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
dxf
2024-04-14 13:35:24 UTC
Permalink
Post by Anton Ertl
Post by dxf
Post by Anton Ertl
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
Do you have a reference for that?
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
The latter sounds more like someone not getting his way than a design mistake.
In the first reference Kahan states:

"When the 8087 was designed, I knew that stack over/underflow was an issue of
more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."

Intel decided Kahan's aesthetic afterthought could be dispensed with. History
appears to have proven them correct. Were 8 levels of stack actually insufficient,
it would have made more sense for Intel to double it (if not for the 8087 then the
next incarnation) than to spill to memory, which was bad in every way.
Anton Ertl
2024-04-14 15:53:40 UTC
Permalink
Post by dxf
Post by Anton Ertl
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
The latter sounds like someone not getting his way more than a design mistake.
"When the 8087 was designed, I knew that stack over/underflow was an issue of
more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."
Intel decided Kahan's aesthetic afterthought could be dispensed with.
In a way, they did, and Kahan obviously did not get his way. But to
me it sounds like they tried and failed at implementing a stack that
extends into memory. The tags that indicate the presence of a stack
item are there. If they had made a conscious decision at the start to
dispense with the idea of an extensible stack, they would have
discpensed with these bits indicating the presence of a stack item as
well. So what happened is that they botched the first attempt, and
then decided that they did not want to do what would have been
necessary to fix it.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
dxf
2024-04-15 02:07:47 UTC
Permalink
Post by Anton Ertl
Post by dxf
Post by Anton Ertl
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
The latter sounds like someone not getting his way more than a design mistake.
"When the 8087 was designed, I knew that stack over/underflow was an issue of
more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."
Intel decided Kahan's aesthetic afterthought could be dispensed with.
In a way, they did, and Kahan obviously did not get his way. But to
me it sounds like they tried and failed at implementing a stack that
extends into memory. The tags that indicate the presence of a stack
item are there. If they had made a conscious decision at the start to
dispense with the idea of an extensible stack, they would have
discpensed with these bits indicating the presence of a stack item as
well. So what happened is that they botched the first attempt, and
then decided that they did not want to do what would have been
necessary to fix it.
My impression is that Palmer (the mathematician Intel hired to co-head the
project) was trying to placate Kahan and it fell through for various
reasons.

The design criterion that never changed was the 8-level hardware stack.
Forthers can either accept it for best performance - or pick something
more forgiving at lower performance.
Krishna Myneni
2024-04-15 08:47:33 UTC
Permalink
On 4/14/24 21:07, dxf wrote:
...
Post by dxf
The design criteria that never changed was the 8-level hardware stack.
Forthers can either accept it for best performance - or pick something
more forgiving at a lesser performance.
In the Lorenz equation example, which works with the 8 deep fpu stack,
we have assumed that the fpu hardware stack was empty before calling
DERIVS. In a real use case, the call to DERIVS is likely to occur within
a deeper call chain, resulting in items already on the fpu stack before
args for DERIVS are pushed. As Marcel said, using only a hardware-based
fp stack is not realistic for any non-trivial floating point work.

The loss of performance with a memory-based fp stack is far less a
concern than having to consider the limited stack depth when writing
code involving floating point arithmetic. Failure from overflowing the
fpu stack is silent. Debugging is likely to be a nightmare.

--
Krishna
minforth
2024-04-15 09:35:22 UTC
Permalink
In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087's improvements
came from its new fp operators; the stack was unusable.

And if CPU-based stacks were so lucrative for high performance,
CPU makers would have implemented them long ago for normal
integer data.
a***@spenarnc.xs4all.nl
2024-04-15 10:37:11 UTC
Permalink
Post by minforth
In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087 improvements
were caused by its new fp operators, the stack was unusable.
And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.
The iA64 comes to mind. Apparently a failure, but was it a
technical or a commercial failure?

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
minforth
2024-04-15 11:37:00 UTC
Permalink
Post by a***@spenarnc.xs4all.nl
Post by minforth
In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087 improvements
were caused by its new fp operators, the stack was unusable.
And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.
The iA64 comes to mind. Apparently a failure but was a
technical or commercial failure?
Both, i.e. a poor developer tool stack and strong competition from AMD.
And it was overly complex.
https://softwareengineering.stackexchange.com/questions/279334/why-was-the-itanium-processor-difficult-to-write-a-compiler-for
Anton Ertl
2024-04-15 14:13:27 UTC
Permalink
Post by a***@spenarnc.xs4all.nl
Post by minforth
And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.
They have: SPARC has a stack of integer register windows, AMD29K and
IA-64 have a register stack of 128 integer registers. However, the
29K died in the early 1990s, IA-64 always remained niche and has been
killed (last order date in January 2020), and SPARC delivered OoO too
late to save it, and the last new SPARC designs were introduced in
2017.
Post by a***@spenarnc.xs4all.nl
The iA64 comes to mind. Apparently a failure but was a
technical or commercial failure?
It was a great commercial success. Several companies killed their
RISCs in favour of IA-64 based just on the IA-64 roadmaps, and when
IA-64 failed to achieve the predicted technical and market
superiority, they switched to Intel's AMD64 CPUs: HP, DEC (bought by
Compaq bought by HP), SGI. Apple switched to Intel's AMD64 CPUs
without the IA-64 step, so maybe it would have happened for the others
as well without IA-64, but maybe one of the others would have done
what ARM and Apple did 15-20 years later.

Technically, IA-64 was a bet on in-order execution with compiler
scheduling being superior to hardware scheduling (out-of-order
execution). That was a bet on the wrong horse, as the superior
hardware branch prediction allows OoO hardware to outperform IA-64 on
branchy code, while SIMD and GPGPUs eat IA-64's lunch on data-parallel
code.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
dxf
2024-04-15 12:14:16 UTC
Permalink
...
Post by dxf
The design criteria that never changed was the 8-level hardware stack.
Forthers can either accept it for best performance - or pick something
more forgiving at a lesser performance.
In the Lorenz equation example, which works with the 8 deep fpu stack, we have assumed that the fpu hardware stack was empty before calling DERIVS. In a real use case, the call to DERIVS is likely to occur within a deeper call chain, resulting in items already on the fpu stack before args for DERIVS are pushed. As Marcel said, using only a hardware-based fp stack is not realistic for any non-trivial floating point work.
The loss of performance with a memory-based fp stack is far less a concern than having to consider the limited stack depth when writing code involving floating point arithmetic. Failure from overflowing the fpu stack is silent. Debugging is likely to be a nightmare.
Likely the designers never considered Forth. OTOH ANS-Forth did and said
6 items ought to be good enough for anyone :)
Anton Ertl
2024-04-15 14:09:28 UTC
Permalink
Post by Krishna Myneni
Failure from overflowing the
fpu stack is silent.
Reality check:

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041

SwiftForth also seems to notice it in some way, but does not report it
as an error:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Krishna Myneni
2024-04-15 22:02:40 UTC
Permalink
Post by Anton Ertl
Post by Krishna Myneni
Failure from overflowing the
fpu stack is silent.
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041
SwiftForth also seems to notice it in some way, but does not report it
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok
I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no
exception.

--
Krishna

== begin fpu-stack-overflow.4th ==
\ fpu-stack-overflow.4th
\ for use with kforth32

include ans-words
include strings
include modules
include syscalls
include mc
include asm-x86

code fpu-stack-overflow
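\ nine FLD1s: the ninth overflows the 8-register x87 stack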
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
end-code

fpu-stack-overflow


== end fpu-stack-overflow.4th ==

== begin example ==
$ kforth32
kForth-32 v 2.4.5 (Build: 2024-03-30)
Copyright (c) 1998--2023 Krishna Myneni
Contributions by: dpw gd mu bk abs tn cmb bg dnw
Provided under the GNU Affero General Public License, v3.0 or later

include fpu-stack-overflow
ok
== end example ==
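
For what it is worth, the x87 invalid-operation exception (which a
stack fault sets together with the SF bit) can be unmasked so that
overflowing the stack does trap. A minimal C sketch, assuming x86 Linux
with glibc; the file name and code are illustrative, not kForth's:

/* compile with: gcc x87trap.c -lm   (hypothetical file name) */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    feenableexcept(FE_INVALID);      /* unmask IE in the x87 control word and MXCSR */
    for (int i = 0; i < 9; i++)      /* the ninth FLD1 overflows ST(0)..ST(7) */
        __asm__ volatile ("fld1");   /* deliberately clobbers the x87 stack */
    __asm__ volatile ("fwait");      /* deferred x87 exceptions are delivered here */
    puts("no trap");                 /* not reached if SIGFPE is delivered */
    return 0;
}

This only sketches the mechanism; it says nothing about what kForth's
own FPU initialization does to the control word.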
minforth
2024-04-16 01:10:42 UTC
Permalink
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
Krishna Myneni
2024-04-16 01:41:43 UTC
Permalink
Post by minforth
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
The FPU has maskable interrupts for arithmetic -- see

https://github.com/mynenik/kForth-32/blob/master/forth-src/fpu-x86.4th

But, yes, I'm not aware of any interrupts from stack ops.

--
Krishna
Anton Ertl
2024-04-16 05:53:15 UTC
Permalink
Post by minforth
X87 mode has long been deprecated
Citation needed.
Post by minforth
and replaced by SSE2.
I just tried compiling the following program with gcc with different
options:

float sfplus(float a, float b)
{
  return a+b;
}

double dfplus(double a, double b)
{
  return a+b;
}

long double lfplus(long double a, long double b)
{
  return a+b;
}

what I got was:

sfplus()  dfplus()  lfplus()
387       387       387       gcc -m32 -O
SSE2      SSE2      387       gcc -m64 -O
387       387       387       gcc -m64 -mfpmath=387 -O

The System V calling convention for AMD64 passes and returns float and
double in xmm registers, so the last option leads to moving the values
between the xmm register and the 387 FP stack through memory, e.g.:

000000000000001f <dfplus>:
1f: f2 0f 11 44 24 f0 movsd %xmm0,-0x10(%rsp)
25: f2 0f 11 4c 24 f8 movsd %xmm1,-0x8(%rsp)
2b: dd 44 24 f0 fldl -0x10(%rsp)
2f: dc 44 24 f8 faddl -0x8(%rsp)
33: dd 5c 24 f0 fstpl -0x10(%rsp)
37: f2 0f 10 44 24 f0 movsd -0x10(%rsp),%xmm0
3d: c3 retq

By contrast, the middle option produces:

0000000000000005 <dfplus>:
5: f2 0f 58 c1 addsd %xmm1,%xmm0
9: c3 retq

and the first option:

00000009 <dfplus>:
9: dd 44 24 04 fldl 0x4(%esp)
d: dc 44 24 0c faddl 0xc(%esp)
11: c3 ret

The difference between the first and last option is due to the
differences in calling convention.

gcc also has the option -mfpmath=sse,387 which tells the compiler that
it can use both. A small experiment only resulted in using SSE2, but
maybe if I had used code with more values alive at the same time it
would have used both.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
minforth
2024-04-16 06:58:37 UTC
Permalink
Sure, x87 mode is still there for backwards compatibility and you can
instruct compilers to use it, e.g. for 80-bit floats. Godbolt is your friend.
I seem to remember the "deprecation" had to do with inefficient load/store
mechanisms between memory and x87 registers.

I can't remember or name an exact citation but found a somewhat related
discussion:
https://retrocomputing.stackexchange.com/questions/9751/did-any-compiler-fully-use-intel-x87-80-bit-floating-point
mhx
2024-04-16 07:21:16 UTC
Permalink
Anton Ertl wrote:
[..]
Post by Anton Ertl
gcc also has the option -mfpmath=sse,387 which tells the compiler that
it can use both.
Nice to know! Last time I checked, only the Intel compiler could do that.

-marcel
Krishna Myneni
2024-04-17 11:24:15 UTC
Permalink
Post by minforth
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating-point arithmetic, e.g. no transcendental functions.

--
Krishna
minforth
2024-04-17 13:07:22 UTC
Permalink
Post by Krishna Myneni
Post by minforth
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.
Hardware x87 support isn't necessarily faster:
https://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

But I think the main advantage lies in the possibility of parallel and/or
vectorized execution.

And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs.
mhx
2024-04-17 17:37:45 UTC
Permalink
[..]
Post by minforth
https://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf
.. which shows it also can be a lot slower. And if I'm not mistaken, it can't
deliver the 80 bits of the FPU without breaking down by a factor of at least 1.3.
Post by minforth
But I think the main advantage lies in the possibility of parallel and/or
vectorized execution.
I have not yet seen algorithms where that would bring something. It might when
all other code is done properly with SSE. An on-the-fly reconfigured FPGA
(available on some microcontrollers) might be a better idea :--)
Post by minforth
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs.
It all depends, that is what I like about Forth.

-marcel
minforth
2024-04-17 19:16:47 UTC
Permalink
Post by mhx
Post by minforth
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs.
It all depends, that is what I like about Forth.
True, it all depends. Copied from a Visual Studio documentation:

Many of the floating point math library functions have different implementations
for different CPU architectures. For example, the 32-bit x86 CRT may have a
different implementation than the 64-bit x64 CRT. In addition, some of the
functions may have multiple implementations for a given CPU architecture. The
most efficient implementation is selected dynamically at run-time depending on
the instruction sets supported by the CPU. For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used.
When running on a CPU that does not support SSE2, the slower x87 implementation
is used.
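
For illustration (a sketch, not Microsoft's CRT code), the dispatch
described above boils down to checking a CPU feature once and then
calling through a function pointer; here with GCC/Clang on x86, using
__builtin_cpu_supports and two stand-in implementations (link with -lm):

#include <math.h>
#include <stdio.h>

static double sin_x87(double x)             /* stand-in for an x87 path */
{
    double r;
    __asm__ ("fsin" : "=t" (r) : "0" (x));  /* x87 FSIN instruction */
    return r;
}

static double sin_sse2(double x)            /* stand-in for an SSE2 path */
{
    return sin(x);                          /* libm's sin, SSE2 or better on x86-64 */
}

static double (*sin_impl)(double);

int main(void)
{
    __builtin_cpu_init();
    sin_impl = __builtin_cpu_supports("sse2") ? sin_sse2 : sin_x87;
    printf("%.17g\n", sin_impl(1.0));
    return 0;
}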
mhx
2024-04-18 07:23:48 UTC
Permalink
minforth wrote:
[..]
[..] For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used.
When running on a CPU that does not support SSE2, the **slower**
[ my emphasis -mhx ] x87 implementation is used.
This strikes me as showing a strong bias (i.e. not strictly based on technical
arguments) towards SSE2. I've noticed before that Microsoft has a dislike for
the x87 FPU, if not boycotting it outright (e.g. no long double in their
compiler).

-marcel
minforth
2024-04-18 08:08:07 UTC
Permalink
Post by mhx
[..]
[..] For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used.
When running on a CPU that does not support SSE2, the **slower**
[ my emphasis -mhx ] x87 implementation is used.
This strikes me as showing a strong bias (i.e. not strictly based on technical
arguments) towards SSE2. I've noticed before that Microsoft has a dislike for
the x87 FPU, if not boycotting it outright (e.g. no long double in their
compiler).
Could be. At least gcc has support for __float80. FWIW Intel's icc even has a
compiler flag for x87 stack overflow warnings. It all depends. ;-)
mhx
2024-04-18 09:09:19 UTC
Permalink
[..]
Post by minforth
Could be. At least gcc has support for _float80. FWIW Intel's icc even has a
compiler flag for x87 stack overflow warnings.
[..]

At runtime?

-marcel
minforth
2024-04-18 09:27:07 UTC
Permalink
Post by mhx
[..]
Post by minforth
Could be. At least gcc has support for _float80. FWIW Intel's icc even has a
compiler flag for x87 stack overflow warnings.
[..]
At runtime?
AFAIK they check if computation results are popped off the x87 stack and put
into the xmm0 register. Could be a flag for compile-time checking.
minforth
2024-04-18 09:51:55 UTC
Permalink
Post by minforth
Post by mhx
[..]
Post by minforth
Could be. At least gcc has support for _float80. FWIW Intel's icc even has a
compiler flag for x87 stack overflow warnings.
[..]
At runtime?
AFAIK they check if computation results are popped off the x87 stack and put
into the xmm0 register. Could be a flag for compile-time checking.
Correction. It generates runtime code
https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/fp-stack-check-qfp-stack-check.html

but it doesn't look very convincing to me.
a***@spenarnc.xs4all.nl
2024-04-18 09:04:46 UTC
Permalink
Post by mhx
[..]
[..] For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used.
When running on a CPU that does not support SSE2, the **slower**
[ my emphasis -mhx ] x87 implementation is used.
This strikes me as showing a strong bias (i.e. not strictly based on technical
arguments) towards SSE2. I've noticed before that Microsoft has a dislike for
the x87 FPU, if not boycotting it outright (e.g. no long double in their
compiler).
Microsoft are no hobbyists. There has to be a commercial incentive to
take it up, that is not a boycott. OTOH the gcc folks like a
challenge, resulting in a baroque building.
Post by mhx
-marcel
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
Anton Ertl
2024-04-20 15:58:03 UTC
Permalink
Post by mhx
Post by minforth
But I think the main advantage lies in the possibility of parallel and/or
vectorized execution.
I have not yet seen algorithms where that would bring something.
Matrix multiplication is an easy case. I have also done a version of
Jon Bentley's greedy TSP program that benefitted from SSE and AVX; I
had to use assembly language to do this, however; see the thread
starting at <***@mips.complang.tuwien.ac.at>.

OTOH, yesterday I saw what gcc did for the inner loop of the bubble
benchmark from the Stanford integer benchmarks:

  while ( i<top ) {

    if ( sortlist[i] > sortlist[i+1] ) {
      j = sortlist[i];
      sortlist[i] = sortlist[i+1];
      sortlist[i+1] = j;
    };
    i=i+1;
  };

  top=top-1;
};

gcc-12.2 -O1 produces straightforward scalar code, while gcc-12.2 -O3 wants
to use SIMD instructions:

    gcc -O1                       gcc -O3
1c: add  $0x4,%rax            c0: movq   (%rax),%xmm0
    cmp  %rsi,%rax                add    $0x1,%edx
    je   35                       pshufd $0xe5,%xmm0,%xmm1
25: mov  (%rax),%edx              movd   %xmm0,%edi
    mov  0x4(%rax),%ecx           movd   %xmm1,%ecx
    cmp  %ecx,%edx                cmp    %ecx,%edi
    jle  1c                       jle    e1
    mov  %ecx,(%rax)              pshufd $0xe1,%xmm0,%xmm0
    mov  %edx,0x4(%rax)           movq   %xmm0,(%rax)
    jmp  1c                   e1: add    $0x4,%rax
35:                               cmp    %r8d,%edx
                                  jl     c0

The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast. I think that the reason is that
the movq towards the end stores two items, and the movq at the start
of the next iteration loads one of these items, i.e., there is partial
overlap between the store and the load. In this case the hardware
takes a slow path, which means that the slowdown is much bigger than
the instruction count suggests.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Paul Rubin
2024-04-20 19:00:00 UTC
Permalink
Post by Anton Ertl
The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast.
Wow, that seems worth a bug report. How about gcc -O2 ?
Anton Ertl
2024-04-21 06:45:29 UTC
Permalink
Post by Paul Rubin
Post by Anton Ertl
The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast.
Wow, that seems worth a bug report.
At least the compiled code is working as the programmer intended. gcc
people regularly resolve bug reports as "INVALID" where earlier gcc
versions work as intended and later gcc versions do not. And then
there are the cases where they just do nothing, as on PR93811.
Post by Paul Rubin
How about gcc -O2 ?
I have not tried it.

If you want to measure it and/or report this as a bug, you can find
the source code in <http://www.complang.tuwien.ac.at/forth/bench.zip>
in the directory c-manual.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Anton Ertl
2024-04-21 09:12:54 UTC
Permalink
Post by Anton Ertl
OTOH, yesterday I saw what gcc did for the inner loop of the bubble
while ( i<top ) {
if ( sortlist[i] > sortlist[i+1] ) {
j = sortlist[i];
sortlist[i] = sortlist[i+1];
sortlist[i+1] = j;
};
i=i+1;
};
top=top-1;
};
gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants
gcc -O1 gcc -O3
1c: add $0x4,%rax c0: movq (%rax),%xmm0
cmp %rsi,%rax add $0x1,%edx
je 35 pshufd $0xe5,%xmm0,%xmm1
25: mov (%rax),%edx movd %xmm0,%edi
mov 0x4(%rax),%ecx movd %xmm1,%ecx
cmp %ecx,%edx cmp %ecx,%edi
jle 1c jle e1
mov %ecx,(%rax) pshufd $0xe1,%xmm0,%xmm0
mov %edx,0x4(%rax) movq %xmm0,(%rax)
jmp 1c e1: add $0x4,%rax
35: cmp %r8d,%edx
jl c0
The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast. I think that the reason is that
the movq towards the end stores two items, and the movq at the start
of the next iteration loads one of these items, i.e., there is partial
overlap between the store and the load. In this case the hardware
takes a slow path, which means that the slowdown is much bigger than
the instruction count suggests.
I was curious if a more recent Intel core had improved on that (and
maybe such a more recent Intel core was targeted by the "optimization"
that caused the slowdown), so I measured it on a P-core of a Core
i3-1315U. The results are as follows:

O1/bubble O3/bubble
424,248,952 2,061,809,866 cpu_core/cycles/
1,536,825,253 1,986,035,580 cpu_core/instructions/

So, more than a factor of 4 on this microarchitecture.

The differences in the topdown analysis are also interesting:

O1
1,177,188,340 cpu_core/topdown-retiring/ # 46.1% Retiring
279,332,826 cpu_core/topdown-bad-spec/ # 10.9% Bad Speculation
778,141,445 cpu_core/topdown-fe-bound/ # 30.5% Frontend Bound
319,237,516 cpu_core/topdown-be-bound/ # 12.5% Backend Bound
0 cpu_core/topdown-heavy-ops/ # 0.0% Heavy Operations
269,356,654 cpu_core/topdown-br-mispredict/ # 10.5% Branch Mispredict
269,356,654 cpu_core/topdown-fetch-lat/ # 10.5% Fetch Latency
59,857,034 cpu_core/topdown-mem-bound/ # 2.3% Memory Bound

O3
1,599,831,263 cpu_core/topdown-retiring/ # 12.9% Retiring
630,236,558 cpu_core/topdown-bad-spec/ # 5.1% Bad Speculation
533,277,087 cpu_core/topdown-fe-bound/ # 4.3% Frontend Bound
9,598,987,583 cpu_core/topdown-be-bound/ # 77.6% Backend Bound
280,169 cpu_core/topdown-heavy-ops/ # 0.0% Heavy Operations
630,236,558 cpu_core/topdown-br-mispredict/ # 5.1% Branch Mispredict
193,918,941 cpu_core/topdown-fetch-lat/ # 1.6% Fetch Latency
5,623,649,291 cpu_core/topdown-mem-bound/ # 45.5% Memory Bound

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
dxf
2024-04-18 07:53:45 UTC
Permalink
Post by minforth
...
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions
In addition to being available on most Intel x86 CPUs, x87 was cheap to
support. A relatively small amount of code was needed to implement the
assembler extensions:

https://pastebin.com/Md6BGWmj

There wasn't a good reason not to choose x87. If the hardware stack didn't
appeal, use a software stack. Performance will still be very good. I recall
pitting my 16-bit DTC forth with x87 against VFX using an fp intensive program/
benchmark. The difference was only a factor of 4. I couldn't believe it.
a***@spenarnc.xs4all.nl
2024-04-18 09:00:45 UTC
Permalink
Post by dxf
Post by minforth
...
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions
In addition to being available on most Intel x86 cpu's, x87 was cheap to
support. A relatively small amount of code was needed to implement the
https://pastebin.com/Md6BGWmj
There wasn't a good reason not to choose x87. If the hardware stack didn't
appeal, use a software stack. Performance will still be very good. I recall
pitting my 16-bit DTC forth with x87 against VFX using an fp intensive program/
benchmark. The difference was only a factor of 4. I couldn't believe it.
The loadable extension for ciforth runs 20 screens.
This is not politically correct. Only 80-bit floats, only 8 levels of stack depth.
(The assembler is required; it contains 2 screens of fp-related
instructions.)
The transcendental functions are very low cost.
Nice if you have an occasional cosine.
In implementing the floating point for the transputer, we had
to generate Chebyshev polynomials with a special UBASIC
program and a truckload of testing. At least they
accommodated ISO single and double precision because
that was the transputer baseline.

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
PMF
2024-04-17 13:08:01 UTC
Permalink
Post by Krishna Myneni
Post by minforth
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating-point arithmetic, e.g. no transcendental functions.
Yes that is right. There is a sqrt but nothing more.

In lxf64/ntf64 I use an external C library, specifically fdlibm53.

This claims to be within 1 ulp correct. My testing also suggests this.
fdlibm looks to be the base for most other math libraries.
I could have used libm from gcc but I wanted the same code for both
Linux and Windows.
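
For readers wondering what testing "within 1 ulp" looks like, here is a
minimal sketch (an illustration only, not the actual test harness used here):
it compares double-precision sin() against long double sinl() as a quick
reference; a serious check would compare against MPFR instead. Compile with -lm.

/* rough ulp distance between sin(x) and a higher-precision reference */
#include <math.h>
#include <stdio.h>

static double ulps_off(double x)
{
    double y = sin(x);                            /* value under test */
    long double r = sinl((long double)x);         /* ~64-bit reference */
    double ulp = nextafter(fabs(y), INFINITY) - fabs(y);   /* 1 ulp at y */
    return (double)(fabsl((long double)y - r) / ulp);
}

int main(void)
{
    for (double x = 0.5; x < 10.0; x += 1.0)
        printf("x = %4.1f  error = %.3f ulp\n", x, ulps_off(x));
    return 0;
}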

In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
8 stack items is too low to be useful for anything more complicated.
Complex numbers are an example that quickly eats up all 8 stack items.

Best Regards
Peter Fälth
Post by Krishna Myneni
--
Krishna
Stephen Pelc
2024-04-18 12:09:22 UTC
Permalink
Post by PMF
In lxf64/ntf64 I use an external C library, specifically fdlibm53.
This claims to be within 1 ulp correct. My testing also suggests this.
fdlibm looks to be the base for most other math libraries.
I could have used libm from gcc but I wanted the same code for both
Linux and Windows.
In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
8 stack items is to low to be useful for anything more complicated.
complex numbers is an example that quickly eats all 8 stack items.
I share your pain. In 2023, I spent far too long trying to satisfy FP users.
For VFX64, we ended up providing support for
1) 8 item 387 FP, internal FP stack,
2) 387 FP and FP stack in memory,
3) SSE2 FP and FP stack in memory.

Conversion of FP parameters to/from OS call interfaces is done
automagically by the EXTERN: declarations for system calls and
callbacks.

For a significant proportion of our users, 64 bit FP is inadequate,
and only 64 bit FP is available on CPUs other than AMD64/x64
unless you are using rare and expensive hardware.

In retrospect, if I were doing this again I would standardise on an
external double-double library (about 106 bits). In most cases that
we encounter, the desire for 387 FP is to gain the extra precision.
Since very few CPUs support quad precision natively, the most
obvious solution is a double-double library.
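
For readers who have not met the technique: the core of a double-double
library is a handful of error-free transformations. Below is a minimal sketch
of double-double addition in C (the textbook TwoSum plus a FastTwoSum
renormalisation; production libraries such as QD add more error terms, and
this is not anyone's actual product code):

#include <stdio.h>

/* Knuth's TwoSum: s + e == a + b exactly, with s = fl(a + b) */
static void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    double t = *s - a;
    *e = (a - (*s - t)) + (b - t);
}

/* "sloppy" double-double add: (ahi,alo) + (bhi,blo) -> (rhi,rlo),
   giving roughly 106 significand bits */
static void dd_add(double ahi, double alo, double bhi, double blo,
                   double *rhi, double *rlo)
{
    double s, e;
    two_sum(ahi, bhi, &s, &e);
    e += alo + blo;
    *rhi = s + e;                 /* FastTwoSum renormalisation */
    *rlo = e - (*rhi - s);
}

int main(void)
{
    double hi, lo;
    dd_add(1.0, 0.0, 1e-30, 0.0, &hi, &lo);
    printf("%.17g + %.17g\n", hi, lo);   /* the 1e-30 survives in lo */
    return 0;
}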

Stephen
--
Stephen Pelc, ***@vfxforth.com
MicroProcessor Engineering, Ltd. - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)78 0390 3612, +34 649 662 974
http://www.mpeforth.com
MPE website
http://www.vfxforth.com/downloads/VfxCommunity/
downloads
Krishna Myneni
2024-04-18 13:00:17 UTC
Permalink
On 4/18/24 07:09, Stephen Pelc wrote:
...
Post by Stephen Pelc
In retrospect, if I were doing this again I would standardise on an
external double-double library (about 106 bits). In most cases that
we encounter, the desire for 387 FP is to gain the extra precision.
Since very few CPUs support quad precision natively, the most
obvious solution is a double-double library.
I also thought double-double would be sufficient, until last year, when
I ran into having to code an application for which it was unsuitable.
The problem with double-double was not in the number of bits in the
significand (106) -- the problem was that the exponent range of double
precision was not large enough. I believe the double-double type even
reduces the available exponent range from that of a double type. In any
case I ended up using the MPFR library for this application. IEEE quad
precision would be more suitable, since it expands the exponent range.

--
Krishna
minforth
2024-04-18 17:10:17 UTC
Permalink
The Cephes math library supports extended and 128b floats
https://netlib.org/cephes/128bdoc.html

But gnu libquadmath had been good enough for me until now.
minforth
2024-05-11 05:37:36 UTC
Permalink
Post by Krishna Myneni
I also thought double-double would be sufficient, until last year, when
I ran into having to code an application for which it was unsuitable.
The problem with double-double was not in the number of bits in the
significand (106) -- the problem was that the exponent range of double
precision was not large enough. I believe the double-double type even
reduces the available exponent range from that of a double type. In any
case I ended up using the MPFR library for this application. IEEE quad
precision would be more suitable, since it expands the exponent range.
These days I am indulging myself by extending my fp number libraries:

AFAICS float80 (x87) and float128 share the same exponent range of 15 bits,
which is more than the 11 bits in float64 (double) numbers.

However, it is a mess with double-double arithmetic. Composed of a pair
of double numbers, their exponent range remains as small as that of
single doubles. And things vary between compilers and target CPUs.
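
To make the exponent-range point concrete, a small sketch (assumes gcc with
libquadmath; link with -lquadmath; the Q literal suffix is a gcc extension):
the head of a double-double pair is still just a double, so it overflows at
about 1.8e308 no matter how many significand bits the pair carries, while
binary128 with its 15-bit exponent goes on to about 1.2e4932.

#include <quadmath.h>
#include <stdio.h>

int main(void)
{
    double dd_hi = 1e308 * 10.0;       /* double-double head: overflows to inf */
    __float128 q = 1e308Q * 10.0Q;     /* binary128: 1e309, no problem */

    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.6Qg", q);
    printf("double-double head: %g   binary128: %s\n", dd_hi, buf);
    return 0;
}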
Krishna Myneni
2024-04-18 12:52:31 UTC
Permalink
Post by PMF
Post by Krishna Myneni
Post by minforth
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.
Yes that is right. There is a sqrt but nothing more.
In lxf64/ntf64 I use an external C library, specifically fdlibm53.
This claims to be within 1 ulp correct. My testing also suggests this.
fdlibm looks to be the base for most other math libraries.
I could have used libm from gcc but I wanted the same code for both
Linux and Windows.
In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
8 stack items is to low to be useful for anything more complicated.
complex numbers is an example that quickly eats all 8 stack items.
Best Regards
Peter Fälth
Thank you for the clarification. I disassembled the GNU C Math Library
(64-bit libm.so.6) and had a look at the code. It is a huge file. The
long double functions are coded for x87, while functions for float,
double, and float128 use the SSE instructions.

Does this imply that if x87 is deprecated on 64-bit systems, then 80-bit
floats are also deprecated?

Does fdlibm53 support 80-bit floats and float128?

--
Krishna
a***@spenarnc.xs4all.nl
2024-04-16 07:59:31 UTC
Permalink
Post by Krishna Myneni
Post by Anton Ertl
Post by Krishna Myneni
Failure from overflowing the
fpu stack is silent.
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041
SwiftForth also seems to notice it in some way, but does not report it
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok
I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no
exception.
--
Krishna
== begin fpu-stack-overflow.4th ==
fpu-stack-overflow.4th
\ for use with kforth32
include ans-words
include strings
include modules
include syscalls
include mc
include asm-x86
code fpu-stack-overflow
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
end-code
fpu-stack-overflow
== end fpu-stack-overflow.4th ==
== begin example ==
$ kforth32
kForth-32 v 2.4.5 (Build: 2024-03-30)
Copyright (c) 1998--2023 Krishna Myneni
Contributions by: dpw gd mu bk abs tn cmb bg dnw
Provided under the GNU Affero General Public License, v3.0 or later
include fpu-stack-overflow
ok
== end example ==
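
On the question of configuring the FPU so that the overflow is not silent:
one way (a sketch only, not kforth code; feenableexcept is a glibc extension
and the inline asm is x86-specific) is to unmask the invalid-operation
exception, which is what an x87 stack fault sets. Note that the resulting
SIGFPE is delivered at the next waiting FP instruction, not at the
overflowing FLD itself.

#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    feenableexcept(FE_INVALID);     /* unmask invalid-op in x87 CW and MXCSR */

    for (int i = 0; i < 9; i++)     /* 9 pushes onto the 8-deep x87 stack */
        __asm__ volatile ("fld1");

    __asm__ volatile ("fwait");     /* pending #MF raised here -> SIGFPE */

    puts("no exception raised");    /* reached only if the fault stayed masked */
    return 0;
}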
I tried this on ciforth. It crashes with the 10th item
not the 8th.

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
Krishna Myneni
2024-04-16 12:14:46 UTC
Permalink
Post by a***@spenarnc.xs4all.nl
Post by Krishna Myneni
Post by Anton Ertl
Post by Krishna Myneni
Failure from overflowing the
fpu stack is silent.
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041
SwiftForth also seems to notice it in some way, but does not report it
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok
I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no
exception.
--
Krishna
== begin fpu-stack-overflow.4th ==
fpu-stack-overflow.4th
\ for use with kforth32
include ans-words
include strings
include modules
include syscalls
include mc
include asm-x86
code fpu-stack-overflow
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
end-code
fpu-stack-overflow
== end fpu-stack-overflow.4th ==
== begin example ==
$ kforth32
kForth-32 v 2.4.5 (Build: 2024-03-30)
Copyright (c) 1998--2023 Krishna Myneni
Contributions by: dpw gd mu bk abs tn cmb bg dnw
Provided under the GNU Affero General Public License, v3.0 or later
include fpu-stack-overflow
ok
== end example ==
I tried this on ciforth. It crashes with the 10th item
not the 8th.
That may be for some other reason. The following code executes an
arbitrary number of FLD1 instructions:

code fpu-stack-overflow ( n -- -n )
0 [ebx] ecx mov, \ set loop count
0 # eax mov,
DO,
fld1,
eax dec,
LOOP,
eax 0 [ebx] mov,
0 # eax mov,
end-code

16384 fpu-stack-overflow .
\ end of prog

\ run it

include clf-code/fpu-stack-overflow

-16384
ok


--
Krishna
Post by a***@spenarnc.xs4all.nl
Groetjes Albert
a***@spenarnc.xs4all.nl
2024-04-18 09:11:17 UTC
Permalink
Post by Anton Ertl
Post by dxf
Post by Anton Ertl
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. However, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
Do you have a reference for that?
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
Interesting read. The reverse Polish calculator from HP was a
resounding success, with great profits.
Kahan contributed to this.
It was killed by the bean counters at HP.
There was huge demand but they refused to expand production.
Then the calculator died because it simply was not available.
Post by Anton Ertl
- anton
Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
dxf
2024-04-19 03:48:04 UTC
Permalink
Post by a***@spenarnc.xs4all.nl
...
Interesting read. The reverse polish calculator from HP was a
resounding success, with great profits.
Kahan contributed to this.
It was killed by the bean counters at HP.
There was huge demand but they refused to expand production.
Then the calculator died because it simply was not available.
We had quite a few HP instruments but then the bottom fell out of
the test instrument market (and maintenance generally). It was
embarrassing to see the once great HP pursue the consumer PC market.
a***@spenarnc.xs4all.nl
2024-04-14 11:21:14 UTC
Permalink
Post by Anton Ertl
[..]
Post by Anton Ertl
$1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
$10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
$10226032 ; 488B45004883C508FFE0
H.E.H.E..` ok
Post by Anton Ertl
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
immediately that that is unusable for serious work.
The scheme used is rather complicated. iForth uses the internal stack
when it can prove that there will be no under- or overflow. Non-inlined
calls (F. below) always use the memory stack.
FORTH> pi fvalue val1 pi/2 fvalue val2 ok
FORTH> : test val1 fdup val2 foo val1 f+ val2 f* f. ; ok
FORTH> see test
Flags: ANSI
$015FDA80 : test
$015FDA8A fld $015FD650 tbyte-offset
$015FDA90 fld ST(0)
$015FDA92 fld $015FD670 tbyte-offset
$015FDA98 faddp ST(1), ST
$015FDA9A fmulp ST(1), ST
$015FDA9C fld $015FD650 tbyte-offset
$015FDAA2 faddp ST(1), ST
$015FDAA4 fld $015FD670 tbyte-offset
$015FDAAA fmulp ST(1), ST
$015FDAAC fpush,
$015FDAB6 jmp F.+10 ( $0124ED42 ) offset NEAR
$015FDABB ;
Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably. The software
analysis works fine, but can be fooled in case of rather
contrived circumstances. I have not encountered a bug in the
past two decades.
This is a practical way.

I researched whether it is possible to detect whether the
circular stack overflows. There are instructions to
detect whether a position in this stack is occupied.
For a word that uses a stack 4 deep, you could detect whether
it is necessary to save words this way, I thought.
I couldn't make it work, because essential assembler instructions
are missing. (Or I'm not clever enough.)
Post by Anton Ertl
-marcel
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
mhx
2024-04-14 11:59:51 UTC
Permalink
Post by a***@spenarnc.xs4all.nl
[..]
This is a practical way.
I researched whether it is possible to detect whether the
circular stack overflows. There are instructions to
detect whether a position in this stack is occupied.
For a word that using a stack 4 deep, you could detect whether
it is necessary to save words this way, I thought.
I couldn't make it work, because essential assembler instruction
are missing. (Or I'm not clever enough.)
I vaguely remember something like that for the FPU stack (combined
with interrupts?). It falls under the category "I couldn't make
it work."

-marcel
a***@spenarnc.xs4all.nl
2024-04-14 11:12:19 UTC
Permalink
Post by Anton Ertl
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
1 FLOATS .
16 iforth
10 sf64
10 vfx64
For
: foo f+ f* ;
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )
I cut the same corners with ciforth. However, I think this
cannot be compliant with the IEEE requirements of the standard?
Post by Anton Ertl
- anton
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
Anton Ertl
2024-04-20 16:23:21 UTC
Permalink
Post by a***@spenarnc.xs4all.nl
I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?
Which IEEE requirement of which standard?

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
a***@spenarnc.xs4all.nl
2024-04-21 09:44:38 UTC
Permalink
Post by Anton Ertl
Post by a***@spenarnc.xs4all.nl
I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?
Which IEEE requirement of which standard?
The ISO 9x talks about

64-bit IEEE double-precision number
32-bit IEEE single-precision number

IEEE floating point number as defined in ANSI/IEEE Standard
754-1985

80-bit '87 is not compliant with either of these, I think.
Post by Anton Ertl
- anton
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
Anton Ertl
2024-04-21 10:57:40 UTC
Permalink
Post by a***@spenarnc.xs4all.nl
Post by Anton Ertl
Post by a***@spenarnc.xs4all.nl
I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?
Which IEEE requirement of which standard?
The ISO 9x talks about
Whatever "ISO 9x" may be: Why do you worry about it and mention it
here?
Post by a***@spenarnc.xs4all.nl
64-bit IEEE double-precision number
32-bit IEEE double-precision number
IEEE floating point number as defined in ANSI/IEEE Standard
754-1985
So?
Post by a***@spenarnc.xs4all.nl
80 bit '87 is not compliant with either of these, I think.
So?

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Buzz McCool
2024-05-15 18:37:10 UTC
Permalink
Gforth sticks out by using 8-byte FP values; ...
"Doubling back" to the beginning of this thread, does the above mean
that gforth uses IEEE binary64 aka double precision floating point
numbers to perform floating point math on the floating point stack?

This section https://gforth.org/manual/Floating-Point-Tutorial.html of
the gforth documentation doesn't come out and plainly say what floating
point precision (IEEE binary64 or IEEE binary32 or ... ) is used.
dxf
2024-05-16 01:43:05 UTC
Permalink
Gforth sticks out by using 8-byte FP values; ...
"Doubling back" to the beginning of this thread, does the above mean that gforth uses IEEE binary64 aka double precision floating point numbers to perform floating point math on the floating point stack?
This section https://gforth.org/manual/Floating-Point-Tutorial.html of the gforth documentation doesn't come out and plainly say what floating point precision (IEEE binary64 or IEEE binary32 or ... ) is used.
https://en.wikipedia.org/wiki/Dorothy_Dixer
Anton Ertl
2024-05-16 05:48:45 UTC
Permalink
Post by Buzz McCool
"Doubling back" to the beginning of this thread, does the above mean
that gforth uses IEEE binary64 aka double precision floating point
numbers to perform floating point math on the floating point stack?
It uses whatever the underlying C implementation uses as "double"
type. Given the success of the IEEE FP standard, that has been IEEE
binary64 format on all systems with DP FP hardware that Gforth has
been run on (that I know about).
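
For anyone who wants to confirm what that underlying C "double" is on their
platform, a trivial check (a sketch, not part of Gforth):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* on IEEE binary64 systems this prints: radix 2, 53 mantissa bits,
       8 bytes, max exponent 1024 */
    printf("radix %d, %d mantissa bits, %zu bytes, max exponent %d\n",
           FLT_RADIX, DBL_MANT_DIG, sizeof(double), DBL_MAX_EXP);
    return 0;
}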

On Alpha, the support for denormal results of FP operations depends on
a C compiler option; we use that option in the current source code and
the line that sets this option has not changed since 2000-02-04 (i.e.,
before gforth-0.5.0).

On ARM with softfloat (i.e., using integer instructions instead of
hardware FP instructions in order to support ARM hardware without FP
hardware), the FP checks have reported deviations from the expected
results, so they obviously do not perform the same operations as the
IEEE FP standard requires for binary64. I don't know if the format is
the same or different from IEEE binary64.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
Buzz McCool
2024-05-16 14:54:49 UTC
Permalink
[gforth] uses whatever the underlying C implementation uses as "double"
type.
Thank you for the response Anton.

I believe it would be helpful to have a statement like this somewhere in
https://gforth.org/manual/Floating-Point-Tutorial.html for your users
concerned about numerical precision.
https://en.wikipedia.org/wiki/Dorothy_Dixer
Dorothy Dxf :-)
