Anton Ertl
2021-01-08 18:49:39 UTC
On IA-32 and AMD64, iForth uses 80-bit floats (extended precision), at
least in the default configuration, and VFX's Ndp387.fth also uses
extended precision. On VFX you can change that by changing a single
line of code in Ndp387.fth, but VFX offers no way to tell it that you
want something else (changing code that other people update (and not
through a version control system) is a no-no).
Anyway, some time ago I found that VFX runs my matrix multiplication
code almost twice as fast with 64-bit floats (double precision) than
with 80-bit floats. Today I investigated this further
<***@mips.complang.tuwien.ac.at>
<***@mips.complang.tuwien.ac.at>, and here I repeat the
most interesting findings. Basically, I translated this loop
?do
fdup over f@ f* dup f+! float+ swap float+ swap
loop
into C (to avoid the slow counted loops of present-day Forth
compilers) and compiled it with gcc -O -mfpmath=387, resulting in the
following code:
long double                    double
fldt   (%rdi)                  fld    %st(0)
fmul   %st(1),%st              fmull  (%rdi)
fldt   (%rsi)                  faddl  (%rsi)
faddp  %st,%st(1)              fstpl  (%rsi)
fstpt  (%rsi)                  add    %rdx,%rdi
add    %rdx,%rdi               add    %rdx,%rsi
add    %rdx,%rsi               sub    $0x1,%rcx
sub    $0x1,%rcx               jne    f <axpy+0xf>
jne    9 <axpy+0x9>
7 cycles/iteration             2 cycles/iteration
The cycle results are from a Skylake microarchitecture (most Intel
CPUs since 2016). So you can see that extended precision can cost a
lot of performance. If you need the precision, go for it, but if you
don't, and need performance, using 64-bit floats can get you that.
It's probably a good default to go for precision, but going for
performance should not require doing a no-no.
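
The C source itself is not reproduced in this posting. As a rough
sketch only, a function of the following shape could plausibly compile
to loops like the ones shown. The name axpy comes from the labels in
the disassembly; the argument order, the byte stride (suggested by the
add %rdx instructions), and the FLOAT_T macro are assumptions made for
illustration, not the original source.

```c
#include <assert.h>
#include <stddef.h>

/* FLOAT_T is a hypothetical switch between the two variants:
   the default is double; build with -DFLOAT_T="long double"
   to get the 80-bit extended-precision version. */
#ifndef FLOAT_T
#define FLOAT_T double
#endif

/* Computes y[i] += a * x[i] for n elements.
   stride is the distance between consecutive elements in bytes
   (an assumption based on the pointer increments in the disassembly). */
void axpy(FLOAT_T a, const FLOAT_T *x, FLOAT_T *y, size_t stride, size_t n)
{
    assert(x != NULL && y != NULL);
    while (n-- > 0) {
        *y += a * *x;                                   /* fdup over f@ f* dup f+! */
        x = (const FLOAT_T *)((const char *)x + stride); /* float+ on the source */
        y = (FLOAT_T *)((char *)y + stride);             /* float+ on the target */
    }
}
```

Compiling this once with gcc -O -mfpmath=387 -c and once more with
-DFLOAT_T="long double" added, then inspecting the object files with
objdump -d, should show the two loop variants; the exact instructions
may differ between gcc versions.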
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020