Friday, 3 July 2009

Numerical Precision

Numerical precision is an ongoing concern of mine, especially in big, long-running simulations and solvers.

I came across an article by Rob Farber on the site this morning that asks the question "How much is Enough?". Although no definitive answers are presented, the author summarizes the current and future concerns over accuracy.

Personally I don't believe floating point is the way forward. Floating point is fast to calculate in hardware but is not always an ideal way of representing numbers. Although the various branches of mathematics are largely base independent, humans are most comfortable with base 10 while computers are, of course, most comfortable with base 2. As a result there are situations where a calculation in base 10 with only a few decimal digits gives an exact result, whereas the same calculation in base 2 cannot give an exact result with any finite number of bits, although the result is probably acceptable after enough bits. The decimal fraction 0.1 is one example: it needs a single digit in base 10 but is a repeating fraction in base 2.
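To make the base 2 versus base 10 point concrete, here is a minimal Python sketch. It is only an illustration: the variable names are my own, and Python's `decimal` and `fractions` modules simply stand in for "a calculation done in base 10" and "exact rational arithmetic".

```python
from decimal import Decimal
from fractions import Fraction

# Binary floating point: 0.1, 0.2 and 0.3 have no finite base-2 expansion,
# so each literal is rounded to the nearest double before any arithmetic.
binary_sum = 0.1 + 0.2
binary_is_exact = (binary_sum == 0.3)

# Base-10 arithmetic: the same inputs are stored exactly, so the sum is too.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_is_exact = (decimal_sum == Decimal("0.3"))

# The exact rational value of the double nearest to 0.1, and how far it is
# from the true value 1/10.
stored_tenth = Fraction(0.1)
representation_error = stored_tenth - Fraction(1, 10)

print("binary sum exact: ", binary_is_exact)
print("decimal sum exact:", decimal_is_exact)
print("error in stored 0.1:", float(representation_error))
```

The denominator of `stored_tenth` is a power of two, which is why no number of bits can ever hit 1/10 exactly.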

I'm not presenting any solution to the precision problem, merely pointing out that sometimes the issue is caused by using base 2 for calculations and/or by the floating point representation of these numbers.


  1. Could always attempt to go with 64-bit integer math, however you run the risk of integer multiply not being fully pipelined on some archs. Perhaps what you really want is a fused integer multiply add with intermediate shifts...

  2. Back in 'ye olde days' integer maths was almost always faster so using custom fixed point integer maths routines was a good idea both for accuracy and speed. Of course if you implemented your own fixed point you could relatively easily extend the precision as required. You still find quite a few big number libraries out on the internet these days.

    If you want performance though, floating point is built into the hardware so is usually faster. And in CUDA land 32-bit muls are very slow, hence the 24-bit operations. When you get to double precision, the current generation Nvidia GPUs only have 1 DP unit per multiprocessor, so it *may* actually be faster to use 64- or 128-bit fixed point integer maths if you structure the code correctly. Something I'd love to test at some stage.

    The IEEE 754r specification is also worth looking at as it does address some of the issues in the earlier spec.
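The fixed point integer maths mentioned in the comments can be sketched roughly as follows. This is a hedged illustration, not anyone's actual routine: the helper names (`to_fixed`, `fixed_mul`) and the choice of 32 or 96 fractional bits are assumptions of mine. Python integers are arbitrary precision, so the sketch sidesteps the overflow handling that a real 64- or 128-bit implementation would need for the intermediate product.

```python
from fractions import Fraction

def to_fixed(numerator, denominator, frac_bits):
    """Convert numerator/denominator (denominator > 0) to fixed point,
    rounding to the nearest representable value."""
    scale = 1 << frac_bits
    return (numerator * scale + denominator // 2) // denominator

def fixed_mul(a, b, frac_bits):
    """Multiply two fixed point values. The raw product carries twice the
    fractional bits, so add half an ulp and shift back down."""
    half = 1 << (frac_bits - 1)
    return (a * b + half) >> frac_bits

def fixed_error(value, exact, frac_bits):
    """Exact absolute error of a fixed point value against a Fraction."""
    return abs(Fraction(value, 1 << frac_bits) - exact)

# 1.5 * 2.25 = 3.375; all three are exact in binary fixed point.
product = fixed_mul(to_fixed(3, 2, 32), to_fixed(9, 4, 32), 32)

# Extending the precision is just a matter of choosing more fractional bits.
third = Fraction(1, 3)
err_32 = fixed_error(to_fixed(1, 3, 32), third, 32)
err_96 = fixed_error(to_fixed(1, 3, 96), third, 96)

print(product / (1 << 32))
print(float(err_32), float(err_96))
```

Note that 1/3 (like 1/10) is still inexact here, since this is still base 2; the gain is that the error bound is fixed by the number of fractional bits you choose rather than by the magnitude of the value.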