Tuesday, May 3, 2016

Numerical pitfalls in computing variance

One of the most common tasks in statistical computing is computing the sample variance. This would seem to be straightforward; there are a number of algebraically equivalent ways of representing the sum of squares \(S\), such as \[ S = \sum_{k=1}^n ( x_k - \bar{x})^2 \] or \[ S = \sum_{k=1}^n x_k^2 - \frac{1}{n}\left(\sum_{k=1}^n x_k\right)^2 \] and the sample variance is simply \(S/(n-1)\).
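
As a concrete illustration, here is a minimal sketch of both formulas in Python (the function names `sum_sq_two_pass`, `sum_sq_one_pass`, and `sample_variance` are mine, chosen for this example):

```python
def sum_sq_two_pass(x):
    """Two-pass formula: S = sum((x_k - xbar)^2)."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xk - xbar) ** 2 for xk in x)

def sum_sq_one_pass(x):
    """One-pass ("textbook") formula: S = sum(x_k^2) - (sum(x_k))^2 / n."""
    n = len(x)
    s = sum(x)
    s2 = sum(xk ** 2 for xk in x)
    return s2 - s * s / n

def sample_variance(x, sum_sq=sum_sq_two_pass):
    """Sample variance S / (n - 1)."""
    return sum_sq(x) / (len(x) - 1)
```

In exact arithmetic the two versions agree; the question is how they behave in floating point.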

What is straightforward algebraically, however, is not always straightforward in the floating-point arithmetic used by computers. Floating-point numbers carry only finite precision, and arithmetic operations, particularly the subtraction of nearly equal quantities, can lose that precision in unexpected ways.
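
For example, here is a small data set I have made up to make the problem visible: a sample with true variance 30, shifted far from zero. The two-pass formula returns the right answer in double precision, while the one-pass formula is dominated by rounding error:

```python
# Data 4, 7, 13, 16 (sample variance 30.0) shifted by 1e9;
# the shift does not change the true variance.
x = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

print(sample_variance(x, sum_sq_two_pass))  # 30.0
print(sample_variance(x, sum_sq_one_pass))  # typically far from 30.0:
                                            # the squares are ~1e18, so subtracting
                                            # two nearly equal large numbers cancels
                                            # almost all significant digits
```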