Problem 1: Fitting a straight line to simple data

Let us assume that some theory tells us that we should expect a linear relationship to exist between some physical quantities x and y where, unfortunately, the slope and zero-point of the relationship cannot be predicted by the theory; we need to use observations and statistics to estimate these missing parameters. We want to find the straight line of the form y = mx + b that "best" describes our data set, which consists of the N observed data points $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$. We can write this as

$$ y_i \doteq m\,x_i + b, \qquad i = 1, \dots, N, $$
where I use the symbol "$\doteq$" to represent the phrase "is fitted to." For the time being, we must assume that only the observable quantities $y_i$ contain observational errors; we must be certain that the quantities $x_i$ are known perfectly, or at least that the errors in the x's are negligible in comparison to the errors in the y's. If the observational errors in the quantities $y_i$, which I will denote by $\epsilon_i$, follow the normal error distribution,

$$ P(\epsilon_i) = \frac{1}{\sqrt{2\pi}}\,\exp\!\left(-\frac{\epsilon_i^2}{2}\right) $$

(assuming that the standard error $\sigma = 1$),

where

$$ \epsilon_i \equiv (m\,x_i + b) - y_i, $$
then the Likelihood, $\mathcal{L}$, of obtaining these fitting residuals is

$$ \mathcal{L} = \prod_{i=1}^{N} P(\epsilon_i) = (2\pi)^{-N/2}\,\exp\!\left(-\frac{1}{2}\sum_{i=1}^{N}\epsilon_i^2\right). $$
This comes straight from elementary probability theory: if $P(\epsilon_1)$ is the probability that the observed quantity $y_1$ is wrong by an amount $\epsilon_1$, and $P(\epsilon_2)$ is the probability that the observed quantity $y_2$ is wrong by an amount $\epsilon_2$, and if the error that you made in $y_1$ is independent of the error that you made in $y_2$, then the probability of making errors $\epsilon_1$ and $\epsilon_2$ simultaneously is $P(\epsilon_1)\cdot P(\epsilon_2)$. This principle may be generalized to give the probability of observing the full set of observational errors $\epsilon_1, \epsilon_2, \dots, \epsilon_N$, which results in the equation for $\mathcal{L}$ that I just gave. The "method of maximum likelihood" seeks the values $\hat m$ and $\hat b$ for which $\mathcal{L}$ is maximized, since these are the most likely values of m and b, given the current set of data. (I use the "^" symbol to denote the specific estimate of a parameter which has been derived from the data by some statistical method; without the "^", the parameter is considered to be a variable.) Clearly, $\mathcal{L}$ is maximized when $\sum_{i=1}^{N}\epsilon_i^2$ is minimized: hence, for an assumed Gaussian error distribution, the method of maximum likelihood becomes the so-called "method of least squares." For this sum, we often use the symbolic representation

$$ \chi^2 \equiv \sum_{i=1}^{N}\epsilon_i^2 = \sum_{i=1}^{N}\left[(m\,x_i + b) - y_i\right]^2. $$
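The equivalence between maximizing the likelihood and minimizing the sum of squared errors can be checked numerically. Here is a minimal Python sketch (the function names are my own, purely for illustration), assuming unit-standard-error Gaussian errors as in the text:

```python
import math

def gaussian_likelihood(errors):
    # product of unit-sigma normal densities, one factor per fitting error
    return math.prod(math.exp(-e * e / 2) / math.sqrt(2 * math.pi)
                     for e in errors)

def sum_of_squares(errors):
    # the quantity that the method of least squares minimizes
    return sum(e * e for e in errors)
```

For any two candidate parameter sets evaluated on the same data, the one with the smaller sum of squared errors always has the larger likelihood, which is the content of the equivalence.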
Then,

$$ \mathcal{L} = (2\pi)^{-N/2}\,e^{-\chi^2/2}. $$

$\chi^2$ is at a minimum and the Likelihood is at a maximum when:

$$ \frac{\partial\chi^2}{\partial m} = 0 \qquad\text{and}\qquad \frac{\partial\chi^2}{\partial b} = 0. $$
This means

$$ \frac{\partial\chi^2}{\partial m} = \sum_{i=1}^{N} 2\,x_i\left[(m\,x_i + b) - y_i\right] = 0
\qquad\text{and}\qquad
\frac{\partial\chi^2}{\partial b} = \sum_{i=1}^{N} 2\left[(m\,x_i + b) - y_i\right] = 0, $$

or, rearranging,

$$ m\sum_{i=1}^{N} x_i^2 + b\sum_{i=1}^{N} x_i = \sum_{i=1}^{N} x_i y_i
\qquad\text{and}\qquad
m\sum_{i=1}^{N} x_i + b\sum_{i=1}^{N} 1 = \sum_{i=1}^{N} y_i. $$
For the purposes of this discussion, it is convenient to write this pair of simultaneous equations as a matrix equation. Since $\sum_{i=1}^{N} 1 = N$, we have

$$ \begin{pmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & N \end{pmatrix}
\begin{pmatrix} m \\ b \end{pmatrix}
= \begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix}, $$
so

$$ \begin{pmatrix} m \\ b \end{pmatrix}
= \begin{pmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & N \end{pmatrix}^{-1}
\begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix}. $$
And since the inverse of a $2\times 2$ matrix is

$$ \begin{pmatrix} p & q \\ r & s \end{pmatrix}^{-1}
= \frac{1}{ps - qr}\begin{pmatrix} s & -q \\ -r & p \end{pmatrix}, $$
the algebraic solution to this set of equations is:

$$ \hat m = \frac{N\sum x_i y_i - \sum x_i \sum y_i}{N\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
\hat b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{N\sum x_i^2 - \left(\sum x_i\right)^2}. $$
The quantity given by

$$ \sigma(1) = \sqrt{\frac{\sum_{i=1}^{N}\epsilon_i^2}{N-2}} $$

is called the mean error of unit weight, and is a very useful measure of the scatter of your data. It is essentially the root-mean-square value of your observational errors, $\epsilon_i$. (The $\epsilon_i$ are sometimes also referred to as the residuals of your $y_i$ values, but note that these are usually considered to have the opposite sign: residual = observation $-$ model $= y_i - (m\,x_i + b)$. Of course, when you square them, the sign makes no difference.) The sum of the squares of these residuals is divided by $N - 2$ instead of $N$ to eliminate a bias in the estimate of the scatter: any two points define a straight line perfectly, so if there are only two data points the residuals from the "best" fit will always be zero, regardless of the true errors in the $y_i$. It works similarly for other values of $N$: since we have used our data points to determine two previously unknown quantities, the observed scatter of the data about the best fit will always be a little bit less than the actual scatter about the true relationship. Dividing by $N - 2$ instead of $N$ completely removes this bias; you can find a mathematical proof of this statement in a text on the theory of statistics.
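In code, the closed-form solution and the mean error of unit weight can be computed directly from five running sums. A minimal sketch (the function name is hypothetical; no fitting library is assumed):

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept from the normal-equation sums,
    plus the mean error of unit weight (illustrative sketch)."""
    N = len(x)
    Sx = sum(x)
    Sy = sum(y)
    Sxx = sum(xi * xi for xi in x)
    Sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = N * Sxx - Sx * Sx  # determinant of the normal-equation matrix
    m = (N * Sxy - Sx * Sy) / det
    b = (Sxx * Sy - Sx * Sxy) / det
    # mean error of unit weight: rms residual with N - 2 in the denominator
    chi2 = sum((m * xi + b - yi) ** 2 for xi, yi in zip(x, y))
    sigma1 = math.sqrt(chi2 / (N - 2))
    return m, b, sigma1
```

For data lying exactly on a line the mean error of unit weight comes out zero; real data give a value comparable to the typical measurement error.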

If you had trouble following my derivation of the least-squares fit of a straight line, let's look at a couple of even simpler examples to show how the general principle of minimizing the sum of the squares of the residuals can be used in real life. For the first example, let us assume that we have a set of observed values $y_i$ which are supposed to equal some constant. Of course, because of observational errors the various $y_i$ will not have a single, constant value, but instead will show some amount of scatter. Let's derive the least-squares estimator of the constant which "best" fits the observed $y_i$. In the same notation as above, we want to solve a least-squares problem of the form

$$ y_i \doteq b, \qquad i = 1, \dots, N, $$
where $b$ is the desired constant. Here are the steps:

$$ \chi^2 = \sum_{i=1}^{N}(b - y_i)^2, \qquad
\frac{d\chi^2}{db} = \sum_{i=1}^{N} 2\,(b - y_i) = 0
\quad\Longrightarrow\quad
\hat b = \frac{1}{N}\sum_{i=1}^{N} y_i. $$
In other words, the least-squares estimator of the constant parameter $b$ which best fits some set of data $y_i$ is simply the arithmetic mean of those $y_i$. This is why we have all grown up always taking the mean of a bunch of independent measurements: it can be shown mathematically to be the "best" way of extracting an estimate of some constant from error-prone data, provided that the observational errors follow a Gaussian distribution.
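Both halves of this claim can be confirmed numerically: the least-squares constant is the mean, and no nearby constant gives a smaller sum of squared residuals. A small sketch with hypothetical helper names:

```python
def chi2_constant(b, y):
    # sum of squared fitting errors for the constant model y_i = b
    return sum((b - yi) ** 2 for yi in y)

def best_constant(y):
    # least-squares estimator of a constant: the arithmetic mean
    return sum(y) / len(y)
```

Evaluating chi2_constant at best_constant(y) never gives a larger value than at any other trial constant.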

Now let's look at a slightly different problem. Suppose you know that your measured quantities $y_i$ should be linearly related to other quantities $x_i$ (which are known to infinite precision), but you also know that the best-fitting line must pass through the origin: $y = 0$ when $x = 0$. One astronomical example of this kind of problem is the calibration of the Hubble law for the expansion of the Universe: for some set of distant galaxies we measure the radial velocities of recession and estimate their distances, and then fit a straight line giving distance as a function of radial velocity (for the purposes of the present discussion, I assume that the radial velocities may be regarded as of infinite precision in comparison with the distances). Since our model assumes that the recession velocity must be zero at zero distance, we have to solve a problem of the form

$$ y_i \doteq m\,x_i, \qquad i = 1, \dots, N. $$
Let's do it:

$$ \chi^2 = \sum_{i=1}^{N}(m\,x_i - y_i)^2, \qquad
\frac{d\chi^2}{dm} = \sum_{i=1}^{N} 2\,x_i\,(m\,x_i - y_i) = 0
\quad\Longrightarrow\quad
\hat m = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}. $$
This is the least-squares formula for determining the slope of a line which is forced to pass through the origin. It is just as fundamental as using the arithmetic mean to obtain the best estimate of a constant, but is less familiar to most of us.
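As a code sketch (the function name is hypothetical), the through-the-origin slope reduces to a single ratio of sums:

```python
def slope_through_origin(x, y):
    # least-squares slope for the model y = m * x with zero intercept:
    # m_hat = sum(x_i * y_i) / sum(x_i ** 2)
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
```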