3.2 The Maximum-Likelihood (ML) Method
The maximum-likelihood method also has a long history: it was derived by Bernoulli in 1776 and by Gauss around 1821, and worked out in detail by Fisher in 1912.
Consider the probability density function $F(x;\theta)$, where $x$ is a random variable and $\theta$ is a single parameter characterizing the known form of $F$. We want to estimate $\theta$. Let $x_1, x_2, \ldots, x_N$ be a random sample of size $N$, the $x_i$ independent and drawn from the same population. Then the so-called ``likelihood function'' is the joint probability density function
\[
L(\theta) = \prod_{i=1}^{N} F(x_i;\theta) .
\]
This is the probability, given $\theta$, of obtaining the observed set of results. The maximum-likelihood estimator (MLE) of $\theta$ is $\hat\theta$, that value of $\theta$ which maximizes $L(\theta)$ for all variations of $\theta$, i.e.
\[
\left. \frac{\partial L(\theta)}{\partial \theta} \right|_{\theta = \hat\theta} = 0 .
\]
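As a concrete illustration of this definition, consider an exponential pdf, for which the MLE has a closed form. This is a minimal sketch with assumed values (the true rate, sample size and random seed are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sample (assumed values): N draws from the exponential pdf
# F(x; lam) = lam * exp(-lam * x)
lam_true, N = 2.0, 1000
x = rng.exponential(scale=1.0 / lam_true, size=N)

# lnL(lam) = N ln(lam) - lam * sum(x); setting dlnL/dlam = 0 gives the
# closed-form MLE lam_hat = N / sum(x) = 1 / mean(x)
lam_hat = 1.0 / x.mean()

# Numerical cross-check: lnL on a fine grid peaks at (nearly) the same value
lams = np.linspace(0.5, 4.0, 3501)
lnL = N * np.log(lams) - lams * x.sum()
lam_grid = lams[np.argmax(lnL)]
print(lam_hat, lam_grid)
```

The grid maximum and the analytic estimator agree to within the grid spacing, as the definition requires.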
Note that traditionally it is $\ln L$ that is maximized; since the logarithm is monotonic, the maximum occurs at the same $\hat\theta$, and the product becomes a more tractable sum. It should be pointed out that the maximum of the function cannot always be determined by this method - finding the ML time of occurrence of a singular observed event is a case in point.
The MLE is a statistic with many highly desirable properties - it is efficient, usually unbiased, it has minimum variance, and it is asymptotically Normally distributed.
If the residuals are Normally distributed, then minimizing the sum
of squares (Section 3.1) is the
maximum-likelihood method.
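This equivalence is easy to verify numerically. The sketch below (all values assumed for illustration) fits a constant level $a$ to data with Normal residuals of known scatter, and shows that the grid value minimizing the sum of squares is exactly the one maximizing the Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative data (assumed values): a constant level plus Normal residuals
a_true, sigma = 3.0, 0.5
y = a_true + rng.normal(scale=sigma, size=200)

a_grid = np.linspace(2.0, 4.0, 2001)

# Sum of squared residuals for each trial level a
ssq = ((y[None, :] - a_grid[:, None]) ** 2).sum(axis=1)

# Gaussian log-likelihood with sigma known: lnL = const - ssq / (2 sigma**2),
# a strictly decreasing function of ssq, so the two criteria must agree
lnL = -ssq / (2.0 * sigma ** 2)

a_ls = a_grid[np.argmin(ssq)]
a_ml = a_grid[np.argmax(lnL)]
print(a_ls, a_ml)
```

Because $\ln L$ differs from the (negative) sum of squares only by a constant factor and offset, the two estimates coincide by construction.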
By way of example, Jauncey (1967) showed that ML was an excellent way of estimating the slope of the number-flux-density relation for extragalactic radio sources, and this particular application has made the technique familiar to astronomers. The source count is assumed to be of the power-law form
\[
N(>S) = k S^{-\gamma} ,
\]
where $N$ is the number of sources on a particular patch of sky with flux densities greater than $S$, $k$ is a constant and $\gamma$ is the exponent, or slope in the $\log N$--$\log S$ plane, which we wish to estimate. If we consider $M$ sources with flux densities $S_i$ in the range $S_0$ to $S_{\rm max}$, then a straightforward application of the ML procedure above yields the following likelihood function:
\[
L(\gamma) = \prod_{i=1}^{M} \frac{\gamma\, x_i^{-(\gamma+1)}}{1 - b^{-\gamma}} ,
\]
where
\[
x_i = S_i/S_0
\]
and
\[
b = S_{\rm max}/S_0 .
\]
Differentiation of $\ln L$ with respect to $\gamma$ then yields the equation from which $\hat\gamma$, the MLE of $\gamma$, is obtained:
\[
\frac{M}{\hat\gamma} - \frac{M b^{-\hat\gamma} \ln b}{1 - b^{-\hat\gamma}} = \sum_{i=1}^{M} \ln x_i .
\]
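The condition $d\ln L/d\gamma = 0$ is transcendental in $\gamma$, but a simple bisection solves it. In this sketch, $M$, $b$ and the data statistic $\sum \ln x_i$ are illustrative stand-ins, not values from the text:

```python
import numpy as np

# Illustrative stand-in values (not from the text): M sources, dynamic
# range b = Smax/S0, and the observed statistic sum(ln x_i)
M, b = 500, 100.0
sum_ln_x = 330.0

def mle_equation(gamma):
    """d(lnL)/d(gamma); its root is the MLE gamma_hat."""
    return (M / gamma
            - M * b ** -gamma * np.log(b) / (1.0 - b ** -gamma)
            - sum_ln_x)

# Bisection between brackets where the derivative changes sign
lo, hi = 0.1, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if mle_equation(lo) * mle_equation(mid) <= 0.0:
        hi = mid
    else:
        lo = mid
gamma_hat = 0.5 * (lo + hi)
print(f"gamma_hat = {gamma_hat:.3f}")
```

Any standard root-finder would serve equally well; bisection is used here only to keep the sketch self-contained.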
However, with a computer handy it is simplest to forget the differentiation and to evaluate $L(\gamma)$ over a wide range of $\gamma$ at small intervals $\Delta\gamma$. The maximum of $L$ yields $\hat\gamma$, while a good estimate of the standard deviation in $\hat\gamma$ is obtained from the two values of $\gamma$ at which $L$ has dropped by the factor $e^{1/2}$ from its maximum, the factor $e^{1/2}$ arising because the asymptotic distribution of $L$ is Gaussian. For large $M$ and $b$ (Jauncey 1967),
\[
\sigma_{\hat\gamma} \approx \frac{\hat\gamma}{\sqrt{M}} .
\]
This application of ML makes optimum use of the data in that the
sources are not grouped and the loss of power that always results from
binning is avoided.
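The grid-evaluation procedure described above - compute $\ln L(\gamma)$ on a fine grid, take the peak as $\hat\gamma$, and read off $\pm 1\sigma$ where $\ln L$ has fallen by $1/2$ - can be sketched as follows. The flux densities are synthetic, drawn from an assumed slope; all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed illustrative values: true slope, sample size, dynamic range b = Smax/S0
gamma_true, M, b = 1.5, 500, 100.0

# Draw M flux ratios x = S/S0 from the truncated power-law pdf
# p(x) = gamma * x**-(gamma + 1) / (1 - b**-gamma), 1 <= x <= b,
# by inverting its cumulative distribution
u = rng.uniform(size=M)
x = (1.0 - u * (1.0 - b ** -gamma_true)) ** (-1.0 / gamma_true)

# Evaluate ln L on a fine grid of gamma instead of differentiating
gammas = np.arange(0.5, 3.0, 0.001)
lnL = (M * np.log(gammas)
       - M * np.log(1.0 - b ** -gammas)
       - (gammas + 1.0) * np.log(x).sum())

i_max = np.argmax(lnL)
gamma_hat = gammas[i_max]

# +/- 1 sigma: the two gammas where L drops by the factor e**0.5,
# i.e. where lnL has fallen by 1/2 from its maximum
inside = lnL >= lnL[i_max] - 0.5
sigma = 0.5 * (gammas[inside][-1] - gammas[inside][0])

print(f"gamma_hat = {gamma_hat:.3f} +/- {sigma:.3f}")
```

Note that no binning of the sources is performed at any stage; each $x_i$ enters the likelihood individually.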
To return to general considerations: after the ML estimate has been obtained, it is essential to perform a final check - does the MLE model fit the data reasonably? If it does not, then either the data are erroneous (if the model is known to be right), or else the adopted or assumed model is wrong. There are many ways of carrying out such a check; two of these, the chi-square test and the Kolmogorov-Smirnov test, are described in Section 4.2.
The ML procedure may be generalized to obtain simultaneous MLEs of several parameters $\theta_1, \theta_2, \ldots, \theta_k$:
\[
\frac{\partial \ln L(\theta_1, \theta_2, \ldots, \theta_k)}{\partial \theta_i} = 0 , \qquad i = 1, 2, \ldots, k ,
\]
and the solution of these simultaneous equations yields the MLEs $\hat\theta_i$.
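As a two-parameter illustration, the Normal pdf yields simultaneous equations for $\mu$ and $\sigma$ with closed-form solutions. The sketch below (sample values assumed) checks them against a brute-force grid maximization of $\ln L$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sample (assumed values): N draws from a Normal population
data = rng.normal(loc=4.0, scale=2.0, size=1000)
N = data.size

# Solving dlnL/dmu = 0 and dlnL/dsigma = 0 simultaneously for a Normal pdf
# gives closed-form MLEs: the sample mean, and the rms deviation about it
mu_hat = data.mean()
sigma_hat = np.sqrt(((data - mu_hat) ** 2).mean())  # 1/N, not 1/(N-1)

# Cross-check by brute-force maximization of lnL over a (mu, sigma) grid,
# using sum(x - mu)**2 = S2 - 2*mu*S1 + N*mu**2 to avoid large arrays
S1, S2 = data.sum(), (data ** 2).sum()
mus = np.linspace(3.0, 5.0, 201)
sigmas = np.linspace(1.0, 3.0, 201)
MU, SIG = np.meshgrid(mus, sigmas)
lnL = -N * np.log(SIG) - (S2 - 2 * MU * S1 + N * MU ** 2) / (2 * SIG ** 2)
i, j = np.unravel_index(np.argmax(lnL), lnL.shape)
print(mu_hat, sigma_hat, MU[i, j], SIG[i, j])
```

The grid maximum lands on the grid point nearest the analytic estimators; note also that the ML variance estimator carries the $1/N$ factor, a reminder that MLEs are only usually unbiased.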
It is important to emphasize the ML principle, which is that when
confronted with the choice of hypotheses, choose that which maximizes
L, i.e. the one giving the highest probability to the observed
event. This sounds reasonable, but in fact the proofs of certain
theorems (see e.g.
Martin 1971)
concerning the ``goodness'' of MLEs are
required to justify the procedure.