The information criteria have a deep underpinning in the theory of statistical inference, but fortunately have a very simple expression. The key aim is to make an objective comparison of different models (here interpreted as different selections of cosmological parameters to vary) which may feature different numbers of parameters. Usually in cosmology a basic selection of `essential' parameters is considered, to which additional parameters might be added to make a more general model. It is assumed that the models will be compared to a fixed dataset using a likelihood method.
Typically, the introduction of extra parameters will allow an improved fit to the dataset, regardless of whether or not those new parameters are actually relevant.[1] A simple comparison of the maximum likelihood of different models will therefore always favour the model with the most parameters. The information criteria compensate for this by penalizing models which have more parameters, offsetting any improvement in the maximum likelihood that the extra parameters might allow.
The simplest procedure to compare models is the likelihood ratio test (Kendall & Stuart 1979, ch. 24), which can be applied when the simple model is nested within a more complex model. The quantity $-2\ln(\mathcal{L}_{\rm simple}/\mathcal{L}_{\rm complex})$, where $\mathcal{L}$ is the maximum likelihood of the model under consideration, is approximately chi-squared distributed, and standard statistical tables can be used to look up the significance of any increase in likelihood against the number of extra parameters introduced. However, the assumptions underlying the test are often violated in astrophysical situations (Protassov et al. 2002). Further, one is commonly interested in comparing models which are not nested.
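As a minimal sketch of the likelihood ratio test just described, the following Python evaluates the test statistic for a pair of nested models and looks up its significance. It assumes the statistic is chi-squared distributed with as many degrees of freedom as there are extra parameters, and it uses the closed-form chi-squared tail probability for integer degrees of freedom so that only the standard library is needed. The function names and the log-likelihood values are hypothetical, chosen purely for illustration.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-squared variable with integer dof."""
    if dof < 1:
        raise ValueError("dof must be a positive integer")
    if x <= 0.0:
        return 1.0
    half = 0.5 * x
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) plus a finite series in x^(r - 1/2) / (1*3*...*(2r-1))
    term, total = math.sqrt(x), 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def likelihood_ratio_test(lnL_simple, lnL_complex, extra_params):
    """Return (statistic, p-value) for a simple model nested in a complex one."""
    # -2 ln(L_simple / L_complex), clipped at zero against numerical noise
    stat = max(2.0 * (lnL_complex - lnL_simple), 0.0)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical maximum log-likelihoods; the complex model has one extra parameter.
stat, p_value = likelihood_ratio_test(-720.0, -717.0, extra_params=1)
print(f"statistic = {stat:.2f}, p-value = {p_value:.4f}")
```

A small p-value indicates that the improvement in likelihood is unlikely to arise by chance if the simple model is correct, subject to the caveats on the validity of the test noted above.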
The Akaike information criterion (AIC) is defined as
$$\mathrm{AIC} = -2\ln\mathcal{L} + 2k,$$
where $\mathcal{L}$ is the maximum likelihood and $k$ the number of parameters of the model (Akaike 1974). The best model is the model which minimizes the AIC, and there is no requirement for the models to be nested. Typically, models with too few parameters give a poor fit to the data and hence have a low log-likelihood, while those with too many are penalized by the second term. The form of the AIC comes from an approximate minimization of the Kullback-Leibler information entropy, which measures the difference between the true distribution and the model distribution; an explanation geared to astronomers can be found in Takeuchi (2000), while the full statistical justification can be found in Sakamoto et al. (1986) and Burnham & Anderson (2002).
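A minimal sketch of how the AIC, $-2\ln\mathcal{L} + 2k$, might be evaluated and used to rank models is given below. The model names and maximum log-likelihood values are hypothetical. Since each extra parameter adds 2 to the criterion, an added parameter is favoured only if it raises the maximum log-likelihood by more than 1.

```python
def aic(max_log_likelihood, n_params):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * max_log_likelihood + 2.0 * n_params


# Hypothetical fits: name -> (maximum log-likelihood, number of parameters)
fits = {
    "base": (-720.0, 5),
    "extended": (-718.5, 6),
}

aic_values = {name: aic(lnL, k) for name, (lnL, k) in fits.items()}
preferred = min(aic_values, key=aic_values.get)  # the lowest AIC is preferred

for name, value in aic_values.items():
    print(f"{name}: AIC = {value:.1f}")
print("preferred by AIC:", preferred)
```

Here the extra parameter improves the log-likelihood by 1.5, which outweighs the penalty, so the extended model has the lower AIC.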
The Bayesian information criterion (BIC) was introduced by Schwarz (1978), and can be defined as
$$\mathrm{BIC} = -2\ln\mathcal{L} + k\ln N,$$
where $N$ is the number of datapoints used in the fit (in current cosmological applications, this will be of order one thousand). It comes from approximating the Bayes factor (Jeffreys 1961; Kass & Raftery 1995), which gives the posterior odds of one model against another presuming that the models are equally favoured prior to the data fitting. Although expressed in terms of the maximum likelihood, it is therefore related to the integrated likelihood.
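The link to the Bayes factor can be sketched as follows, taking the BIC to be $-2\ln\mathcal{L} + k\ln N$ and assuming the large-$N$ approximation to the integrated likelihood in which terms of order unity are dropped. The symbols $E$ (the integrated likelihood of a model) and $B_{12}$ (the Bayes factor of model 1 against model 2) are labels introduced here for illustration.

```latex
\ln E \simeq \ln\mathcal{L} - \frac{k}{2}\ln N
\quad\Longrightarrow\quad
\mathrm{BIC} \simeq -2\ln E ,
\qquad
2\ln B_{12} = 2\ln\frac{E_1}{E_2} \simeq \mathrm{BIC}_2 - \mathrm{BIC}_1 .
```

In this approximation a difference in BIC between two models can be read directly as twice the logarithm of their Bayes factor.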
It is unfortunate that there are different information criteria in the literature, which forces one to ask which is better. Extensive Monte Carlo testing has indicated that the AIC tends to favour models which have more parameters than the true model (see e.g. Harvey 1993; Kass & Raftery 1995). Formally, this was recognized in a proof that the AIC is `dimensionally inconsistent' (Kashyap 1980), meaning that even as the size of the dataset tends to infinity, the probability of the AIC incorrectly picking an overparametrized model does not tend to zero. By contrast, the BIC is dimensionally consistent, as the second term in its definition penalizes overparametrized models ever more harshly as the dataset increases in size, and hence the BIC does always pick the correct model for large datasets. Burnham & Anderson (2002) generally favour the AIC, but note that the BIC is well justified whenever the complexity of the true model does not increase with the size of the dataset and provided that the true model can be expected to be amongst the models considered, which one can hope is the case in cosmology. Accordingly, it seems that the BIC should ordinarily be preferred. Note though that for any likely dataset $\ln N > 2$, and hence the AIC is always more generous towards extra parameters than the BIC. Hence the AIC remains useful as it gives an upper limit to the number of parameters which should be included.
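The different behaviour of the two criteria can be illustrated with a short sketch. It takes the AIC penalty per parameter to be 2 and the BIC penalty to be $\ln N$, so the BIC is the stricter criterion whenever $N > e^2 \approx 7.4$. The three models, their maximum log-likelihoods, and the choice $N = 1000$ are hypothetical values chosen so that the two criteria disagree.

```python
import math


def aic(max_log_likelihood, n_params):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * max_log_likelihood + 2.0 * n_params


def bic(max_log_likelihood, n_params, n_data):
    """Bayesian information criterion: -2 ln L + k ln N."""
    return -2.0 * max_log_likelihood + n_params * math.log(n_data)


N_DATA = 1000  # hypothetical number of datapoints

# Hypothetical fits: name -> (maximum log-likelihood, number of parameters)
models = {
    "base": (-720.0, 5),
    "base + 1 extra": (-719.5, 6),
    "base + 2 extra": (-716.0, 7),
}

aic_scores = {name: aic(lnL, k) for name, (lnL, k) in models.items()}
bic_scores = {name: bic(lnL, k, N_DATA) for name, (lnL, k) in models.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)

print(f"penalty per parameter: AIC = 2, BIC = {math.log(N_DATA):.2f}")
print("preferred by AIC:", best_aic)
print("preferred by BIC:", best_bic)
```

With these numbers the AIC selects the seven-parameter model while the BIC retains the base model, consistent with the AIC acting as an upper limit on the number of parameters.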
In either case, the absolute value of the criterion is not of interest, only the relative value between different models. A difference of 2 for the BIC is regarded as positive evidence, and of 6 or more as strong evidence, against the model with the larger value (Jeffreys 1961; Mukherjee et al. 1998).
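The thresholds quoted above can be turned into a small helper, sketched here. The function name and the label used for differences below 2 are assumptions made for illustration; the text only specifies the `positive' and `strong' categories.

```python
def bic_evidence(bic_a, bic_b):
    """Classify the evidence against whichever model has the larger BIC.

    Returns (delta, strength, disfavoured), where disfavoured is "A" or "B".
    """
    delta = abs(bic_a - bic_b)
    disfavoured = "A" if bic_a > bic_b else "B"
    if delta >= 6.0:
        strength = "strong"
    elif delta >= 2.0:
        strength = "positive"
    else:
        # Assumption: differences below 2 are treated as inconclusive.
        strength = "inconclusive"
    return delta, strength, disfavoured


# Hypothetical BIC values for two models, A and B.
delta, strength, disfavoured = bic_evidence(1474.5, 1480.4)
print(f"Delta BIC = {delta:.1f}: {strength} evidence against model {disfavoured}")
```

Only the difference enters the classification, reflecting the point that the absolute value of the criterion carries no meaning.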
The rather limited literature on cosmological model selection has thus far not used the information criteria, but has instead used the more sophisticated idea of Bayesian evidence (see e.g. Jaynes 2003), which compares the total posterior likelihoods of the models. This requires an integral of the likelihood over the whole model parameter space, which may be lengthy to calculate, but avoids the approximations used in the information criteria and also permits the use of prior information if required. It has been used in a variety of cosmological contexts by Jaffe (1996), Drell, Loredo & Wasserman (2000), John & Narlikar (2002), Hobson, Bridle & Lahav (2002), Slosar et al. (2003), Saini, Weller & Bridle (2003), and Niarchou, Jaffe & Pogosian (2003).
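To make the evidence integral concrete, here is a toy sketch in one dimension. It assumes a hypothetical Gaussian likelihood for a single extra parameter and a flat prior over a wide range, and compares the integrated likelihood of the extended model with that of a base model in which the parameter is fixed at zero. All numerical values are invented for illustration, and a realistic cosmological calculation would involve a multi-dimensional integral.

```python
import math


def likelihood(theta, theta_hat=0.3, sigma=0.2):
    """Hypothetical Gaussian likelihood, normalized to 1 at its peak."""
    return math.exp(-0.5 * ((theta - theta_hat) / sigma) ** 2)


def evidence_flat_prior(lo, hi, n_steps=20000):
    """Trapezoidal integral of likelihood times a flat prior 1 / (hi - lo)."""
    h = (hi - lo) / n_steps
    total = 0.5 * (likelihood(lo) + likelihood(hi))
    for i in range(1, n_steps):
        total += likelihood(lo + i * h)
    return total * h / (hi - lo)


# Extended model: the parameter is free, with a flat prior on [-5, 5].
evidence_extended = evidence_flat_prior(-5.0, 5.0)

# Base model: the parameter is fixed at zero, so there is nothing to integrate.
evidence_base = likelihood(0.0)

bayes_factor = evidence_extended / evidence_base
print(f"evidence (extended) = {evidence_extended:.4f}")
print(f"evidence (base)     = {evidence_base:.4f}")
print(f"Bayes factor (extended / base) = {bayes_factor:.3f}")
```

Although the extended model attains the higher maximum likelihood, averaging over the wide prior dilutes its evidence, so in this example the Bayes factor is below one and the base model is preferred.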
[1] In cosmology, a new parameter will usually be a quantity set to zero in the simpler base model, and as the likelihood is a continuous function of the parameters, it will increase as the parameter varies in either the positive or negative direction. However some parameters are restricted to positive values (e.g. the amplitude of tensor perturbations), and in that case it may be that the new parameter does not improve the maximum likelihood.