Statistics and the Treatment of Experimental Data

1.4 The Covariance

Thus far we have considered only the simple case of single-variable probability distributions. In the more general case, the outcomes of a process may be characterized by several random variables, x, y, z, .... The process is then described by a multivariate distribution P(x, y, z, ...). An example is a playing card, which is described by two variables: its denomination and its suit.

For multivariate distributions, the mean and variance of each separate random variable x, y, ... are defined in the same way as before (except that the integration is over all variables). In addition, a third important quantity must be defined:

$$\operatorname{cov}(x, y) = E[(x - \mu_x)(y - \mu_y)] \tag{10}$$

where µ_x and µ_y are the means of x and y respectively. Equation (10) is known as the covariance of x and y, and it is defined for each pair of variables in the probability density. Thus, if we have a trivariate distribution P(x, y, z), there are three covariances: cov(x, y), cov(x, z) and cov(y, z).
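As a minimal sketch of the definition in Equation (10), the following evaluates all three pairwise covariances of a small discrete trivariate distribution. The distribution itself, the dictionary layout and the helper names `mean` and `covariance` are assumptions made for illustration only; for a discrete distribution the integral becomes a sum over outcomes.

```python
from itertools import combinations

def mean(joint, i):
    """Mean of the i-th variable; `joint` maps outcome tuples to probabilities."""
    return sum(p * outcome[i] for outcome, p in joint.items())

def covariance(joint, i, j):
    """cov = E[(x_i - mu_i)(x_j - mu_j)], summed over every outcome."""
    mu_i, mu_j = mean(joint, i), mean(joint, j)
    return sum(p * (o[i] - mu_i) * (o[j] - mu_j) for o, p in joint.items())

# Hypothetical trivariate distribution P(x, y, z): four equally likely outcomes.
joint = {
    (0, 0, 1): 0.25,
    (1, 1, 0): 0.25,
    (2, 2, 1): 0.25,
    (3, 3, 0): 0.25,
}

names = "xyz"
# One covariance per pair of variables: (x, y), (x, z), (y, z).
covariances = {
    (names[i], names[j]): covariance(joint, i, j)
    for i, j in combinations(range(3), 2)
}

for (a, b), value in covariances.items():
    print(f"cov({a}, {b}) = {value:+.4f}")
```

In this made-up example y always equals x, so cov(x, y) reduces to the variance of x, while z alternates between 0 and 1 and has a weak negative covariance with both.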

The covariance is a measure of the linear correlation between the two variables. This is more often expressed as the correlation coefficient, which is defined as

$$\rho = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} \tag{11}$$

where σ_x and σ_y are the standard deviations of x and y. The correlation coefficient varies between -1 and +1, where the sign indicates the sense of the correlation. If the variables are perfectly correlated linearly, then |ρ| = 1. If the variables are independent ⁽¹⁾, then ρ = 0. Care must be taken with the converse of this last statement, however. If ρ is found to be 0, then x and y can only be said to have no linear correlation; they need not be independent. It can be shown, in fact, that if x and y are related parabolically (e.g., y = x²) and x is distributed symmetrically about zero, then ρ = 0.
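The behaviour of the correlation coefficient can be checked numerically. The sketch below assumes a hypothetical set of five equally likely x values, symmetric about zero, and uses population formulas (division by n); the function name `correlation` is illustrative, not taken from the text.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient rho = cov(x, y) / (sigma_x * sigma_y).

    Each (x, y) pair is treated as equally likely (population formulas).
    """
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

# Assumed x values, symmetric about zero.
xs = [-2, -1, 0, 1, 2]

rho_linear = correlation(xs, [3 * x + 1 for x in xs])  # rising straight line
rho_anti = correlation(xs, [-x for x in xs])           # falling straight line
rho_parabola = correlation(xs, [x ** 2 for x in xs])   # y fully determined by x

print(f"y = 3x + 1: rho = {rho_linear:+.3f}")
print(f"y = -x:     rho = {rho_anti:+.3f}")
print(f"y = x^2:    rho = {rho_parabola:+.3f}")
```

The parabolic case gives a vanishing ρ even though y is completely determined by x, which is why ρ = 0 cannot be read as independence.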

¹ The mathematical definition of independence is that the joint probability is a separable function, i.e., P(x, y) = P₁(x) P₂(y).
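The separability condition can be illustrated with the playing-card example. The sketch below assumes a card drawn uniformly from a standard 52-card deck, with denominations coded 1 to 13; the helper names `marginals` and `is_separable` and the "short deck" counterexample are hypothetical constructions for illustration. Exact fractions are used so the equality test is not affected by rounding.

```python
from fractions import Fraction
from itertools import product

denominations = range(1, 14)  # 1 = ace, ..., 13 = king (assumed coding)
suits = ["clubs", "diamonds", "hearts", "spades"]

def marginals(joint):
    """Sum the joint probability over the other variable to get P1 and P2."""
    p1, p2 = {}, {}
    for (a, b), p in joint.items():
        p1[a] = p1.get(a, 0) + p
        p2[b] = p2.get(b, 0) + p
    return p1, p2

def is_separable(joint):
    """True if P(a, b) == P1(a) * P2(b) for every combination of a and b."""
    p1, p2 = marginals(joint)
    return all(joint.get((a, b), 0) == p1[a] * p2[b] for a in p1 for b in p2)

# Full deck: every (denomination, suit) pair has probability 1/52.
full_deck = {(d, s): Fraction(1, 52) for d, s in product(denominations, suits)}
deck_independent = is_separable(full_deck)

# Hypothetical counterexample: remove the ace of spades, leaving 51 cards.
short_deck = {
    (d, s): Fraction(1, 51)
    for d, s in product(denominations, suits)
    if (d, s) != (1, "spades")
}
short_independent = is_separable(short_deck)

print("full deck separable: ", deck_independent)
print("short deck separable:", short_independent)
```

For the full deck the joint probability 1/52 factors as (1/13)(1/4), so denomination and suit are independent; removing a single card breaks the factorization because P(ace, spades) = 0 while both marginal probabilities are non-zero.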