**3. FUNDAMENTALS OF MULTIVARIATE ANALYSIS AND CLUSTERING**

*r*) or rank (Spearman's *ρ* or Kendall's *τ*) correlation coefficients
between pairs of variables.
Statisticians often divide each value by the sample standard deviation
for that variable (known as `standardizing' or `Studentizing' the
sample), while astronomers often take a log transform or consider the
ratio of two variables with the same units.
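
The two rescalings mentioned above can be sketched as follows, assuming NumPy is available. The function name `standardize` and the sample values are hypothetical. Subtracting the column mean before dividing is an added assumption (a common convention), not something the text specifies.

```python
import numpy as np

def standardize(X):
    """Center each column and divide it by its sample standard deviation.

    Centering is an assumption added here; the text only mentions the division.
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical sample: 5 objects, 2 variables on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 5000.0],
              [5.0, 4000.0]])

Z = standardize(X)      # each column now has unit sample variance
logX = np.log10(X)      # log transform; requires strictly positive values

# After standardization the covariance matrix equals the Pearson correlation matrix
R = np.cov(Z, rowvar=False)
```

One consequence worth noting: once the variables are standardized, the sample covariance matrix and the matrix of Pearson *r* values coincide, so the choice of scaling determines which of the two a later multivariate method effectively works with.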

*p*-space
which can have any form of structure, not just planar correlations
parallel to the axes. The sample covariance matrix *S* contains
information for this more general approach, and lies at the root of
many methods of multivariate analysis developed during the
1930s-60s. The method most widely used in astronomy is *principal
components analysis*. Here the first principal component is
*e*_{1}^{T} *X*, where *e*_{1} is the eigenvector of *S* with the largest eigenvalue.
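
A minimal sketch of principal components analysis as an eigen-decomposition of the sample covariance matrix *S*, assuming NumPy is available. The function name `principal_components` and the synthetic data are hypothetical. Ordering the eigenvectors of *S* by decreasing eigenvalue is the standard textbook definition, used here as an assumption.

```python
import numpy as np

def principal_components(X):
    """Return eigenvalues, eigenvectors (as columns) and component scores of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                # center each variable
    S = np.cov(Xc, rowvar=False)           # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                  # column 0 holds e_1^T X for each object
    return eigvals, eigvecs, scores

# Hypothetical data: two variables that lie close to the line x2 = 2 * x1
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

eigvals, eigvecs, scores = principal_components(X)
e1 = eigvecs[:, 0]   # direction of greatest variance, defined up to sign
```

In this sketch the variance of the first column of `scores` equals the largest eigenvalue, which is why the first component is described as the direction of greatest variance.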

*canonical analysis*, the variables are divided into two preselected
groups and the eigenvectors of the cross-sample covariance matrix
*S*_{11}^{-1/2} *S*_{12} *S*_{22}^{-1} *S*_{21} *S*_{11}^{-1/2} give the principal linear
relationships between the two sets of variables. This might be used to relate
stellar metallicity variables with kinematic variables to study
Galactochemical evolution, or stellar magnetic activity indicators
with bulk star properties to study dynamo theory.
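
The matrix product above can be sketched numerically as follows, assuming NumPy is available. The helper names `inv_sqrt` and `canonical_analysis` and the synthetic data are hypothetical. The interpretation of the eigenvalues as squared canonical correlations is the standard textbook result, stated here as an assumption rather than taken from the text.

```python
import numpy as np

def inv_sqrt(A):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def canonical_analysis(X1, X2):
    """Eigen-decompose S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} for two variable sets."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    p1 = X1.shape[1]
    S = np.cov(np.hstack([X1, X2]), rowvar=False)   # joint covariance matrix
    S11, S12 = S[:p1, :p1], S[:p1, p1:]
    S21, S22 = S[p1:, :p1], S[p1:, p1:]
    A = inv_sqrt(S11)
    M = A @ S12 @ np.linalg.inv(S22) @ S21 @ A
    M = (M + M.T) / 2.0                             # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]               # descending order
    return eigvals[order], eigvecs[:, order]

# Hypothetical data: the first variable of set 2 nearly copies the first of set 1,
# while the remaining variables are unrelated noise
rng = np.random.default_rng(1)
X1 = rng.normal(size=(300, 2))
X2 = np.column_stack([X1[:, 0] + 0.1 * rng.normal(size=300),
                      rng.normal(size=300)])

eigvals, eigvecs = canonical_analysis(X1, X2)
```

With these synthetic inputs the leading eigenvalue should be close to 1 (one strong linear relationship between the sets) and the second close to 0.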

*S* may
thus reflect heterogeneity of the sample, rather than astrophysical
processes within a homogeneous class. It is thus important to search
for groupings in *p*-space using multivariate algorithms. Dozens of
such methods have been proposed. Unfortunately, most are procedural
algorithms without formal statistics (*i.e.,* no probabilistic measures
of merit), and there is little mathematical guidance as to which method produces
`better' clusters.

*Hierarchical clustering* procedures produce small clusters within
larger clusters. One such procedure, the `percolation' or
`friends-of-friends' algorithm, is a favorite among astronomers. It is
called *single linkage clustering*, and is obtained by successively removing the
longest branches of the unique *minimal spanning tree* connecting
the *n* points in *p*-space. Single linkage tends to produce long,
stringy clusters. This
may be appropriate for galaxy clustering studies, but researchers in
other fields usually prefer *average or complete linkage* algorithms
which produce more compact clusters. The many varieties of
hierarchical clustering arise because the scientist must choose the
metric (*e.g.,* should the `distance' between objects be Euclidean or
squared?), weighting (*e.g.,* how is the average location of a cluster
defined?), and criteria for merging clusters (*e.g.,* should the total
variance or internal group variance be minimized?).
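
The single linkage construction described above (build the minimal spanning tree, then cut its longest branches) can be sketched as follows, assuming NumPy is available and a Euclidean metric. The function names, the use of Prim's algorithm to build the tree, and the toy points are all illustrative assumptions.

```python
import numpy as np

def minimal_spanning_tree(X):
    """Prim's algorithm on Euclidean distances; returns n-1 edges (i, j, length)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = D[0].copy()              # shortest known link from the tree to each point
    best_from = np.zeros(n, dtype=int)   # tree point providing that link
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best_dist)))
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        closer = D[j] < best_dist
        best_dist = np.where(closer, D[j], best_dist)
        best_from = np.where(closer, j, best_from)
    return edges

def single_linkage(X, n_clusters):
    """Cut the (n_clusters - 1) longest MST branches and label the pieces."""
    edges = sorted(minimal_spanning_tree(X), key=lambda e: e[2])
    keep = edges[:len(edges) - (n_clusters - 1)]
    parent = list(range(len(X)))         # union-find over the kept branches

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, _ in keep:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(X))]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Hypothetical points forming two well-separated groups
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
tree = minimal_spanning_tree(X)
labels = single_linkage(X, n_clusters=2)
```

Cutting at a fixed number of clusters is only one stopping rule; a friends-of-friends analysis would more typically cut every branch longer than a chosen linking length. For real work, `scipy.cluster.hierarchy.linkage` provides single, average and complete linkage.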

*k-means partitioning*. It finds the combination of *k* groups that
minimizes intragroup variance. However, it is necessary to specify *k*
in advance.
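
A minimal sketch of *k*-means partitioning, assuming NumPy is available. It uses Lloyd's iteration, a common heuristic that the text does not name and that converges only to a local minimum of the intragroup variance. The function name `k_means`, its parameters, and the toy points are hypothetical.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's iteration: alternate nearest-center assignment and center update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its group (keep it if the group is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical points forming two well-separated groups; k must be given in advance
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(X, k=2)
```

Because the result can depend on the starting centers, a common practice is to rerun with several seeds and keep the partition with the smallest total intragroup variance.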