4.3 Tests for Comparison of Two Independent Samples
Now suppose we have two samples. We want to know whether they could have been drawn from the same population, or from different populations, and, if the latter, whether they differ in some predicted direction. Again assume we know nothing about probability distributions, so that we need non-parametric tests. There are several.
Fisher exact test. The test is for two independent small samples for which discrete binary data are available, e.g. scores from the two samples fall in two mutually exclusive bins yielding a 2 x 2 contingency table as shown in Table II.
H0 is that the assignment of ``scores'' is random.
Compute the statistic
\[
p = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{N!\,A!\,B!\,C!\,D!} ,
\]
where A, B, C and D are the cell counts of Table II and N = A + B + C + D.
This is the probability that the total of N scores could be arranged as they
are when the two samples are in fact identical; but H0 asks,
what is the probability of occurrence of the observed outcome or one
more extreme? By the laws of probability p_tot = p_1 + p_2 + . . ., where
p_1, p_2, . . . represent the values of p for all more extreme
arrangements of the contingency table (Siegel & Castellan 1988).
This is the best test for small samples; and if N < 20, it is
probably the only test to use.
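The summation over the observed and all more extreme tables can be sketched in a few lines. The following is a minimal plain-Python illustration, not from the text; the function name and the example counts in the test are my own, and the layout follows Table II.

```python
from math import comb

def fisher_exact_one_tailed(a, b, c, d):
    """One-tailed Fisher exact probability for a 2 x 2 table laid out
    as in Table II (columns = samples, rows = categories):
        a  c
        b  d
    Sums the hypergeometric probability of the observed table and of
    every more extreme table (larger a) with the same margins."""
    n = a + b + c + d
    r1, k1 = a + c, a + b          # the two margins containing cell a
    p = 0.0
    for x in range(a, min(r1, k1) + 1):
        # candidate table with margins fixed; all cells must be >= 0
        cells = (x, k1 - x, r1 - x, n - r1 - k1 + x)
        if min(cells) < 0:
            break
        p += comb(r1, x) * comb(n - r1, k1 - x) / comb(n, k1)
    return p
```

Starting the sum at the smallest admissible a (here a = 0) recovers the whole hypergeometric distribution, so the probabilities sum to one; that is a convenient sanity check on any implementation.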
Chi-square two-sample (or k-sample) test. Again the much-loved χ²
test is applicable. All the previous shortcomings apply, but for data
that are not on a numerical scale, there may be no alternative. To
begin with, each sample is binned in the same r bins (a k
x r contingency table - see Table III).
H0 is that the k samples are from the same population.
Then compute
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} .
\]
The E_ij are the expectation values, computed from the marginal totals:
\[
E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{N} ,
\]
where O_i· and O_·j are the row and column totals, i.e. (row total) x (column total)/N.
Under H0 this is distributed as χ² with df = (r - 1)(k - 1).
Note that there is a modification for the 2 x 2 contingency table with
N objects (Table II). In this case,
\[
\chi^2 = \frac{N \left( |AD - BC| - N/2 \right)^2}{(A+B)(C+D)(A+C)(B+D)} ,
\]
with one degree of freedom; the N/2 term is a continuity correction.
The usual χ² caveat
applies - beware of the dreaded number 5, below
which the expected cell populations E_ij should not fall. If they do, combine adjacent
cells, or abandon the test. And if there are only 2 x 2 cells, the total
(N) must exceed 30; if not, use the Fisher exact probability test.
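As an illustration of the computation, here is a minimal sketch in plain Python; the function name and the example table are my own, not from the text.

```python
def chi_square_contingency(table):
    """General chi-square statistic for an r x k contingency table
    (rows = categories, columns = samples), as in Table III.
    Expectations are E_ij = (row total)(column total)/N.
    Returns (chi2, degrees of freedom).  No continuity correction is
    applied, so for a 2 x 2 table the modified formula in the text
    should be preferred."""
    r, k = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(row[j] for row in table) for j in range(k)]
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            e = row_tot[i] * col_tot[j] / n   # expectation under H0
            chi2 += (table[i][j] - e) ** 2 / e
    return chi2, (r - 1) * (k - 1)
```

The "dreaded number 5" caveat is easy to check in passing: the quantities e computed in the inner loop are exactly the E_ij that should not fall below 5.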
There is one further distinctive feature of the χ² test (and the
2 x 2 contingency table test): it may be used to test a directional
alternative to H0, i.e. H1 can be that the two groups differ in some
predicted sense. If the alternative to H0 is directional, then use
Table A III in the normal way and halve the probabilities at the heads
of the columns, since the test is now one-tailed. For degrees of
freedom > 1, the χ² test is insensitive to order, and another test may
thus be preferable. One that almost always is preferable is the following.
Mann-Whitney (Wilcoxon) U test. There are two samples, A
(m members)
and B (n members); H0 is that A
and B are from the same distribution
or have the same parent population, while H1 may be
one of three possibilities:
that A is stochastically larger than B;
that B is stochastically larger than A; or
that A and B differ.
The first two hypotheses are directional, resulting in one-tailed
tests; the third is not and is correspondingly a two-tailed test. To
proceed, first decide on H1 and of course the significance level α. Then:
(1) Rank in ascending order the combined sample A + B, preserving the
A or B identity of each member. Tied observations are assigned the
average of the tied ranks.
(2) Depending on the choice of H1, sum the rankings of the A members
to get U_A or, vice versa, the rankings of the B members to get U_B.
Note that if N = m + n,
\[
U_A + U_B = \frac{N(N+1)}{2} ,
\]
so that only one summation is necessary to determine both - but a
decision on H1 should have been made a priori.
The sampling distribution of U is known (of
course, or there would not be a test);
Table A VI, columns labelled
c_u (upper-tail
probabilities), presents the exact probability associated with the
occurrence (under H0) of values of U greater
than that observed. The
table also presents exact probabilities associated with values of U
less than those observed; entries correspond to the columns labelled
c_l (lower-tail probabilities). The table is arranged
for m ≤ n, which
presents no restriction in that group labels may be interchanged. What
does present a restriction is that the table gives values only for
m ≤ 4 and n ≤ 10. For samples up
to m = 10 and n = 12, see
Siegel & Castellan (1988).
For still larger samples, the sampling distribution
of U_A tends to the Normal distribution with mean
µ_A = m(N + 1)/2 and
variance σ_A² = mn(N + 1)/12. Significance can be assessed from the
Normal distribution, Table I
of Paper I, by calculating
\[
z = \frac{U_A \pm 0.5 - \mu_A}{\sigma_A} ,
\]
where +0.5 corresponds to considering probabilities of U ≤ that
observed (lower tail), and -0.5 to U ≥ that observed
(upper tail). If the two-tailed (``the samples are distinguishable'')
test is required, simply double the probabilities as determined from
either Table A VI (small samples) or the
Normal-distribution approximation (large samples).
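The ranking procedure and the Normal approximation can be sketched as follows. This is a plain-Python illustration using the rank-sum convention of this section; the function name and the continuity-correction handling are my own.

```python
from math import sqrt

def rank_sum_z(a, b, tail='upper'):
    """Mann-Whitney (Wilcoxon) statistic in the convention used here:
    U_A is the sum of the ranks of sample A in the combined ranking,
    tied observations receiving the average of the tied ranks.
    Returns (U_A, z) from the large-sample Normal approximation with
    mean m(N+1)/2 and variance mn(N+1)/12."""
    m, n = len(a), len(b)
    N = m + n
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ua = 0.0
    i = 0
    while i < N:
        j = i
        while j < N and pooled[j][0] == pooled[i][0]:
            j += 1                  # pooled[i:j] share the same value
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        ua += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mu = m * (N + 1) / 2
    sigma = sqrt(m * n * (N + 1) / 12)
    cc = 0.5 if tail == 'lower' else -0.5   # continuity correction
    return ua, (ua + cc - mu) / sigma
```

Note that library routines such as scipy.stats.mannwhitneyu use a different (shifted) definition of U, so the raw statistics differ even though the resulting probabilities agree.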
An example application of the test is shown in
Fig. 7, which
presents magnitude distributions for flat and steep (radio) spectrum
QSOs. H1 is that the flat-spectrum QSOs extend to
significantly lower
(brighter) magnitudes than do the steep-spectrum QSOs, a claim made
earlier by several observers. The eye agrees with H1,
and so does the result from the U test, in which we found
U = 719, z = 2.69, rejecting
H0 in favour of H1 at the 0.004
level of significance.
Fig. 7. An application of the
Mann-Whitney-Wilcoxon U test. The
frequency distributions are magnitude histograms for a complete sample
of QSOs from the Parkes 2.7-GHz survey
(Masson & Wall 1977),
(a) steep-spectrum objects, (b) flat-spectrum objects. H1
is that the
flat-spectrum QSOs have stochastically smaller (brighter) magnitudes
than the steep-spectrum QSOs. U = 719, z = 2.69;
H0 is rejected at the 0.004 level of significance.
In addition to its versatility, the test has the further advantage of
being applicable to small samples. In fact it is one of the most
powerful non-parametric tests; the efficiency in comparison with the
``Student's'' t test is 95
per cent for even moderate-sized
samples. It is therefore an obvious alternative to the chi-square
test, particularly for small samples where the chi-square test is
illegal, and when directional testing is desired. An alternative is
the following:
Kolmogorov-Smirnov two-sample test. The formulation of this test
parallels the Kolmogorov-Smirnov one-sample test; it considers the
maximum deviation between the cumulative distributions of two samples
with m and n members. H0 is (again) that
the two samples are from the
same population, and H1 can be that they differ
(two-tailed test) or
that they differ in a specific direction (one-tailed test).
To implement the test, refer to the procedure for the one-sample
test; merely exchange the cumulative distributions S_e
and S_0 for S_m
and S_n corresponding to the two samples.
Critical values of D are given in
Tables A VII and
A VIII.
Table A VII
gives the values for small samples, one-tailed test, while
Table A VIII
is for the two-tailed test. For large samples, two-tailed test, use
Table A IX. For large samples, one-tailed
test, compute
\[
\chi^2 = 4 D^2 \, \frac{mn}{m + n} ,
\]
which has a sampling distribution approximated by chi-square with two
degrees of freedom. Then consult Table A III
to see if the observed D
results in a value of χ² large enough to reject H0 in favour of
H1 at the desired level of significance.
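The two-sample statistic can be sketched as follows. This is a plain-Python illustration, not from the text; it computes the two-tailed D (for the one-tailed variant, the absolute value would be replaced by the signed deviation in the predicted direction).

```python
import bisect

def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    deviation D between the empirical cumulative distributions S_m
    and S_n, together with the large-sample one-tailed statistic
    chi2 = 4 D^2 mn/(m+n), approximately chi-square with 2 df."""
    m, n = len(x), len(y)
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:   # the maximum is attained at a data point
        sm = bisect.bisect_right(xs, v) / m   # S_m(v)
        sn = bisect.bisect_right(ys, v) / n   # S_n(v)
        d = max(d, abs(sm - sn))
    chi2 = 4 * d * d * m * n / (m + n)
    return d, chi2
```

As a consistency check, the worked example that follows (D = 0.068, m = 290, n = 385) gives χ² ≈ 3.06 by this formula.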
An example is shown in Fig. 8. The frequency
distributions show the
strengths of radio emission detected at particular positions in the
sky. There was good (a priori!) reason to suspect that the positions
observed in the larger sample should result in greater detected radio
flux than the random positions of the smaller sample. The eye (a
posteriori!) suggests that this might not actually be
so. H1 is that
the distributions differ in the sense that observations of the larger
sample stochastically exceed those of the smaller: H0 = no
difference. The Kolmogorov-Smirnov test yielded D = 0.068 for
m = 290,
n = 385. Hence χ² = 3.06 for 2 degrees of freedom; the associated
probability is 0.25, a very boring number, and quite inadequate to
allow rejection of H0 in favour of H1.
The test is extremely powerful with an efficiency (compared to the t
test) of > 95 per cent for small samples, decreasing somewhat for
larger samples. The efficiency always exceeds that of the chi-square
test, and slightly exceeds that of the U test for very small
samples. For larger samples, the converse is true, and the U test is
to be preferred.
Table II. The 2 x 2 contingency table.

                Sample 1    Sample 2
Category 1         A           C
Category 2         B           D

Table III. The k x r contingency table.

                 Sample j = 1     2      3     . . .
Category i = 1          O_11    O_12   O_13    . . .
         i = 2          O_21    O_22   O_23    . . .
         i = 3          O_31    O_32   O_33    . . .
         i = 4          O_41    O_42   O_43    . . .
         i = 5          O_51    O_52   O_53    . . .
         . . .          . . .   . . .  . . .   . . .