
B. Non-Gaussian error distributions

The second big problem we commonly run into with real astronomical data is the fact that the error distribution is usually not truly Gaussian. We all know that two- and three-sigma residuals are far more common than the rates (one time in twenty and one time in four hundred) predicted by the Gaussian, or "normal" error distribution. As you recall, we justified the whole least-squares approach by the fact that it gives the most likely set of answers, if the distribution of random errors is Gaussian.

The official reason why people always assume a Gaussian error distribution goes back to something called the Central Limit Theorem. The Central Limit Theorem says that whenever a measurement is subject to a very large number of very small errors, the probability distribution for the total error is driven toward the Gaussian distribution. This is true regardless of the form of the original probability distributions of the individual errors. A proof - and it is a pretty one - can be found in any book on the theory of statistics.

The real reason why people always assume a Gaussian error distribution is that, having made that assumption, we can then easily derive (and have derived!) exact mathematical formulae which allow us to compute directly the "best" values for the unknown parameters. This is not necessarily possible for other probability distributions. What would happen if, for instance, the error distribution for your data were not Gaussian,

Equation 143:
\[ p(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{\epsilon^{2}}{2\sigma^{2}}\right) , \]

but Lorentzian,

Equation 144:
\[ p(\epsilon) = \frac{1}{\pi\,\sigma}\,\frac{1}{1 + \epsilon^{2}/\sigma^{2}} \; ? \]

With the Gaussian, when you go to maximize the likelihood

Equation 145:
\[ \mathcal{L} = \prod_i p(\epsilon_i) \propto \exp\!\left(-\frac{1}{2}\sum_i \frac{\epsilon_i^{2}}{\sigma_i^{2}}\right) , \]

you discover that you must minimize the sum of the squares of the residuals. This leads to a very simple and straightforward set of simultaneous linear equations. With the Lorentz function, you get

Equation 146:
\[ \mathcal{L} = \prod_i \frac{1}{\pi\,\sigma_i}\,\frac{1}{1 + \epsilon_i^{2}/\sigma_i^{2}} \quad\Longrightarrow\quad \ln\mathcal{L} = \mathrm{const.} - \sum_i \ln\!\left(1 + \frac{\epsilon_i^{2}}{\sigma_i^{2}}\right) . \]

Try differentiating the right side of this equation with respect to each of the unknown parameters, and see where it gets you. Pretending that the error distribution is Gaussian even if it isn't makes life a lot simpler.

The fact is, with real data you don't know what the probability distribution of the errors is, and you don't even know that it has any particular mathematical form that is consistent from one experiment to another. Most likely, some formula like the Lorentz function - with a well-defined core and extended wings - is a more reasonable seat-of-the-pants estimate for real error distributions than the Gaussian is, because the Gaussian's wings fall off very quickly. As I said, we all know that two- and three-sigma residuals are far more common in real life than the Gaussian would predict. This is because a real observation is likely to contain one or two large errors in addition to a myriad of tiny ones.

For instance, say you are doing photometry, with either a photomultiplier or a CCD. During a ten-second integration things like photon statistics and scintillation in the atmosphere represent a very large number of tiny, quasi-independent errors. According to the Central Limit Theorem, these should indeed tend to add up to an error distribution which is very well approximated by a Gaussian. However, you also have the finite chance that a wisp of cloud, or a flock of birds, or the contrail of a jet could pass in front of your star while you are observing it. With a photomultiplier, the telescope tracking might be erratic, occasionally taking the star image closer to the edge of the aperture, thus causing some random amount of light to be lost. Or, if you are doing CCD photometry, an unrecognized cosmic ray could land on top of one of the star images in your CCD frame, or the night-sky fringe pattern might not subtract cleanly. It is even possible that round-off error in your equipment could contribute a significant part of the difference between the "true" and observed values of your measurement. In short, when you have many very small, independent sources of error which are all added up, yes, the total error should be very close to the Gaussian distribution. But if in addition to those tiny errors, you also have several potential sources of large random errors, one of the fundamental postulates of the Central Limit Theorem is violated, and there is no guarantee that the final error distribution will be driven all the way to a Gaussian.

When an observation can possibly contain one or a few large errors, the error distribution is not necessarily even symmetric. For instance, if we go back to the photometry example, if you think the sky is clear but in fact there are scattered wisps of cloud which occasionally pass in front of the telescope, the cloud will always make the star you happen to be observing at the time appear fainter. The clouds are presumably just as likely to affect your observations of standard stars as of program stars, so in the mean the photometry is still good. However, if the wisps of cloud occupy less than exactly 50% of the sky, there will be a systematic bias: you will tend to measure any given star either a little too bright (compared to the mean) or a lot too faint; if the wisps occupy more than 50% of the sky it will be the other way around. If you occasionally forget to check that the telescope is looking straight out of the slit, so that once in a while the dome covers part of the telescope, or if there is a periodic tracking error which makes the star move near to the edge of the aperture for one second out of every 60, one ten-second integration out of every six will be a lot too faint, while the other five integrations will be a little too bright, compared to the average. In the case of CCD photometry, when a cosmic ray lands on top of a star image it will always make the star appear brighter, but when it lands in a clear-sky region, it may be recognized and ignored. Since there is no law of Nature which says that unrelated phenomena which preferentially produce positive and negative errors must always absolutely balance each other, you can see that real-world observational errors do not only limit the precision of our results, they can often introduce a systematic bias as well. This means that you must always be on the lookout for large residuals, and you must be ready to do something about them.

Of course, you all know what to do with a bad measurement: you reject it. You leave the bad data point out of the solution and fit your model to the rest of the points. Most of us know a bad datum when we see it, but if we want to be scrupulous about it, we will set some limit, for instance three times the standard error, and feel good about rejecting the point if its residual is larger than that limit. If you are a truly honest person, when you publish the paper you will include the bad data point in your plots, but will explain that you omitted it when you computed the fit. Of course, you will feel best about rejecting a bad piece of data if you have some other reason to think that it is bad: for instance, if you encounter a discrepant measurement while reducing some photoelectric photometry, and go back to your observing log to discover that you had written "dome may have been in the way," you would feel perfectly happy to reject the discordant observation. However, if a data point is discrepant enough, you will discard it even if you have no independent reason to suspect it.

What if you don't? Suppose you try to be completely honest and use every single datum, discordant or not? Well, to illustrate what can happen, I ran some more simulations. This time the goal was to compute an estimate of the mean value of some variable from ten independent observations, where each observation has some finite probability of being "bad." For the first test, I gave each observation a 90% chance of being "good," which meant that the observation was chosen at random from a Gaussian distribution with mean zero and standard deviation unity; the other 10% of the time the observation was "bad," which meant that it was chosen at random from a Gaussian distribution with mean zero and standard deviation 10. Fig. 3-9 illustrates this probability distribution (heavy curve), and compares it to a true Gaussian distribution with sigma = 1 (light curve). As you can see, when these two distributions are scaled to the same peak value it's very hard to see that the error distribution of my "observations" is non-Gaussian. Under these conditions, I generated 2,000 different sets of data, each set consisting of 10 independent variables chosen from this composite probability distribution; Table 3-1 contains five typical lists of "data," where the "bad" ones are marked with asterisks.
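If you want to play with this yourself, here is a minimal sketch of the contaminated error distribution assumed in this first test (Python with numpy; this is an illustration written for these notes, not the code actually used to produce the figures and tables). It also counts how often residuals larger than 3 turn up, compared with the rate for a pure unit Gaussian:

```python
# A sketch of the contaminated distribution assumed in this test:
# 90% N(0, 1) ("good") and 10% N(0, 10) ("bad").
import numpy as np

rng = np.random.default_rng(1)

def draw_set(n=10, p_bad=0.1, sigma_good=1.0, sigma_bad=10.0):
    """Draw one set of n 'observations'; each is bad with probability p_bad."""
    is_bad = rng.random(n) < p_bad
    return rng.normal(0.0, np.where(is_bad, sigma_bad, sigma_good))

data = np.array([draw_set() for _ in range(2000)])   # 2,000 sets of ten

# How often do "3-sigma" residuals appear, if you believe sigma = 1?
print("fraction with |error| > 3:", np.mean(np.abs(data) > 3.0))
print("pure-Gaussian expectation:", 2 * (1 - 0.99865))   # about 0.0027
```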

Figure 3-9.

I initially calculated the mean of each sample of ten in two ways: (1) knowing which data were good and which were bad, and weighting them accordingly, and (2) not knowing which data were good and which were bad, and weighting them all equally. Since neither the "good" data nor the "bad" data are biased (both were selected from probability distributions with mean zero), we would expect that the means of the samples of ten will scatter symmetrically about zero. However, how closely they scatter about zero will tell us how good our estimator of the mean is. Thus, by looking at the standard deviation of the 2,000 means generated by each of these two schemes, we can see how good that scheme is at coming up with the "right" answer. The standard deviation of the 2,000 means generated with the first scheme - that is, with perfect knowledge of which data were good and which were bad - was 0.3345; the standard deviation of the 2,000 means generated with the second scheme - that is, with no a priori knowledge of which data were good and which were bad, all data being treated equally - was 1.042. The ratio of the squares of these numbers - the ratio of the weights of the results produced by the two schemes - is 9.7. Call it 10. The second scheme produces results that have lower weight, by a factor of ten, than the first scheme. That means that by blindly including every single observation, whether good or bad, we have lost 90% of the weight of our results! Ninety percent of the effort of collecting the data has been thrown away simply by including that 10% of the data which happens to be of poor quality. We might as well have gotten one night of observing time and reduced that one night carefully, instead of getting ten nights of time and including the bad results with the good.
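A sketch of that numerical comparison (again Python/numpy, with the mixture parameters stated above; the exact numbers will wander a little from run to run):

```python
import numpy as np

rng = np.random.default_rng(2)
is_bad = rng.random((2000, 10)) < 0.1                 # 10% "bad" points
sigma = np.where(is_bad, 10.0, 1.0)
x = rng.normal(0.0, sigma)

# Scheme 1: perfect knowledge of good and bad -- inverse-variance weights
w = 1.0 / sigma**2
mean_known = (w * x).sum(axis=1) / w.sum(axis=1)

# Scheme 2: no knowledge -- straight unweighted mean of each set of ten
mean_blind = x.mean(axis=1)

print(mean_known.std())                               # about 0.33
print(mean_blind.std())                               # about 1.04
print((mean_blind.std() / mean_known.std())**2)       # about 10: the weight ratio
```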

Table 3-1. Five samples of random test data
("bad" observations - drawn from the sigma = 10 distribution - are marked with asterisks)

                             Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
                                0.30      -4.85*      0.16      -0.14       3.96*
                               -0.49       0.23       0.76      -0.19       2.02
                               -0.14       0.05       0.14      -0.23      -0.16
                               -0.14       0.15       3.82*     -7.48*     -0.19
                                1.06      -1.77       0.67      -0.39       0.87
                               -2.71       6.48*     -0.11      20.67*     -1.08
                               -0.35      -1.00      -0.34      -1.24      -0.54
                               -1.56       0.84       1.99       0.77      -0.21
                                0.85      -9.65*     -0.25      -1.33      -2.35
                               -0.16      -0.21       2.18      -7.84*     -1.31

    Mean, CORRECT weights      -0.33      -0.25       0.58      -0.38      -0.32
    Mean, ASSUMED weights      -0.33      -0.97       0.90       0.26       0.10
    Mean, alpha = 2, beta = 2  -0.20      -0.29       0.61      -0.55      -0.25

That factor of ten is not some accident; you can demonstrate from first principles that it is what you should have expected. Remember my statement in the first lecture that when you do properly weighted least squares (obtain the mean, in this case) the weight of the result is equal to the sum of the weights of the individual observations? Well, even though for this experiment any given set of ten data points could contain either no "bad" observations, or one, or two, or ..., or ten, the typical set of ten will contain nine good observations and one bad one. So the typical set of ten will have total weight 9 x 1 + 1 x (1/100) = 9.01, which implies a standard error of 1/sqrt(9.01) = 0.333 for the weighted mean. If we had left the bad data out altogether and just averaged the good ones, we would have gotten a typical total weight of 9.00 - thus the bad data contribute practically nothing to a properly weighted mean. For the equal-weighted mean, however,

Equation 147:
\[ \bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i , \]

propagation of errors tells us we should expect

Equation 148:
\[ \sigma^{2}(\bar{x}) = \frac{1}{10^{2}}\sum_{i=1}^{10}\sigma_i^{2} = \frac{9\times(1)^{2} + 1\times(10)^{2}}{100} = 1.09 , \]

which implies a standard error of 1.044 for the mean. So, as you can see, the factor-of-ten difference in total weight between the two schemes for computing the mean is no accident; it is just what we should have expected.
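The same arithmetic as a few lines of code, for checking the numbers above (nothing here beyond the propagation-of-errors formulas just quoted):

```python
# Typical set of ten: nine points with sigma = 1, one with sigma = 10.
n_good, n_bad = 9, 1
s_good, s_bad = 1.0, 10.0

w_total = n_good / s_good**2 + n_bad / s_bad**2       # sum of the weights
print("weighted mean:     total weight", w_total,     # 9.01
      " sigma", w_total ** -0.5)                      # 0.333

var_equal = (n_good * s_good**2 + n_bad * s_bad**2) / 10**2
print("equal-weight mean: sigma", var_equal ** 0.5)   # 1.044

print("ratio of weights:", var_equal * w_total)       # about 10
```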

Unfortunately, in real life we don't usually know ahead of time which observations are good and which are bad. We can't use the first scheme, and weight every point properly. The most we can hope for is to approximate the first scheme by reducing the weights of those data points which, after the fact, appear likely to be members of the "bad" category. This is why you must be prepared to employ (carefully!) some weight-reduction scheme to make the best possible use of your precious telescope time, if there is any chance at all that your error distribution might be non-Gaussian (practically always). I would like to deny that this necessarily constitutes "fudging," or "cooking" your data. Provided that you are intellectually honest, that you think carefully about what you are doing, and that you tell the referee and the readers of your paper exactly what you have done, this is a legitimate and necessary method for not wasting most of your effort by mixing in extra noise in the form of poor observations.

The problem of exactly how you deal with bad data is especially important in modern astronomy, for two reasons. First, as I said a while ago, we know that some fraction of any set of quantitative observations is going to be affected by occasional large observational errors, for reasons that we can't always figure out. For instance, we know that cosmic-ray events will appear at unpredictable locations in CCD images and they will have unpredictable energies, that the night sky can flicker erratically, and that flocks of birds can fly overhead. With large modern data sets, I suspect that you will virtually always find that the error distribution is significantly non-Gaussian (cf. Fig. 3-9: it's non-Gaussian at the factor-of-ten level!). Second, these large astronomical data sets are almost always analyzed by computer, and the stupid machine doesn't have any common sense. If you tell it to add up a list of numbers and divide by N, it'll do it. The computer won't see anything odd about the data set 0, -1, 3, 1, -2, 31927, 0, 1 - it'll add 'em up and divide by eight. You can spoil a lot of good data this way. Instead, the computer programmer must imagine ahead of time what the possible problems could be, and must come up with purely numerical ways of diagnosing and dealing with these problems, so they can be understood even by a dumb computer.

Of course, the easiest thing to do is to tell the computer to compute the mean, check to see whether any of the data points are way off in left field, and recompute the mean without the oddballs. This method is easily applied to more general least-squares problems, and fits right in with those reductions where you have to iterate anyway, whether because the equations are non-linear or because you have to deal with observational errors in more than one dimension. Unfortunately, the notion of simply rejecting any datum which looks bad is often unreliable, and in fact it involves some profound philosophical difficulties.

To understand what I mean, consider the task of fitting a straight line to the data shown in Fig. 3-10; the dashed line represents the least-squares solution you would get if the obviously "bad" point were included in the fit with its full natural weight, and the solid line is the answer you'd get if the bad point were arbitrarily given zero weight, i.e., if it were discarded. Suppose you had told the computer to derive a least-squares fit using all the data, and then to repeat the fit after rejecting any data point whose residual from that first fit was greater than five standard deviations. Now, suppose that this data point is precisely 4.999 standard deviations away from the dashed line. So everything's hunky-dory: the point is within 5 sigma of the solution, and the dashed line is what you get. But now suppose that the point was only a tiny little bit lower, so that it was 5.001 standard deviations from the dashed line. The computer, following your instructions, would first produce the dashed line as the provisional solution to your problem, then it would see the spurious datum and reject it, then it would compute the solid line as the correct solution, and that's the answer you'd submit to the journal. In other words, by making a tiny change in a single data point - a data point that was already highly suspect to any sensible human observer - you've made a big change in your final answer. This is Not Good.

Figure 3-10.

If this problem were not a simple linear least-squares fit of a straight line, but rather were a non-linear problem or a two-error problem where you had to make an initial starting guess and then iterate to a solution, you'd run into another paradox. If your first guess at the correct straight line lay somewhere below the data, then the discrepant point would be within 5 sigma of the provisional answer, and the solution would eventually settle down - after some iteration - at the position represented by the dashed line. Conversely, if your initial guess at the solution lay generally above the data points, since the discrepant data point was more than 5 sigma from the provisional solution, the computer would reject it early, and the solution would happily converge to the position of the solid line. In other words, you could get two very different answers out of the same data set, depending upon your crude, initial starting guess at the solution. This is also Not Good.

So I've pointed out two major problems with simply telling a dumb computer to reject any data point that lies too far from a provisional solution: you can get big changes in the answer from tiny changes in a single datum, and you can get big changes in the final answer by making different starting guesses. If a human being were plotting the data and sketching the curve, no problem: some reasonable compromise would suggest itself. It turns out that you can get the computer to compromise, too - automatically and without human supervision - once you recognize the problem. Let the machine exercise a little healthy doubt. Mathematically speaking, what I mean is that we should empower the computer to apply partial weights to apparently discrepant observations. In particular, once a provisional solution has been generated, any observation with a residual much smaller than a few sigma would be assigned its full 1/sigma^2 natural weight, while an observation with a residual of many sigma would be given a much smaller weight than it would nominally deserve. On this basis a new solution would be computed, new residuals derived, new weights assigned, and so on, until a consistent, repeatable answer was achieved.
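Here is a minimal sketch of such an iteratively reweighted fit for a straight line y = a + b*x (Python/numpy; the data, the sigmas, and the choice alpha = 2, beta = 2 are invented for illustration, and the down-weighting formula used is the one that will be written down below as Equation 150):

```python
import numpy as np

def fudge(resid, sigma, alpha=2.0, beta=2.0):
    """Weight fudge-factor: ~1 for small residuals, -> 0 for large ones."""
    return 1.0 / (1.0 + (np.abs(resid) / (alpha * sigma))**beta)

def robust_line_fit(x, y, sigma, n_iter=20):
    w = 1.0 / sigma**2                         # start with full natural weights
    for _ in range(n_iter):
        # weighted normal equations for y = a + b*x
        A = np.array([[w.sum(),       (w * x).sum()],
                      [(w * x).sum(), (w * x * x).sum()]])
        rhs = np.array([(w * y).sum(), (w * x * y).sum()])
        a, b = np.linalg.solve(A, rhs)
        resid = y - (a + b * x)
        w = fudge(resid, sigma) / sigma**2     # w_i = f(eps_i) / sigma_i^2
    return a, b

rng = np.random.default_rng(3)
x = np.linspace(0.0, 9.0, 10)
sigma = np.full_like(x, 0.5)
y = 1.0 + 0.5 * x + rng.normal(0.0, sigma)
y[7] -= 5.0                                    # one grossly discordant point

print(robust_line_fit(x, y, sigma))            # should land near the true (1.0, 0.5)
```

Because the weights change gradually with the residuals, nudging the discordant point up or down a little, or starting the iteration from a different provisional fit, moves the answer only slightly.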

So let's rerun our little thought experiment with this scheme in place. The computer fits a straight line to the data in Fig. 3-10, with each point receiving its full, natural, 1/sigma^2 weight, and the dashed line results. The computer then sees one data point lying way off the provisional fit, and repeats the solution giving that point, say, half weight. The new least-squares line lies about half-way between the dashed and solid lines, and the discrepant point now lies a little bit farther from the current provisional fit. So the computer reduces its weight still more, let's say to 0.4 times its natural weight. The provisional fit moves up just a whisker more, and soon settles down between the solid line and the dashed line, but closer to the solid line, with the computer ultimately assigning the oddball point maybe 0.38 times its natural weight. Move the discrepant point up or down by a bit and the solution hardly changes.

Now let's see what happens with the iterated nonlinear reduction. If our starting guess is somewhere below the bulk of the data points, the discrepant point will start off with a comparatively high weight, but the other points will still tend to pull the solution up. The weight of the discrepant point will be reduced, and the solution will continue to move up, eventually settling down somewhere just below the middle of the bulk of the data points. On the other hand, if the starting guess were somewhere above the data points, the weight of the oddball would start out low. But as the other data pulled the solution down, the oddball's weight would increase, and the solution would once again settle down somewhere just below the middle of the bulk of the data points - at exactly the same place as before.

Now don't get me wrong. I'm not saying that the answer this scheme gives you is the "right" solution; I'm not saying that it's the "best" solution; I'm only saying that it's a consistent, repeatable solution which is robust against both small changes in individual data points and large changes in the starting guess. It keeps the idiot computer from being satisfied with some non-unique answer. If you were drawing a smooth curve through data points by hand, you wouldn't need a scheme like this: you'd realize that the oddball point was a little iffy, and would concentrate on making sure the curve described most of the data well. A computer just isn't this smart all by itself. When you do a photometric reduction of a CCD frame, the computer is solving thousands of little least-squares problems; for a single observing run, there may be hundreds of thousands or millions of least-squares problems. A scheme which teaches the computer to be less influenced by discrepant data may allow you to get a few pretty good answers when otherwise you would have gotten garbage. In fact, I think I'll go out on a limb and argue that when you don't know what the true probability distribution of the observational errors is - when you only know that it probably isn't Gaussian - there is no such thing as the "right" or "best" answer; a consistent, repeatable, robust answer is the very most you can hope for.

There are only a few things that you want to be sure of, for logical consistency and simplicity. (1) Points with residuals small compared to their standard errors should get something approaching their full natural weight. Let me use f(epsilon) to represent the fudge-factor by which the weight of an individual data point is changed on the basis of the size of its residual. Now the weight that the point is actually assigned is given by w_i = f(epsilon_i) / sigma_i^2 (let's just say the scaling factor s = 1, shall we?), and we want lim_(epsilon -> 0) f = 1. (2) Really, really bad points should have weights approaching zero: lim_(epsilon -> infinity) f = 0. That way, you know it doesn't matter whether the oddball has a 100 sigma residual or a 200 sigma residual, you'll get essentially the same answer in either case. In fact, if you consider the condition for a minimum of chi-squared,

Equation 149:
\[ \frac{\partial \chi^{2}}{\partial a} \;\propto\; \sum_i \frac{f(\epsilon_i)\,\epsilon_i}{\sigma_i^{2}}\,\frac{\partial \epsilon_i}{\partial a} = 0 , \]

you can see that you really want f to fall off at least as fast as epsilon^(-1); otherwise the really humongous blunders will continue to distort the solution more the larger they get. (3) The fudge-factor f should be a monotonic, gradual, smooth function of |epsilon|, or else you'll get back into the paradoxes: small changes in a datum (changes which would move epsilon across a discontinuity in f) would produce discrete changes in the solution, and different starting guesses might converge to different stable points, each solution generating values of f which in turn reproduce that same solution. If f is sufficiently gradual and continuous, this won't happen.

For some years now, I have been using one particular scheme meeting the above criteria, which has turned out to be handy in a wide variety of applications:

Equation 150:
\[ f(\epsilon) = \frac{1}{1 + \left(\dfrac{|\epsilon|}{\alpha\,\sigma}\right)^{\beta}} , \]

where alpha and beta are two constants which can be set so as to fine-tune the machinery to a given application. Fig. 3-11 shows three functions f, for alpha = 2 and beta = 2, 4, and 8. If you study the equation and the figure, you'll see that this formula does indeed go to unity for small epsilon and to zero for large epsilon, and as long as beta >= 1 it will fall off at least as fast as epsilon^(-1) for large epsilon. You may also ascertain that alpha represents the half-width of the function: any observation whose residual is alpha times its standard error will be assigned half its natural weight. Finally, note that in the limit of very large beta, this scheme degenerates to the old, classical, reject-any-observation-whose-error-is-larger-than-some-limit: as beta -> infinity,

Equation 151:
\[ f(\epsilon) \;\longrightarrow\; \begin{cases} 1, & |\epsilon| < \alpha\,\sigma \\ 0, & |\epsilon| > \alpha\,\sigma . \end{cases} \]
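A quick numerical check of these claimed properties (a sketch only; the residual values and the beta = 100 stand-in for "very large beta" are arbitrary):

```python
import numpy as np

def fudge(eps, sigma=1.0, alpha=2.0, beta=2.0):
    return 1.0 / (1.0 + (np.abs(eps) / (alpha * sigma))**beta)

eps = np.array([0.1, 1.0, 2.0, 4.0, 20.0])        # residuals, in units of sigma
for beta in (2.0, 4.0, 8.0, 100.0):               # beta = 100 ~ the classical hard cut
    print(beta, np.round(fudge(eps, beta=beta), 3))
# Every row gives f = 0.5 at eps = 2 (half weight at alpha = 2 standard errors);
# for beta = 100 the weights are essentially 1 below alpha*sigma and 0 above it.
```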

Figure 3-11.

There is one case which is worth special mention: beta = 1. This is illustrated in Fig. 3-12. It looks a bit odd, but to understand what is neat about this fudge-function, reconsider the sum

Equation 152:
\[ \sum_i \frac{f(\epsilon_i)\,\epsilon_i}{\sigma_i^{2}} = \sum_i \frac{1}{\sigma_i^{2}}\,\frac{\epsilon_i}{1 + |\epsilon_i|/(\alpha\,\sigma_i)} . \]

For epsilon >> alpha sigma, epsilon cancels out. This means that, all other things being equal, any large residual contributes exactly the same amount to this sum as any other, only the sign of the residual being retained. By making alpha comparatively small, we can make our solution depend upon making the number of positive residuals equal the number of negative residuals (scaled, of course, for the weight of each point and the sensitivity of the fitting parameter to each point), regardless of the size of those residuals. In other words, we now have a mechanism which permits us to do a least-squares-like analysis, possibly with an equation which is nonlinear in the parameters, possibly with data possessing errors in more than one dimension, which finds a solution passing through the weighted median of the data rather than through the mean. This can be really handy with really cruddy data.
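To see the "every large residual votes with the same strength" behaviour numerically, here is a tiny sketch (alpha and sigma chosen arbitrarily):

```python
import numpy as np

alpha, sigma = 0.5, 1.0

def f_times_eps(eps):
    # beta = 1:  f(eps) * eps = eps / (1 + |eps| / (alpha * sigma))
    return eps / (1.0 + np.abs(eps) / (alpha * sigma))

for eps in (0.1, 1.0, 5.0, 50.0, -50.0):
    print(eps, round(f_times_eps(eps), 3))
# 0.083, 0.333, 0.455, 0.495, -0.495: large residuals all contribute about
# +/- alpha*sigma = +/- 0.5 to the sum, whatever their actual size.
```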

Figure 3-12.

Yes, yes, yes, but does it work? Well, I reran the reductions of the same 2,000 sets-of-ten as before, again without using any a priori knowledge of which observations were "good" and which were "bad," but this time using my weight-fudging scheme to reduce the impact of data points whose extreme values made them appear somewhat suspect. Representative means derived with alpha = 2 and beta = 2 are shown on the last line of Table 3-1. In Fig. 3-13 I have plotted the standard deviation of the 2,000 fudged means derived for various values of alpha and beta. The horizontal line at sigma = 0.33 represents the standard deviation of the means achieved with perfect knowledge of which points are good and which are bad, and the horizontal line at sigma = 1.04 represents the standard deviation obtained with a blind, weight-'em-all-the-same attitude. As you can see, the quality of the answers that you get with my automatic reweighting scheme is surprisingly insensitive to which values of alpha and beta you adopt. Under the hypothesized circumstances, with one point in ten being a factor of ten worse than the rest, by using intermediate values of alpha and beta (setting each of them to somewhere around 2 or 3) you get marginally better results, on average, than with more extreme values. It's also pretty obvious that while this reweighting scheme is never as good as having Perfect Knowledge of Reality, it's a darn sight better than a stubborn insistence on treating all the data equally, no matter how odd they may seem.
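For the record, a sketch of this experiment (Python/numpy; alpha = 2 and beta = 2, as in the last line of Table 3-1, and the same 90%/10% mixture as before):

```python
import numpy as np

rng = np.random.default_rng(4)

def fudged_mean(x, sigma=1.0, alpha=2.0, beta=2.0, n_iter=10):
    m = x.mean()                                   # provisional answer: plain mean
    for _ in range(n_iter):                        # reweight and repeat
        w = 1.0 / (1.0 + (np.abs(x - m) / (alpha * sigma))**beta) / sigma**2
        m = (w * x).sum() / w.sum()
    return m

is_bad = rng.random((2000, 10)) < 0.1
x = rng.normal(0.0, np.where(is_bad, 10.0, 1.0))

means = np.array([fudged_mean(row) for row in x])
print(means.std())   # much closer to 0.33 (perfect knowledge) than to ~1.0 (blind)
```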

Figure 3-13.

How does my gimmick work under other circumstances? To find out, I ran some more simulations, of course. Suppose there is a preferential bias among the "bad" data points? For my second test, I assumed as before that the probability of getting a "good" measurement was 90%, where again a good measurement had mean zero and standard deviation unity; this time, however, the 10% of "bad" measurements were assigned mean 2 and standard deviation 5. Fig. 3-14 shows this probability distribution as a heavy curve and, as before, the light curve is a genuine Gaussian probability distribution with sigma = 1, which has been scaled to the same peak value. Again, the difference between the "true" probability distribution for my "observations" and a Gaussian is slight, and would be impossible to detect in any small sample. The bias in the "bad" data, in particular, is barely detectable. In this test, the "correct" answer we want to recover from our data is 0.0 for our hypothetical physical quantity. Unlike in the previous test, in this one we can use both the mean of the 2,000 answers and their standard deviation as different measures of the quality of the weighting scheme: the mean of the answers will show the extent to which the bias in the bad data points influences the accuracy of our answer, while the standard deviation of the 2,000 answers will show how the extra scatter in the bad points affects the precision of the result. When we use our Perfect Knowledge of the probability distribution of each datum, we can both correct the bad ones for their systematic bias and weight them appropriately. Therefore the expected mean of the 2,000 means would be 0.0, and we expect the standard deviation of those 2,000 means to be

Equation 153:
\[ \sigma(\bar{x}) = \left[\,9\times\frac{1}{(1)^{2}} + 1\times\frac{1}{(5)^{2}}\,\right]^{-1/2} = \frac{1}{\sqrt{9.04}} \approx 0.33 . \]

Figure 3-14.

However, when we pretend we don't know which observations are good and which are poor, and blindly take the straight mean of all the data in each set of ten, we get answers which average

Equation 154:
\[ \langle\bar{x}\rangle = 0.9\times(0.0) + 0.1\times(2.0) = 0.2 . \]

Naively you'd expect a standard deviation

Equation 155:
\[ \sigma(\bar{x}) = \left[\frac{9\times(1)^{2} + 1\times(5)^{2}}{10^{2}}\right]^{1/2} = \sqrt{0.34} \approx 0.58 , \]

but, in fact, the bias in the "bad" data introduces additional scatter: the possibility that any given set of ten might have two bad data points, or three, or none, makes the answers scatter by more than their total weights would imply. I actually observed sigma(x_bar) = 0.647 for these 2,000 data sets. Now, when I turn on my magic reweighting scheme with various choices of alpha and beta, I get the means represented in Fig. 3-15 by the big dots connected by smooth curves. As you can see, with a fairly severe clipping of the apparently discrepant data (alpha ~ 1.5-2, beta ~ 4-8) we can virtually eliminate the systematic bias in the results: the minimum value of <x_bar>, obtained with alpha = 1.5 and beta = 4, is 0.0055 - more than thirty times smaller than the bias produced by accepting all data blindly. Taking the "median" of the data (actually, adopting my weighting scheme with beta = 1 and alpha small, which, as I said earlier, shades the answer toward the median) offers the least help: it reduces the bias only by a little over a factor of 2. The standard deviations of the 2,000 means are shown in Fig. 3-16, which suggests that the precision of the answers is optimized by a somewhat more gentle clipping than that yielding the best accuracy: values of alpha >= 2, beta >= 2 yield sigma(x_bar) of approximately 0.36 (as compared to 0.33 for perfect knowledge of good and bad, and 0.65 for an equal-weight reduction).
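The same sketch as before, re-run with this second test's assumed mixture (90% N(0, 1) and 10% N(2, 5)); here alpha = 2 and beta = 4, one of the "fairly severe" clipping choices mentioned above:

```python
import numpy as np

rng = np.random.default_rng(5)

def fudged_mean(x, sigma=1.0, alpha=2.0, beta=4.0, n_iter=10):
    m = x.mean()
    for _ in range(n_iter):
        w = 1.0 / (1.0 + (np.abs(x - m) / (alpha * sigma))**beta) / sigma**2
        m = (w * x).sum() / w.sum()
    return m

is_bad = rng.random((2000, 10)) < 0.1
x = np.where(is_bad,
             rng.normal(2.0, 5.0, (2000, 10)),     # "bad": biased and noisy
             rng.normal(0.0, 1.0, (2000, 10)))     # "good"

blind = x.mean(axis=1)
robust = np.array([fudged_mean(row) for row in x])
print("blind : bias", blind.mean(), " scatter", blind.std())    # bias near 0.2
print("robust: bias", robust.mean(), " scatter", robust.std())  # bias much reduced
```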

Figure 3-15.

Figure 3-16.

Well, suppose you make a serious mistake: suppose you think that your measurements are pretty good, but in fact most of them contain large random errors from a source you are unaware of? For my third experiment, I assumed that there was only a 30% chance that any given datum was "good" (mean zero and standard deviation unity) and a 70% chance that it was "bad" (mean zero and standard deviation 10; see Fig. 3-17). Suppose you try to reduce these data blithely assuming that every data point has sigma = 1, so that they all get equal weight? Well, pretty obviously you're going to get means from your sets-of-ten showing a standard deviation

Equation 156:
\[ \sigma(\bar{x}) = \left[\frac{3\times(1)^{2} + 7\times(10)^{2}}{10^{2}}\right]^{1/2} = \sqrt{7.03} \approx 2.65 . \]

For comparison, if you knew the actual standard error of each datum, the very best you could hope for would be means showing a standard deviation

Equation 157:
\[ \sigma(\bar{x}) = \left[\,3\times\frac{1}{(1)^{2}} + 7\times\frac{1}{(10)^{2}}\,\right]^{-1/2} = \frac{1}{\sqrt{3.07}} \approx 0.57 . \]

(Most groups of ten will have three or so decent observations; by knowing which ones they are and neglecting the rest - well, giving them the one-one-hundredth weight that they deserve - you could come up with OK results. But we don't usually know which ones are good in real life, do we?) [4] With data this badly corrupted, even my reweighting scheme can't perform miracles. But, as Fig. 3-18 shows, it is still better than blindly accepting all data as of equal legitimacy. Of the weighting schemes I tried, the pseudo-median is the best. With beta = 1-2 and almost any value of alpha, you can get the standard deviation of the means down to as little as 2.0. When compared to the "blind" standard deviation of 2.65, this represents roughly a 25% reduction in the standard error, or a 76% increase in the weight of your results. This is the same as getting seven nights of telescope time for the price of four, all through enabling the computer to feel some doubt about apparently discordant data.
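The corresponding sketch for this third, mostly-bad case (70% of the points drawn from N(0, 10)), using beta = 1 with a smallish alpha, i.e. the pseudo-median; the choice of starting point is mine, not prescribed in the text:

```python
import numpy as np

rng = np.random.default_rng(6)

def fudged_mean(x, sigma=1.0, alpha=1.0, beta=1.0, n_iter=20):
    m = np.median(x)                      # start near the middle of the data
    for _ in range(n_iter):
        w = 1.0 / (1.0 + (np.abs(x - m) / (alpha * sigma))**beta) / sigma**2
        m = (w * x).sum() / w.sum()
    return m

is_bad = rng.random((2000, 10)) < 0.7                    # 70% "bad" this time
x = rng.normal(0.0, np.where(is_bad, 10.0, 1.0))

print("equal weights:", x.mean(axis=1).std())                          # about 2.65
print("pseudo-median:", np.array([fudged_mean(r) for r in x]).std())   # roughly 2
```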

Figure 3-17.

There is one more question which you must have answered before you will believe that what I'm telling you is any good. "Suppose there is nothing wrong with any of my observations. How much honest information will I be throwing away by telling the computer to reduce the weight of any data point it doesn't like?" One more simulation. This time I gave each datum a 100% chance of being "good," with the same normal, Gaussian probability distribution with mean zero and standard deviation unity. Again, I generated 2,000 data sets of size ten and took their means both without and with my automatic reweighting scheme. The results are shown in Fig. 3-19, and the answer is: unless you use a truly vicious clipping scheme with very large beta and very small alpha - so that the computer completely discards any data point which lies the least little bit away from the mean of its group of ten - you lose very little. When beta = 1 and alpha = 1 (so that, in effect, your answer is the mean of those data lying less than 1 sigma from the answer, shaded toward the median of the rest) the standard deviation of the 2,000 answers is 0.3265, as compared to the "right" answer of 1/sqrt(10) = 0.3162, and things get even better for larger alpha. For a more moderate clipping scheme, beta = 2 and alpha = 2.5, I find the standard deviation of the answers is 0.3185. This represents a 1.5% loss of information, as compared to using the "right" observational weights - the rough equivalent of throwing away one minute's observing time out of every hour. How many of us operate at 98.5% efficiency when performing a chore for which we were not intended?
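And the final check, with no bad data at all, comparing the plain mean against the reweighted mean for the moderate clipping quoted above (alpha = 2.5, beta = 2):

```python
import numpy as np

rng = np.random.default_rng(7)

def fudged_mean(x, sigma=1.0, alpha=2.5, beta=2.0, n_iter=10):
    m = x.mean()
    for _ in range(n_iter):
        w = 1.0 / (1.0 + (np.abs(x - m) / (alpha * sigma))**beta) / sigma**2
        m = (w * x).sum() / w.sum()
    return m

x = rng.normal(0.0, 1.0, (2000, 10))                  # every point is "good"
plain = x.mean(axis=1)
robust = np.array([fudged_mean(row) for row in x])

print("plain mean scatter:", plain.std())             # about 1/sqrt(10) = 0.316
print("reweighted scatter:", robust.std())            # only slightly larger
```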

Figure 3-18.

Please let me summarize my basic conclusions from the second half of this talk. First, unless you know with 100% absolute certainty that the distribution of your observational errors is Gaussian, and unless you know with perfect accuracy what the standard error of every one of your observations is, classical, standard, look-it-up-in-a-cookbook least squares does not guarantee you the best possible answer from your data. In fact, with only a tiny admixture of poor observations, you can destroy a great deal of information. You must remember to teach your computer to recognize grossly discordant data (as in the data set 0, -1, 3, 1, -2, 31927, 0, 1, which I used as an example earlier). However, under some circumstances (i.e., iterated non-linear or multi-error solutions) telling the computer to reject discordant data does not guarantee you a self-consistent, repeatable answer from a given data set. You can partially overcome these problems and improve the consistency, precision, and accuracy of your final answers by enabling the computer to feel a healthy skepticism about apparently discordant data. When I am reducing real data, unless I have an absolute conviction that the errors are Gaussian, completely Gaussian, and nothing but Gaussian with known sigmas, I find that routinely applying my reweighting scheme with alpha ~ 2-2.5 and beta ~ 2-4 does my iterated least-squares solutions a whole lot more good, on average, than harm.

Figure 3-19.

Does this mean that you can now feed your computer any kind of garbage and have it give you back Truth? Of course not. Does this mean that the astronomer no longer has to look at his or her data? Of course not. An empirical reweighting scheme like this one offers you some insurance against non-ideal data, but it does nothing to protect you against fundamental paradigm errors. Your discordant data points may be due not to wispy clouds, flocks of birds, or cosmic rays; instead they may be heralding a new law of physics. Computers can't recognize this (at least, not yet), only trained scientists can. In the next lecture or two I plan to give you some practical data-reduction tasks where my reweighting scheme helps, but these are all cases where there is very little danger of uncovering new laws of physics. In short, I think it is still true that people who are unwilling to think won't do very good science, with or without computers. There is still no practical substitute for a thorough understanding of your material and some deep thinking about its implications.


[4] In fact, even knowing the weights doesn't actually produce means this good. Samples of size ten are not big enough for the Central Limit Theorem to drive the probability distribution for the mean of each sample to the Gaussian distribution when the distribution for the individual observations is this strongly non-Gaussian. With an average of three "good" data points per sample of ten, there is a finite chance that any given sample will contain no good data. The means of such samples will typically be so far from zero that they will drive up the root-mean-square value of the means. I actually observed 0.818 - rather than 0.571 - as the standard deviation of my 2,000 means-of-ten, assuming perfect knowledge of good and bad.
