1. Fitting a model profile to a star image
Figure 2-1 |
In Fig. 2-1 I illustrate an observed radial intensity profile for a star. To obtain this profile I took the actual digital data for an isolated star in a CCD image which I happened to have, estimated the position of the star's centroid by eye, and obtained the average intensity within the star image as a function of the radial distance from the center by averaging the observed pixel values in narrow, concentric annuli. Now, it has been known for a long time that the so-called "Moffat" function is a very good representation of the radial profile of a star image (Moffat 1969; Buonanno, et al. 1983). In its simplest useful form, the Moffat function is
where S stands for the Specific intensity (or, if you prefer, the Surface brightness) at some point located a radial distance r (= sqrt[(x - x_{0})^{2} + (y - y_{0})^{2}]) from the center of the stellar image. As you can see, besides the independent variable r, the Moffat function also involves four unknown parameters, which must be determined empirically: I denote them B, which is the Background diffuse sky brightness caused by emission from Earth's atmosphere, zodiacal light, unresolved background galaxies, and so on (i.e., B = lim_{r->} S(r), as you can easily confirm); C, representing the star's Central surface brightness above the local background (i.e., C = S(r = 0) - B, ¿no es verdad?); then there is R, which is a sort of characteristic Radius of the star image, determined by such broadening mechanisms as seeing, and guiding errors; and finally, there is , which describes the asymptotic power-law slope of the wings of the image at large radial distances from the center (that is, for large r, S(r) - B -> C . (r^{2} / R^{2})^{-}). In particular, if = 1, then the stellar profile is a Lorentz function, and R is the half-width at half maximum, since in this case S(R) - B = C / 2. (However, real star images cannot be perfect Lorentz profiles since
Of course, in real life we can't integrate a stellar profile to infinity; we can observe it only out to radians if the star is at the zenith! However, the Lorentz profile still has an unreasonable amount of flux out in the wings.)
For the sake of pursuing this first example, let us assume that the value of is known. If we also assume that R is known, then the problem is linear in the unknowns C and B ^{(1)} so we know how to do the least-squares solution. But let's take it to the next degree of difficulty, and assume that we want to solve for C, B, and the seeing parameter R all at once. How shall we do this? Well, we have to go back to the basics. The whole thing about "least squares" is you want to minimize the sum of the squares of the fitting residuals - that's how you get the maximum-likelihood answer, and that hasn't changed. Where does it occur that ^{2} = ^{2} minimized? Why, at the point where the first derivative of ^{2} with respect to every one of the fitting parameters is zero, simultaneously. Let's do it.
(Remember, for the time being we're assuming that is known.)
Similarly,
and
The magical difference between this problem and the linear least-squares problems we looked at in the previous lecture is this: whereas before, when we took
the derivatives ð / ða did not contain any unknown fitting parameters, a, - they depended only upon known quantities, t. Obviously, that's because those equations were linear in the fitting parameters, a. That meant that we were able to pull the constant (but unknown) fitting parameters from the 's out of the summations. The summations that were left then included only known quantities; they could be computed; and the resulting system of equations - all of which were linear in the unknown parameters - could be solved for the a's. In our current problem, the ð / ða's do contain a's; the unknowns can't be taken out of the summations; the summations can't be evaluated; and we can't solve for the parameters until after we know them.
The way out is to use one of the favorite tools of mathematical physicists (and statisticians like it, too!): the method of successive approximation. That is, we guess at approximate solutions for our three unknowns, C, B, and R; I will call these guesses C_{0}, B_{0}, and R_{0}. If we write the final, least-squares solution that we want so badly as S(r; , , ), we can approximate this by a Taylor expansion evaluated with our initial guesses:
which leads to the set of fitting equations
or, equivalently,
We may recognize that this equation is of the same form as
where
and
We already know how to solve these linear equations for the three unknown parameters C, B, and R. When we crank up the old machinery we can compute the improved parameter values C_{1} = C_{0} + C, B_{1} = B_{0} + B, and R_{1} = R_{0} + R. ^{(2)} Unfortunately, because our first-order Taylor expansion is only approximate, we can be reasonably certain that the new values C_{1}, B_{1}, and R_{1} will not be exactly equal to the best values , , and - which is why I dropped the "^" notation and used the subscript_{1} instead - and they probably won't even be "good enough for all practical purposes." Nevertheless, they should be better estimates than C_{0}, B_{0}, and R_{0} were. So, we just plug them back into the equations,
and try again. These new equations can now be solved for fresh values of C, B, and R (which should mostly be smaller than the previous ones), and we may take C_{2} = C_{1} + C, B_{2} = B_{1} + B, and R_{2} = R_{1} + R as still better estimates of the desired fitting parameters. This whole process may be repeated as many times as necessary. Of course, as C, B, and R become smaller and smaller, the Taylor expansion becomes better and better. The solution should quickly converge, and when C, B, and R have all become negligibly small we have found the least-squares solution to our problem. This leads me to . . .
A GENERAL METHODOLOGY FOR NONLINEAR LEAST SQUARES
Step 1.
Write down your analytic equation, and derive its first derivative with respect to each of the unknown fitting parameters. In the example above,
Step 2.
By any means available, guesstimate approximate values for the fitting parameters. For instance, in the example above you might take the largest observed value of s_{i} as your initial guess for C, and s_{N} as your initial guess at B. These guesses don't have to be "best" values - that's why we're doing the least squares. They just have to be vaguely all right (except sometimes).
Step 3.
For each data point, calculate the error of the initial model, and use these residuals to solve for corrections to the parameter estimates by ordinary linear least squares. In the example above, this step is represented by the equation
Step 4.
Add the parameter corrections which you have just computed to your initial guesses at the parameter values, to obtain improved parameter values.
Step 5.
Examine the parameter corrections which you computed in Step 3. How big are they? If any or all of them are larger than you would like, go back and do Steps 3 and 4 again. Keep doing Steps 3, 4, and 5 until all parameter corrections are negligibly small. When this is accomplished, you have achieved your desired least-squares solution!
Step 5 (alternate).
Using the current set of parameter estimates, calculate the m.e.1 of your data points with respect to the current model. In the notation of the example above, this would be
Repeat Steps 3, 4, and 5 as long as the m.e.1 continues to get significantly smaller. When the root-mean squar e residual no longer decreases with additional iterations, you have achieved your desired least-squares solution!
Whether you choose to adopt Step 5 or Step 5 (alternate) may depend upon why you want to solve the least-squares problem in the first place. If you are actually interested in the best numerical values for the fitting parameters, you will want to use Step 5. For instance, if you are fitting the model profile to the star image because you want to know how bright the star is, what you want is accurate values for C and R. Another example is the radioactive decay experiment from my last lecture, where you really wanted to know the actual abundances of each of the isotopes: if for some reason you found yourself doing iterated least squares (cf. Lecture 3) you would keep iterating until you got the parameter values as accurately as you could; you would adopt Step 5 as your convergence criterion. However, suppose you wanted to model a stellar surface-brightness distribution simply in order to interpolate the star's intensity profile to some radius between two of your r_{i} values? In this case you don't really care what C and R are, you just want to know that the model represents the data to the required level of precision. Therefore you might adopt Step 5 (alternate). Similarly, in the problem where we transformed the coordinate systems of CCD frames yesterday: maybe you aren't interested in the actual values of the translation, scale, and rotation constants, you just want to be sure that the transformations are good to, say, 0.1 pixel for the average star; if you had to iterate the solution for some reason, you would stop iterating when and if the mean error of unit weight dropped significantly below this value.
I guess this is a good time to ramble on philosophically about some of the subtleties involved here. Which brings me to my first statement of a theme which will recur throughout the rest of these lectures, to wit: computers may be awfully smart, but computers are awfully stupid! The computers available today, at least, may be really good at arithmetic but they aren't yet much for common sense. One of the best examples of this is they don't know when to quit. If you tell the computer to go and compute sums and invert matrices until C, B, and R go to zero, the computer will keep on going forever or until the next power failure, whichever comes first. Because of truncation error, ð^{2} / ðC probably will never go identically to zero: after an infinite amount of iterating, your current best value of C will give still you a finite value for C, but try to improve your estimate of C by one quantum in the least significant bit, and you'll get another finite correction C of the opposite sign. It's time for the astronomer/programmer to step in and impose a little common sense. When you write the program, you must decide ahead of time just how good you need your answer to be, and tell the computer that it is allowed to quit when it gets there.
As I said above, there are various ways this can be done in our profile-fitting problem, and I'd like you to think about some of them for a moment: you might tell the computer to quit when the size of the correction to the central surface brightness drops to |C| < 0.001 electrons per pixel; alternatively, since the likelihood is that you really want C in order to compute the star's magnitude by some formula like magnitude = constant - 2.5 log C, you could tell the computer to quit when |C| < 0.001C - when an iteration changes the star's brightness by less than a millimagnitude. Of course, for faint stars, you're never going to get an answer good to a millimagnitude anyway, just because of the noise, so maybe you'll tell the computer to quit when |C| < 0.1 (C), where (C) can be computed by the method I gave you at the beginning of this lecture. In this case you're telling the computer to quit as soon as the correction becomes much smaller than the uncertainty of the correction = the uncertainty of the result. In other cases it may be most convenient simply to stop iterating when the m.e.1 stops decreasing by, say, a part in a thousand. Every situation is a little bit different, and that's why I mostly write my nonlinear least-squares programs from scratch every time I need to solve a different sort of problem. This is one of the areas where practical mathematics becomes a creative endeavor, where it borders on the artistic as well as the scientific. You need to be able to step back from the problem, see what its truly important aspects are, and explain it to a dumb machine.
Now, back to the example. Figs. 2-2 and 2-3 represent a nonlinear least-squares fit of a Lorentz function (a Moffat function with 1) to the observed profile data illustrated in Fig. 2-1. Fig. 2-2 is a photocopy of my actual computer printout documenting the convergence of the fit. I told the computer that I wanted to solve for three unknowns (C, B, and R), and I then provided it with starting guesses for the four parameters of the Moffat function; of course, was to be held fixed at the value specified ( 1), while the other three were destined to be improved by the nonlinear least-squares solution. My initial guesses for C, B, and R were 4000, 0, and 5, respectively, and the computer took it from there. Succeeding pairs of lines show (upper) the current working estimates of the three fitting parameters C, B, and R, and (lower) the corrections C, B, and R derived from those parameters. The last number on the upper line of each pair is the value of the m.e.1 derived from the differences between the data and the current model. (For this experiment I gave every data point unit weight; the m.e.1 therefore represents an estimate of the r.m.s. error of each raw datum = the brightness in a particular annulus. This isn't what I'd do if I were really serious; in actuality the weight should reflect the total readout noise and the Poisson noise in each annulus, but this is just for show so wotthehell, wotthehell.) The last two rows of numbers - under where it says "Converged" - represent the final derived parameter values and their uncertainties, calculated as described at the beginning of this lecture. I arbitrarily imposed a convergence criterion of 0.001 electrons per pixel in the value of C. As you can see, this was undoubtedly too strict since the computer could have stopped working several iterations sooner without producing significantly poorer answers, whether by the milllmagnitude criterion, by |C| << (C), or by the leveling-off of ^{2}.
Enter number of terms 3 Enter starting guesses for C, B, R, and beta 4000 0 5 1 C B R m.e.1 4000.000 0.000 5.000 775. 2757.100 -214.882 -1.947 6757.100 -214.882 3.053 456. -208.755 -47.599 1.050 6548.345 -262.481 4.104 164. 268.906 -15.772 -0.065 6817.251 -278.253 4.039 147. -3.557 -2.115 0.011 6813.694 -280.368 4.050 147. 0.725 0.349 -0.002 6814.419 -280.019 4.049 147. -0.094 -0.047 0.000 6814.325 -280.066 4.049 147. 0.013 0.006 0.000 6814.337 -280.060 4.049 147. -0.001 -0.001 0.000 Converged 6814.336 -280.061 4.049 +/-122.805 46.856 0.131
Fig. 2-3 shows the raw data as points, the "initial guess" at the solution as the dashed curve which peaks near 4000, the improved solution after a single iteration as the other dashed curve, and the final solution as the solid curve. I think you can see that, even with pretty horrible initial guesses at the solution, after one or two hops to get itself into the right ballpark, the solution converges like a bat out of . . ., the solution converges pretty quickly. Still, real afficionados would not be satisfied with this fit for two reasons: first, if you look carefully at Fig. 2-3 you'll see that the data points tend to fall consistently below the best solution for 8 r 15 and above it for r 20; second, the background sky brightness comes out significantly negative (at the 6 level!) - a situation which you'll admit is pretty unusual for a real astronomical image.
Figure 2-3 |
What shall we do about this? One obvious solution is to admit that we were pretty foolish to assume that 1. Suppose really has some different, unknown value? Why don't we solve for it? Can we really??? Sure, why not? (But you don't need to get so excited.) All we have to do is work out one (pretty hairy) derivative:
Well, I always have to go back to the CRC Tables for this one, but
is what we need to work out
Let a -> 1 + r^{2} / R^{2}, u -> - ðu / ð -> -1
So now we've got it knocked. We just perform our same old linearized nonlinear least-squares solution for four unknowns rather than three. The results of this exercise are shown in Fig. 2-4. As you can see, by merely solving for rather than imposing a guess, we have improved the fit by a factor of 50. (Since, as you know, the standard error scales as the square root of the number of observations, this is the same level of improvement as we might have gotten by reobserving the star 2500 times and averaging the results, except for the fact that 1 would probably always have been wrong, so even by reobserving 2500 times we wouldn't have gotten the improvement. Much better to get the model right.) I won't bother to illustrate the fit. Suffice it to say that the solid curve goes right through the middle of all the points like a string through beads, as the m.e.1 = 3 (3 parts in 6,000!) attests.
Enter number of terms 4 Enter starting guesses for C, B, R, and beta 4000 0 5 1 C B R beta m.e.1 4000.000 0.000 5.000 1.000 791. 1946.211 355.223 2.606 1.338 5946.210 355.223 7.606 2.338 397. 285.621 -315.214 0.441 0.457 6231.832 40.009 8.046 2.795 18. 2.386 -5.279 0.180 0.129 6234.218 34.730 8.226 2.924 3. 0.065 -0.143 0.007 0.006 6234.283 34.587 8.233 2.930 3. -0.002 0.001 0.000 0.000 6234.281 34.588 8.233 2.930 3. 0.000 0.000 0.000 0.000 Converged 6234.281 34.588 8.233 2.930 +/- 3.044 1.104 0.034 0.020
Just for fun, I ran the four-parameter fit again with a different set of starting guesses; the printout from this reduction is shown as Fig. 2-5. Oops. This is yet another example of the fundamental stupidity of the computer. I told the heap of dirty silicon to evaluate some algebraic derivatives, compute some sums, invert and multiply some matrices, and add the resulting numbers to the original numbers, and that's what it did. It didn't notice - or care - that the results were garbage; if I didn't warn it to watch out for trouble it sure wasn't going to notice trouble on its own. The problem here is that sometimes the first-order Taylor expansion is not quite good enough. Certainly, given any set of fitting parameters, you can compute the value of ^{2} corresponding to those parameters. And furthermore, the first derivative of ^{2} with respect to each of those fitting parameters will tell you which direction each fitting parameter must change to make ^{2} smaller. So in principle the parameter values ought to be able to slide down the ^{2} gradient all the way to the minimum. The problem with the first-order Taylor expansion is that it may not always correctly predict how much the parameter value should change during a given iteration. Sometimes it can overestimate a prediction so badly that the provisional parameter values are shot straight out of the ^{2} valley, and the solution can be left wandering around in No-Person's-Land for time immemorial. This is what happened in Fig. 2-5: the first set of corrections made go negative - a nonsensical result, as you can quickly convince yourself - and after that all hope was lost.
Enter number of terms 4 Enter starting guesses for C, B, R, and beta 4000 0 4 2 C B R beta m.e.1 4000.000 0.000 4.000 2.000 1271. 2228.132 -28.314 -0.957 -3.311 6228.131 -28.314 3.043 -1.311 850038. -14477.302 14345.458 -1.598 0.010 -8249.171 14317.145 1.445 -1.301 7368696. 4599.956 -4322.243 0.248 -0.001 -3649.215 9994.901 1.693 -1.302 2165603. -1300.995 1665.958 0.895 -0.003 -4950.210 11660.859 2.588 -1.305 990817. -1870.571 1273.029 1.406 -0.007 -6820.781 12933.889 3.994 -1.312 456283. -4125.703 3739.223 2.590 -0.020 -10946.484 16673.111 6.584 -1.332 214670. -10658.754 10257.783 5.536 -0.068 -21605.238 26930.895 12.120 -1.400 103538. -39657.239 39197.091 16.459 -0.327 -61262.477 66127.984 28.579 -1.727 52382. -381083.538 380654.934 136.898 -4.720 -442346.000 446782.906 165.477 -6.447 35487. ***********3878612.233 3886.715 -241.123 ***********4325395.000 4052.192 -247.570 20527. -626868.512 625786.342 3055.050 -110.527 ***********4951181.500 7107.243 -358.096 10700. 1237167.740*********** 17015.633 -1498.216 ***********3713886.750 24122.877 -1856.312 2843. 70332.360 -70223.023 7817.022 -210.053 ***********3643663.750 31939.898 -2066.365 1712. -5222.222 5214.010 -2028.465 896.271 ***********3648877.750 29911.434 -1170.094 1524. 107526.703-107510.670 -12482.410 843.641 ***********3541367.000 17429.023 -326.454 1584. 172736.120-172655.726 56596.851 -2267.746 ***********3368711.250 74025.875 -2594.200 2088. -4733.567 4925.535-209804.917 7603.578 Error Inverting matrix.
The way around the problem is for the human being to anticipate that this might happen and forewarn the computer to watch out for it. Stepping back from the problem once again, we remind ourselves that the first derivatives of ^{2} ought to be pretty good at predicting the direction of the parameter changes, but the step sizes may be untrustworthy. So any time it looks like the corrections may be erroneous, we tell the computer to apply smaller steps of the same sign. There probably is a technical term for these human-imposed limits on the allowable size of parameter corrections, but I don't know it; I call them "clamps." There are any number of ways that you can impose these clamps, but I have decided to illustrate two. For Fig. 2-6 I told the computer that if any particular set of parameter corrections caused ^{2} to increase - which we don't want - then those corrections should not be applied. Instead, they should be cut in half and the computer should see whether those corrections caused ^{2} to decrease. As you can see, this scheme worked fine. It kept the solution from shooting off into Neverland, and the true minimum was soon located. For Fig. 2-7 I imposed clamps of a different sort: I told the computer not to allow either R or to change by more than 20% in any given iteration. That is, I instructed it
I figured that C and B were well enough constrained by the data that, if R and could be kept in line, C and B would follow along without any trouble. As you can see from Fig. 2-7, this scheme worked fine, too. It took a few more iterations than the previous one (12 as compared to 9), but it has the desirable property that the estimates of R and can never pass through zero; give `em positive starting values and they'll stay positive. Both types of clamps lead to absolutely the same final answer as we got with the more fortunate set of starting guesses. A well-designed set of clamps will not affect the final answer, since for a reasonably well-conditioned problem like this one there really is only one place where the first derivative of ^{2} goes to zero for all parameters simultaneously (this statement is not true for certain other problems, such as binary-star orbits for instance, where there are more parameters, they are much more strongly interconnected, and good starting guesses are harder to come by). Clamps are required simply to keep the computer from going too far off the right track during those first few iterations of indecision.
This is a good place to stress that "clamps" or any other arbitrary scheme for coming up with parameter corrections do not mean that you are fudging the data in any way, shape, or form. The heart and soul of the method of least squares consist in finding the minimum of ^{2}: finding that set of parameter values for which the derivative of ^{2} with respect to each of the parameters goes to zero: finding that set of parameter values for which the column vector V goes to the null vector. This business of setting up the square matrix M and solving for a column vector A is one extremely convenient gimmick for getting corrections from a provisional parameter set to an improved parameter set. However, it doesn't matter where your parameter corrections come from - get them from a crystal ball for all I care (they have nice crystals here in Brazil). When the column vector V becomes the null vector, you have your official government-certified Least-Squares Solution just as good and pure as they come.
Enter number of terms 4 Enter starting guesses for C, B, R, and beta 4000 0 4 2 C B R beta m.e.1 4000.000 0.000 4.000 2.000 1271. 2228.132 -28.314 -0.957 -3.311 6228.132 -28.314 3.043 -1.311 850038. is larger than 1271. 1114.066 -14.157 -0.478 -1.656 5114.066 -14.157 3.522 0.344 1589. is larger than 1271. 557.033 -7.078 -0.239 -0.828 4557.033 -7.078 3.761 1.172 837. 1542.626 38.276 3.504 0.970 6099.659 31.198 7.265 2.143 128. 128.289 14.502 0.673 0.562 6227.948 45.700 7.938 2.704 36. 6.106 -10.627 0.274 0.207 6234.054 35.073 8.212 2.912 4. 0.232 -0.485 0.021 0.018 6234.286 34.588 8.233 2.930 3. -0.006 0.000 0.000 0.000 6234.279 34.588 8.233 2.930 3. 0.002 0.000 0.000 0.000 Converged 6234.281 34.588 8.233 2.930 +/-3.044 1.104 0.034 0.020
Enter number of terms 4 Enter starting guesses for C, B, R, amd beta 4000 0 4 2 C B R beta m.e.1 4000.000 0.000 4.000 2.000 1271. 2228.132 -28.314 -0.957 -3.311 -0.800 -0.400 6228.131 -28.314 3.200 1.600 897. -20.236 -60.539 -0.115 -1.613 -0.115 -0.320 6207.895 -88.852 3.085 1.280 762. -73.917 -22.592 1.468 -0.170 0.617 6133.978 -111.445 3.703 1.110 389. -59.829 169.027 2.698 0.774 0.741 0.222 6074.149 57.582 4.443 1.332 261. 67.728 -1.350 2.536 0.824 0.889 0.266 6141.876 56.232 5.332 1.598 159. 46.128 -1.911 2.154 0.821 1.066 0.320 6188.004 54.321 6.398 1.918 102. 28.998 -4.596 1.479 0.726 1.280 0.384 6217.002 49.725 7.678 2.301 170. 14.662 -8.717 0.259 0.422 6231.665 41.008 7.937 2.724 25. 2.404 -6.064 0.280 0.192 6234.069 34.944 8.217 2.916 4. 0.217 -0.357 0.016 0.014 6234.286 34.587 8.233 2.930 3. -0.005 0.000 0.000 0.000 6234.281 34.588 8.233 2.930 3. Converged 6234.281 34.588 8.233 2.930 +/- 3.044 1.104 0.034 0.020
^{1} By the same sort of change of variables that we did in the radioactive decay examples: let the known quantity be s_{i} (1 + r_{i}^{2} / R^{2})^{-}, and let the observed brightness value in the i-th annulus be s_{i}, so s_{i} C . r_{i} + B. This, of course, is nothing but a straight line. Back.
^{2} This is a great place for minus signs to trip you up. The left-hand side of the fitting equation is the negative of _{i} as defined by the convention I have been using: observation - model as opposed to model - observation. I find it really helpful, after writing an equation down, to stand back and look at it to see if it makes sense. If, for instance, s_{i} is greater than [C_{0} / (1 + r_{i}^{2} / R_{0}^{2})^{}] + B, then that means the left side of the fitting equation is positive. But if the observation is greater than the current model, one way to fix it is to increase the height of the model - that is, if the data tend to lie above the initial model, you expect the true height of the profile, , to be greater than the current guess, C_{0}. Therefore, C = - C_{0} should be greater than zero since ðS / ðC is always positive. So the right side of the equation is positive. Similar arguments work for B and R. So I didn't screw up the minus sign and I can go on. Back.