1. Fitting a model profile to a star image
[Figure 2-1]
In Fig. 2-1 I illustrate an observed radial intensity profile for a star. To obtain this profile I took the actual digital data for an isolated star in a CCD image which I happened to have, estimated the position of the star's centroid by eye, and obtained the average intensity within the star image as a function of the radial distance from the center by averaging the observed pixel values in narrow, concentric annuli. Now, it has been known for a long time that the so-called "Moffat" function is a very good representation of the radial profile of a star image (Moffat 1969; Buonanno et al. 1983). In its simplest useful form, the Moffat function is

S(r) = B + C (1 + r²/R²)^(-β),
where S stands for the Specific intensity (or, if you prefer, the Surface brightness) at some point located a radial distance r (= sqrt[(x - x₀)² + (y - y₀)²]) from the center of the stellar image. As you can see, besides the independent variable r, the Moffat function also involves four unknown parameters, which must be determined empirically. I denote them B, which is the Background diffuse sky brightness caused by emission from Earth's atmosphere, zodiacal light, unresolved background galaxies, and so on (i.e., B = lim(r→∞) S(r), as you can easily confirm); C, representing the star's Central surface brightness above the local background (i.e., C = S(r = 0) - B, isn't that so?); then there is R, which is a sort of characteristic Radius of the star image, determined by such broadening mechanisms as seeing and guiding errors; and finally there is β, which describes the asymptotic power-law slope of the wings of the image at large radial distances from the center (that is, for large r, S(r) - B → C (r²/R²)^(-β)). In particular, if β = 1, then the stellar profile is a Lorentz function, and R is the half-width at half maximum, since in this case S(R) - B = C / 2. (However, real star images cannot be perfect Lorentz profiles, since a Lorentz profile integrated out to infinite radius contains an infinite amount of flux. Of course, in real life we can't integrate a stellar profile to infinity; we can observe it only out to π/2 radians if the star is at the zenith! However, the Lorentz profile still has an unreasonable amount of flux out in the wings.)
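For concreteness, here is what the model looks like as a line of code. This is just my own minimal sketch in Python with NumPy (the function name and argument names are my choices, not anything from the lecture):

```python
import numpy as np

def moffat(r, C, B, R, beta):
    """Moffat profile: B is the diffuse sky background, C the central
    surface brightness above the background, R a characteristic radius,
    and beta the power-law slope of the wings."""
    return B + C * (1.0 + (r / R) ** 2) ** (-beta)
```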
For the sake of pursuing this first example, let us assume that the value of β is known. If we also assume that R is known, then the problem is linear in the unknowns C and B,(1) so we know how to do the least-squares solution. But let's take it to the next degree of difficulty, and assume that we want to solve for C, B, and the seeing parameter R all at once. How shall we do this? Well, we have to go back to the basics. The whole thing about "least squares" is you want to minimize the sum of the squares of the fitting residuals - that's how you get the maximum-likelihood answer, and that hasn't changed. Where does it occur that

χ² = Σᵢ wᵢ [sᵢ - S(rᵢ; C, B, R)]²

is minimized? Here sᵢ is the observed brightness in the i-th annulus, rᵢ is its radius, and wᵢ is its weight. Why, at the point where the first derivative of χ² with respect to every one of the fitting parameters is zero, simultaneously. Let's do it. (Remember, for the time being we're assuming that β is known.)

∂χ²/∂C = -2 Σᵢ wᵢ [sᵢ - S(rᵢ; C, B, R)] (1 + rᵢ²/R²)^(-β) = 0.

Similarly,

∂χ²/∂B = -2 Σᵢ wᵢ [sᵢ - S(rᵢ; C, B, R)] = 0,

and

∂χ²/∂R = -2 Σᵢ wᵢ [sᵢ - S(rᵢ; C, B, R)] · 2βC (rᵢ²/R³) (1 + rᵢ²/R²)^(-β-1) = 0.
The magical difference between this problem and the linear least-squares problems we looked at in the previous lecture is this: whereas before, when we took the derivatives ∂χ²/∂a, the derivatives of the model with respect to the fitting parameters did not contain any unknown fitting parameters, a - they depended only upon known quantities, t. Obviously, that's because those equations were linear in the fitting parameters, a. That meant that we were able to pull the constant (but unknown) fitting parameters in the ∂χ²/∂a's out of the summations. The summations that were left then included only known quantities; they could be computed; and the resulting system of equations - all of which were linear in the unknown parameters - could be solved for the a's. In our current problem, the ∂S/∂a's do contain a's; the unknowns can't be taken out of the summations; the summations can't be evaluated; and we can't solve for the parameters until after we know them.
The way out is to use one of the favorite tools of mathematical physicists (and statisticians like it, too!): the method of successive approximation. That is, we guess at approximate solutions for our three unknowns, C, B, and R; I will call these guesses C₀, B₀, and R₀. If we write the final, least-squares solution that we want so badly as S(r; Ĉ, B̂, R̂), we can approximate this by a Taylor expansion evaluated with our initial guesses:

S(r; Ĉ, B̂, R̂) ≈ S(r; C₀, B₀, R₀) + (∂S/∂C) ΔC + (∂S/∂B) ΔB + (∂S/∂R) ΔR,

where the derivatives are evaluated at the current guesses and ΔC = Ĉ - C₀, ΔB = B̂ - B₀, and ΔR = R̂ - R₀,
which leads to the set of fitting equations

sᵢ ≈ S(rᵢ; C₀, B₀, R₀) + (∂S/∂C)ᵢ ΔC + (∂S/∂B)ᵢ ΔB + (∂S/∂R)ᵢ ΔR,    i = 1, ..., N,

or, equivalently,

sᵢ - S(rᵢ; C₀, B₀, R₀) ≈ (∂S/∂C)ᵢ ΔC + (∂S/∂B)ᵢ ΔB + (∂S/∂R)ᵢ ΔR.

We may recognize that this set of equations is of the same form as the linear fitting equations of the previous lecture, whose least-squares solution is the matrix equation

M A = V,

where M is the square matrix of weighted sums of products of the derivatives,

M_jk = Σᵢ wᵢ (∂S/∂a_j)ᵢ (∂S/∂a_k)ᵢ,    a = (C, B, R),

and A and V are the column vectors

A = (ΔC, ΔB, ΔR),    V_j = Σᵢ wᵢ [sᵢ - S(rᵢ; C₀, B₀, R₀)] (∂S/∂a_j)ᵢ.
We already know how to solve these linear equations for the three unknown parameters ΔC, ΔB, and ΔR. When we crank up the old machinery we can compute the improved parameter values C₁ = C₀ + ΔC, B₁ = B₀ + ΔB, and R₁ = R₀ + ΔR.(2)
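To make the bookkeeping concrete, here is a sketch of one such linearized step in Python. This is my own illustration, not the lecture's actual program; `moffat` is the model function sketched earlier, and the weights default to unity, as in the worked example described below:

```python
import numpy as np

def linearized_step(r, s, C0, B0, R0, beta, w=None):
    """One linearized iteration: build M and V from the derivatives of S
    evaluated at the current guesses and solve M A = V for the
    corrections A = (dC, dB, dR)."""
    if w is None:
        w = np.ones_like(s)
    u = 1.0 + (r / R0) ** 2
    model = B0 + C0 * u ** (-beta)                # S(r; C0, B0, R0)
    dS_dC = u ** (-beta)
    dS_dB = np.ones_like(r)
    dS_dR = 2.0 * beta * C0 * r**2 / R0**3 * u ** (-beta - 1.0)
    J = np.column_stack([dS_dC, dS_dB, dS_dR])    # one row of derivatives per annulus
    resid = s - model                             # observation minus current model
    M = (w[:, None] * J).T @ J                    # square matrix of weighted sums
    V = J.T @ (w * resid)                         # column vector of weighted sums
    dC, dB, dR = np.linalg.solve(M, V)
    return dC, dB, dR
```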
Unfortunately, because our first-order Taylor expansion is only approximate, we can be reasonably certain that the new values C₁, B₁, and R₁ will not be exactly equal to the best values Ĉ, B̂, and R̂ - which is why I dropped the "^" notation and used the subscript 1 instead - and they probably won't even be "good enough for all practical purposes." Nevertheless, they should be better estimates than C₀, B₀, and R₀ were. So, we just plug them back into the equations, and try again. These new equations can now be solved for fresh values of ΔC, ΔB, and ΔR (which should mostly be smaller than the previous ones), and we may take C₂ = C₁ + ΔC, B₂ = B₁ + ΔB, and R₂ = R₁ + ΔR as still better estimates of the desired fitting parameters.
This whole process may be repeated as many times as necessary. Of course, as ΔC, ΔB, and ΔR become smaller and smaller, the Taylor expansion becomes better and better. The solution should quickly converge, and when ΔC, ΔB, and ΔR have all become negligibly small we have found the least-squares solution to our problem. This leads me to . . .
A GENERAL METHODOLOGY FOR NONLINEAR LEAST SQUARES
Step 1.
Write down your analytic equation, and derive its first derivative with respect to each of the unknown fitting parameters. In the example above,

∂S/∂C = (1 + r²/R²)^(-β),    ∂S/∂B = 1,    ∂S/∂R = 2βC (r²/R³) (1 + r²/R²)^(-β-1).
Step 2.
By any means available, guesstimate approximate values for the fitting parameters. For instance, in the example above you might take the largest observed value of sᵢ as your initial guess for C, and s_N as your initial guess at B. These guesses don't have to be "best" values - that's why we're doing the least squares. They just have to be vaguely all right (except sometimes).
Step 3.
For each data point, calculate the error of the initial model, and use these residuals to solve for corrections to the parameter estimates by ordinary linear least squares. In the example above, this step is represented by the equation

sᵢ - S(rᵢ; C₀, B₀, R₀) ≈ (∂S/∂C)ᵢ ΔC + (∂S/∂B)ᵢ ΔB + (∂S/∂R)ᵢ ΔR.
Step 4.
Add the parameter corrections which you have just computed to your initial guesses at the parameter values, to obtain improved parameter values.
Step 5.
Examine the parameter corrections which you computed in Step 3. How big are they? If any or all of them are larger than you would like, go back and do Steps 3 and 4 again. Keep doing Steps 3, 4, and 5 until all parameter corrections are negligibly small. When this is accomplished, you have achieved your desired least-squares solution!
Step 5 (alternate).
Using the current set of parameter estimates, calculate the m.e.1 of your data points with respect to the current model. In the notation of the example above, this would be

m.e.1 = sqrt( Σᵢ wᵢ [sᵢ - S(rᵢ; C, B, R)]² / (N - n) ),

where N is the number of data points and n is the number of parameters being fit.
Repeat Steps 3, 4, and 5 as long as the m.e.1 continues to get significantly smaller. When the root-mean-square residual no longer decreases with additional iterations, you have achieved your desired least-squares solution!
Whether you choose to adopt Step 5 or Step 5 (alternate) may depend upon why you want to solve the least-squares problem in the first place. If you are actually interested in the best numerical values for the fitting parameters, you will want to use Step 5. For instance, if you are fitting the model profile to the star image because you want to know how bright the star is, what you want is accurate values for C and R. Another example is the radioactive decay experiment from my last lecture, where you really wanted to know the actual abundances of each of the isotopes: if for some reason you found yourself doing iterated least squares (cf. Lecture 3) you would keep iterating until you got the parameter values as accurately as you could; you would adopt Step 5 as your convergence criterion. However, suppose you wanted to model a stellar surface-brightness distribution simply in order to interpolate the star's intensity profile to some radius between two of your ri values? In this case you don't really care what C and R are, you just want to know that the model represents the data to the required level of precision. Therefore you might adopt Step 5 (alternate). Similarly, in the problem where we transformed the coordinate systems of CCD frames yesterday: maybe you aren't interested in the actual values of the translation, scale, and rotation constants, you just want to be sure that the transformations are good to, say, 0.1 pixel for the average star; if you had to iterate the solution for some reason, you would stop iterating when and if the mean error of unit weight dropped significantly below this value.
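Tying Steps 2 through 5 together, a bare-bones driver loop might look something like the sketch below. Again, this is my own code rather than the program used in the lecture; it leans on the `moffat` and `linearized_step` routines sketched earlier, and the 0.001 tolerance is just an arbitrary illustrative choice:

```python
import numpy as np

def fit_moffat(r, s, C0, B0, R0, beta, w=None, tol=0.001, max_iter=50):
    """Iterate the linearized solution (Steps 3-5).  Stops when the
    correction to C drops below tol (Step 5); the m.e.1 is tracked so
    that the alternate criterion could be used instead."""
    if w is None:
        w = np.ones_like(s)
    C, B, R = C0, B0, R0
    me1 = np.inf
    for _ in range(max_iter):
        dC, dB, dR = linearized_step(r, s, C, B, R, beta, w)
        C, B, R = C + dC, B + dB, R + dR                      # Step 4: apply corrections
        resid = s - moffat(r, C, B, R, beta)
        me1 = np.sqrt(np.sum(w * resid**2) / (len(s) - 3))    # mean error of unit weight
        if abs(dC) < tol:                                     # Step 5: corrections negligible
            break
    return C, B, R, me1
```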
I guess this is a good time to ramble on philosophically about some of
the subtleties
involved here. Which brings me to my first statement of a theme which
will recur throughout
the rest of these lectures, to wit: computers may be awfully smart, but
computers are awfully
stupid! The computers available today, at least, may be really good at
arithmetic but they
aren't yet much for common sense. One of the best examples of this is that they don't know when to quit. If you tell the computer to go and compute sums and invert matrices until ΔC, ΔB, and ΔR go to zero, the computer will keep on going forever or until the next power failure, whichever comes first. Because of truncation error, ∂χ²/∂C probably will never go identically to zero: after an infinite amount of iterating, your current best value of C will still give you a finite value for ΔC, but try to improve your estimate of C by one quantum in the least significant bit, and you'll get another finite correction ΔC of the opposite sign. It's time for the astronomer/programmer to step in and impose a little common sense. When you write the program, you must decide ahead of time just how good you need your answer to be, and tell the computer that it is allowed to quit when it gets there.
As I said above, there are various ways this can be done in our profile-fitting
problem, and I'd like you to think about some of them for a moment: you
might tell
the computer to quit when the size of the correction to the central surface brightness drops to |ΔC| < 0.001 electrons per pixel; alternatively, since the likelihood is that you really want C in order to compute the star's magnitude by some formula like

magnitude = constant - 2.5 log C,

you could tell the computer to quit when |ΔC| < 0.001 C - when an iteration changes the star's brightness by less than a millimagnitude. Of course, for faint stars, you're never going to get an answer good to a millimagnitude anyway, just because of the noise, so maybe you'll tell the computer to quit when |ΔC| < 0.1 σ(ΔC), where σ(ΔC) can be computed by the method I gave you at the beginning of this lecture. In this case you're telling the computer to quit as soon as the correction becomes much smaller than the uncertainty of the correction = the uncertainty of the result. In
other cases it may
be most convenient simply to stop iterating when the m.e.1 stops
decreasing by, say, a
part in a thousand. Every situation is a little bit different, and
that's why I mostly write
my nonlinear least-squares programs from scratch every time I need to
solve a different
sort of problem. This is one of the areas where practical mathematics
becomes a creative
endeavor, where it borders on the artistic as well as the
scientific. You need to be able to
step back from the problem, see what its truly important aspects are,
and explain it to a dumb machine.
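To make those options concrete, the various stopping rules might be written out along these lines. This is only a sketch with names of my own choosing: `sigma_dC` stands for the uncertainty of the correction, taken from the matrix inverse as described at the beginning of the lecture, and `last_me1` is the m.e.1 from the previous iteration:

```python
def good_enough(dC, C, sigma_dC, me1, last_me1):
    """A few possible convergence tests; which one is appropriate
    depends on why you are doing the fit in the first place."""
    absolute = abs(dC) < 0.001                     # 0.001 electrons per pixel
    millimag = abs(dC) < 0.001 * abs(C)            # brightness change of order a millimagnitude
    vs_sigma = abs(dC) < 0.1 * sigma_dC            # correction much smaller than its uncertainty
    me1_flat = me1 > (1.0 - 0.001) * last_me1      # m.e.1 no longer dropping by a part in a thousand
    return absolute or millimag or vs_sigma or me1_flat
```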
Now, back to the example. Figs. 2-2 and 2-3 represent a nonlinear least-squares fit of a Lorentz function (a Moffat function with β = 1) to the observed profile data illustrated in Fig. 2-1. Fig. 2-2 is a photocopy of my actual computer printout documenting the convergence of the fit. I told the computer that I wanted to solve for three unknowns (C, B, and R), and I then provided it with starting guesses for the four parameters of the Moffat function; of course, β was to be held fixed at the value specified (β = 1), while the other three were destined to be improved by the nonlinear least-squares solution. My initial guesses for C, B, and R were 4000, 0, and 5, respectively,
and the computer took it from there. Succeeding pairs of lines show (upper) the current working estimates of the three fitting parameters C, B, and R, and (lower) the corrections ΔC, ΔB, and ΔR derived from those parameters. The last number on the upper line of each pair is the value of the m.e.1 derived from the differences between the data and the current model. (For this experiment I gave every data point unit weight; the m.e.1 therefore represents an estimate of the r.m.s. error of each raw datum = the brightness in a particular annulus. This isn't what I'd do if I were really serious; in actuality the weight should reflect the total readout noise and the Poisson noise in each annulus, but this is just for show so wotthehell, wotthehell.) The last two rows of numbers - under where it says "Converged" - represent the final derived parameter values and their uncertainties, calculated as described at the beginning of this lecture. I arbitrarily imposed a convergence criterion of 0.001 electrons per pixel in the value of ΔC. As you can see, this was undoubtedly too strict, since the computer could have stopped working several iterations sooner without producing significantly poorer answers, whether by the millimagnitude criterion, by |ΔC| << σ(ΔC), or by the leveling-off of χ².
Figure 2-2

Enter number of terms  3
Enter starting guesses for C, B, R, and beta  4000 0 5 1

        C         B       R   m.e.1
 4000.000     0.000   5.000    775.
 2757.100  -214.882  -1.947
 6757.100  -214.882   3.053    456.
 -208.755   -47.599   1.050
 6548.345  -262.481   4.104    164.
  268.906   -15.772  -0.065
 6817.251  -278.253   4.039    147.
   -3.557    -2.115   0.011
 6813.694  -280.368   4.050    147.
    0.725     0.349  -0.002
 6814.419  -280.019   4.049    147.
   -0.094    -0.047   0.000
 6814.325  -280.066   4.049    147.
    0.013     0.006   0.000
 6814.337  -280.060   4.049    147.
   -0.001    -0.001   0.000
Converged
 6814.336  -280.061   4.049
+/-  122.805    46.856    0.131
Fig. 2-3 shows the raw data as points, the "initial guess" at the solution as the dashed curve which peaks near 4000, the improved solution after a single iteration as the other dashed curve, and the final solution as the solid curve. I think you can see that, even with pretty horrible initial guesses at the solution, after one or two hops to get itself into the right ballpark, the solution converges like a bat out of . . ., the solution converges pretty quickly. Still, real aficionados would not be satisfied with this fit for two reasons: first, if you look carefully at Fig. 2-3 you'll see that the data points tend to fall consistently below the best solution for 8 ≲ r ≲ 15 and above it for r ≳ 20; second, the background sky brightness comes out significantly negative (at the 6σ level!) - a situation which you'll admit is pretty unusual for a real astronomical image.
[Figure 2-3]
What shall we do about this? One obvious solution is to admit that we were pretty foolish to assume that β = 1. Suppose β really has some different, unknown value? Why don't we solve for it? Can we really??? Sure, why not? (But you don't need to get so excited.) All we have to do is work out one (pretty hairy) derivative: ∂S/∂β. Well, I always have to go back to the CRC Tables for this one, but the derivative of a^u with respect to u,

d(a^u)/du = a^u ln a,

is what we need to work out. Let a → 1 + r²/R² and u → -β, so that S - B = C a^u; then ∂u/∂β → -1, and therefore

∂S/∂β = C a^u ln a · ∂u/∂β = -C (1 + r²/R²)^(-β) ln(1 + r²/R²).
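In code, the new derivative is just one more short routine (my own sketch, in the same notation as before); its values would be stacked as a fourth column of the derivative matrix in the linearized step, so that the solution returns four corrections instead of three:

```python
import numpy as np

def dS_dbeta(r, C, R, beta):
    """Partial derivative of the Moffat profile with respect to beta:
    -C * (1 + r**2/R**2)**(-beta) * ln(1 + r**2/R**2)."""
    u = 1.0 + (r / R) ** 2
    return -C * u ** (-beta) * np.log(u)
```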
So now we've got it knocked. We just perform our same old linearized nonlinear least-squares solution for four unknowns rather than three. The results of this exercise are shown in Fig. 2-4. As you can see, by merely solving for β rather than imposing a guess, we have improved the fit by a factor of 50. (Since, as you know, the standard error goes down as the square root of the number of observations, this is the same level of improvement as we might have gotten by reobserving the star 2500 times and averaging the results, except for the fact that β = 1 would probably always have been wrong, so even by reobserving 2500 times we wouldn't have gotten the improvement. Much better to get the model right.) I won't bother to illustrate the fit. Suffice it to say that the solid curve goes right through the middle of all the points like a string through beads, as the m.e.1 = 3 (3 parts in 6,000!) attests.
Figure 2-4

Enter number of terms  4
Enter starting guesses for C, B, R, and beta  4000 0 5 1

        C         B       R    beta   m.e.1
 4000.000     0.000   5.000   1.000    791.
 1946.211   355.223   2.606   1.338
 5946.210   355.223   7.606   2.338    397.
  285.621  -315.214   0.441   0.457
 6231.832    40.009   8.046   2.795     18.
    2.386    -5.279   0.180   0.129
 6234.218    34.730   8.226   2.924      3.
    0.065    -0.143   0.007   0.006
 6234.283    34.587   8.233   2.930      3.
   -0.002     0.001   0.000   0.000
 6234.281    34.588   8.233   2.930      3.
    0.000     0.000   0.000   0.000
Converged
 6234.281    34.588   8.233   2.930
+/-    3.044     1.104   0.034   0.020
Just for fun, I ran the four-parameter fit again with a different set of
starting guesses;
the printout from this reduction is shown as
Fig. 2-5. Oops. This is yet
another example
of the fundamental stupidity of the computer. I told the heap of dirty
silicon to evaluate
some algebraic derivatives, compute some sums, invert and multiply some
matrices, and
add the resulting numbers to the original numbers, and that's what it
did. It didn't notice
- or care - that the results were garbage; if I didn't warn it to watch
out for trouble it
sure wasn't going to notice trouble on its own. The problem here is that
sometimes the
first-order Taylor expansion is not quite good enough. Certainly, given
any set of fitting
parameters, you can compute the value of χ² corresponding to those parameters. And furthermore, the first derivative of χ² with respect to each of those fitting parameters will tell you which direction each fitting parameter must change to make χ² smaller. So in principle the parameter values ought to be able to slide down the χ² gradient all the way to the minimum. The problem with the first-order Taylor expansion is that it may not always correctly predict how much the parameter value should change during a given iteration. Sometimes it can overestimate a correction so badly that the provisional parameter values are shot straight out of the χ² valley, and the solution can be left wandering around in No-Person's-Land for time immemorial. This is what happened in Fig. 2-5: the first set of corrections made β go negative - a nonsensical result, as you can quickly convince yourself - and after that all hope was lost.
Figure 2-5

Enter number of terms  4
Enter starting guesses for C, B, R, and beta  4000 0 4 2

          C            B          R       beta     m.e.1
    4000.000        0.000      4.000      2.000     1271.
    2228.132      -28.314     -0.957     -3.311
    6228.131      -28.314      3.043     -1.311   850038.
  -14477.302    14345.458     -1.598      0.010
   -8249.171    14317.145      1.445     -1.301  7368696.
    4599.956    -4322.243      0.248     -0.001
   -3649.215     9994.901      1.693     -1.302  2165603.
   -1300.995     1665.958      0.895     -0.003
   -4950.210    11660.859      2.588     -1.305   990817.
   -1870.571     1273.029      1.406     -0.007
   -6820.781    12933.889      3.994     -1.312   456283.
   -4125.703     3739.223      2.590     -0.020
  -10946.484    16673.111      6.584     -1.332   214670.
  -10658.754    10257.783      5.536     -0.068
  -21605.238    26930.895     12.120     -1.400   103538.
  -39657.239    39197.091     16.459     -0.327
  -61262.477    66127.984     28.579     -1.727    52382.
 -381083.538   380654.934    136.898     -4.720
 -442346.000   446782.906    165.477     -6.447    35487.
 ***********   3878612.233   3886.715   -241.123
 ***********   4325395.000   4052.192   -247.570   20527.
 -626868.512    625786.342   3055.050   -110.527
 ***********   4951181.500   7107.243   -358.096   10700.
 1237167.740   ***********  17015.633  -1498.216
 ***********   3713886.750  24122.877  -1856.312    2843.
   70332.360    -70223.023   7817.022   -210.053
 ***********   3643663.750  31939.898  -2066.365    1712.
   -5222.222      5214.010  -2028.465    896.271
 ***********   3648877.750  29911.434  -1170.094    1524.
  107526.703   -107510.670 -12482.410    843.641
 ***********   3541367.000  17429.023   -326.454    1584.
  172736.120   -172655.726  56596.851  -2267.746
 ***********   3368711.250  74025.875  -2594.200    2088.
   -4733.567      4925.535 -209804.917   7603.578
Error Inverting matrix.
The way around the problem is for the human being to anticipate that this might
happen and forewarn the computer to watch out for it. Stepping back from
the problem
once again, we remind ourselves that the first derivatives of χ² ought to be pretty good at predicting the direction of the parameter changes, but the step sizes may be untrustworthy. So any time it looks like the corrections may be erroneous, we tell the computer to apply smaller steps of the same sign. There probably is a technical term for these human-imposed limits on the allowable size of parameter corrections, but I don't know it; I call them "clamps." There are any number of ways that you can impose these clamps, but I have decided to illustrate two. For Fig. 2-6 I told the computer that if any particular set of parameter corrections caused χ² to increase - which we don't want - then those corrections should not be applied. Instead, they should be cut in half and the computer should see whether those corrections caused χ² to decrease. As you can see, this scheme worked fine. It kept the solution from shooting off into Neverland, and the true minimum was soon located. For Fig. 2-7 I imposed clamps of a different sort: I told the computer not to allow either R or β to change by more than 20% in any given iteration. That is, I instructed it

if |ΔR| > 0.2 R, replace ΔR with ±0.2 R (keeping the sign of the computed ΔR), and likewise for Δβ.

I figured that C and B were well enough constrained by the data that, if R and β could be kept in line, C and B would follow along without any trouble. As you can see from Fig. 2-7, this scheme worked fine, too. It took a few more iterations than the previous one (12 as compared to 9), but it has the desirable property that the estimates of R and β can never pass through zero; give 'em positive starting values and they'll stay positive. Both types of clamps lead to absolutely the same final answer as we got with the more fortunate set of starting guesses. A well-designed set of clamps will not affect the final answer, since for a reasonably well-conditioned problem like this one there really is only one place where the first derivative of χ² goes to zero for all
parameters simultaneously (this statement is
not true for certain other problems, such as binary-star orbits for
instance, where there are
more parameters, they are much more strongly interconnected, and good
starting guesses
are harder to come by). Clamps are required simply to keep the computer
from going too
far off the right track during those first few iterations of indecision.
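Here is roughly how the two clamping schemes could be written. This is a sketch only; `chisq` stands for whatever routine evaluates the weighted sum of squared residuals for a trial parameter set, and the function names are mine:

```python
def step_with_halving(params, corrections, chisq, max_halvings=10):
    """Scheme of Fig. 2-6: if a trial step makes chi-squared worse, do not
    apply it; cut the corrections in half and try again."""
    old = chisq(params)
    for _ in range(max_halvings):
        trial = [p + dp for p, dp in zip(params, corrections)]
        if chisq(trial) <= old:
            break
        corrections = [0.5 * dp for dp in corrections]
    return trial

def clamp_20_percent(value, correction):
    """Scheme of Fig. 2-7: never let a parameter (here R or beta) change by
    more than 20% of its current value, so it can never pass through zero."""
    limit = 0.2 * abs(value)
    return max(-limit, min(limit, correction))
```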
This is a good place to stress that "clamps" or any other arbitrary
scheme for coming
up with parameter corrections do not mean that you are fudging the data
in any way, shape,
or form. The heart and soul of the method of least squares consists in finding the minimum of χ²: finding that set of parameter values for which the derivative of χ² with respect to each of the parameters goes to zero: finding that set of parameter values for which the column
vector V goes to the null vector. This business of setting up the square
matrix M and
solving for a column vector A is one extremely convenient gimmick for
getting corrections from a provisional parameter set to an improved
parameter set. However, it doesn't matter
where your parameter corrections come from - get them from a crystal
ball for all I care
(they have nice crystals here in Brazil). When the column vector V
becomes the null vector,
you have your official government-certified Least-Squares Solution just
as good and pure as they come.
Figure 2-6

Enter number of terms  4
Enter starting guesses for C, B, R, and beta  4000 0 4 2

        C         B       R    beta   m.e.1
 4000.000     0.000   4.000   2.000   1271.
 2228.132   -28.314  -0.957  -3.311
 6228.132   -28.314   3.043  -1.311  850038. is larger than 1271.
 1114.066   -14.157  -0.478  -1.656
 5114.066   -14.157   3.522   0.344    1589. is larger than 1271.
  557.033    -7.078  -0.239  -0.828
 4557.033    -7.078   3.761   1.172     837.
 1542.626    38.276   3.504   0.970
 6099.659    31.198   7.265   2.143     128.
  128.289    14.502   0.673   0.562
 6227.948    45.700   7.938   2.704      36.
    6.106   -10.627   0.274   0.207
 6234.054    35.073   8.212   2.912       4.
    0.232    -0.485   0.021   0.018
 6234.286    34.588   8.233   2.930       3.
   -0.006     0.000   0.000   0.000
 6234.279    34.588   8.233   2.930       3.
    0.002     0.000   0.000   0.000
Converged
 6234.281    34.588   8.233   2.930
+/-    3.044     1.104   0.034   0.020
Figure 2-7

Enter number of terms  4
Enter starting guesses for C, B, R, amd beta  4000 0 4 2

        C         B       R    beta   m.e.1
 4000.000     0.000   4.000   2.000   1271.
 2228.132   -28.314  -0.957  -3.311
                     -0.800  -0.400
 6228.131   -28.314   3.200   1.600    897.
  -20.236   -60.539  -0.115  -1.613
                     -0.115  -0.320
 6207.895   -88.852   3.085   1.280    762.
  -73.917   -22.592   1.468  -0.170
                      0.617
 6133.978  -111.445   3.703   1.110    389.
  -59.829   169.027   2.698   0.774
                      0.741   0.222
 6074.149    57.582   4.443   1.332    261.
   67.728    -1.350   2.536   0.824
                      0.889   0.266
 6141.876    56.232   5.332   1.598    159.
   46.128    -1.911   2.154   0.821
                      1.066   0.320
 6188.004    54.321   6.398   1.918    102.
   28.998    -4.596   1.479   0.726
                      1.280   0.384
 6217.002    49.725   7.678   2.301    170.
   14.662    -8.717   0.259   0.422
 6231.665    41.008   7.937   2.724     25.
    2.404    -6.064   0.280   0.192
 6234.069    34.944   8.217   2.916      4.
    0.217    -0.357   0.016   0.014
 6234.286    34.587   8.233   2.930      3.
   -0.005     0.000   0.000   0.000
 6234.281    34.588   8.233   2.930      3.
Converged
 6234.281    34.588   8.233   2.930
+/-    3.044     1.104   0.034   0.020
1 By the same sort of change of variables that we did in the radioactive decay examples: let the known quantity be xᵢ = (1 + rᵢ²/R²)^(-β), and let the observed brightness value in the i-th annulus be sᵢ, so sᵢ ≈ C xᵢ + B. This, of course, is nothing but a straight line.
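As a concrete illustration of that change of variables (my own sketch, with NumPy's straight-line fit standing in for the least-squares machinery of the previous lecture):

```python
import numpy as np

def fit_C_and_B(r, s, R, beta):
    """With R and beta held fixed, s is linear in x = (1 + r**2/R**2)**(-beta):
    s ~ C*x + B, so an ordinary straight-line fit gives C (slope) and B (intercept)."""
    x = (1.0 + (r / R) ** 2) ** (-beta)
    C, B = np.polyfit(x, s, 1)
    return C, B
```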
2 This is a great place for minus signs to trip you up. The left-hand side of the fitting equation is the negative of the residual as defined by the convention I have been using: observation - model as opposed to model - observation. I find it really helpful, after writing an equation down, to stand back and look at it to see if it makes sense. If, for instance, sᵢ is greater than C₀ (1 + rᵢ²/R₀²)^(-β) + B₀, then that means the left side of the fitting equation is positive. But if the observation is greater than the current model, one way to fix it is to increase the height of the model - that is, if the data tend to lie above the initial model, you expect the true height of the profile, Ĉ, to be greater than the current guess, C₀. Therefore, ΔC = Ĉ - C₀ should be greater than zero since ∂S/∂C is always positive. So the right side of the equation is positive. Similar arguments work for B and R. So I didn't screw up the minus sign and I can go on.