# Statistical evaluation problem

[From Bill Powers (961126.1200 MST)]

I have a problem that people on CSGnet might be able to help me solve. How
do we objectively evaluate the goodness of fit between a model's behavior
and real behavior?

For years we've been using correlations as a rough way of showing how well a
model fits tracking behavior. This started mostly in order to compare the
results of using a control model with other approaches to behavior which
traditionally use correlations as a way of conveying how well a theory fits
the facts. But we've known all along that correlations really don't have
much meaning in this application; the differences between models and real
performance aren't distributed according to any standard distribution, and
as several people have pointed out the data points being compared aren't
temporally independent.

Those are relatively minor problems with correlations compared to the one I
have now. Bruce and I have been fitting models to various data obtained in
operant behavior experiments. One of these models describes what is, from the
standpoint of a weight control system, an environmental feedback function or
EFF. It is the function that converts any pattern of daily food intake into
the animal's weight, both being functions of time.

What we get from the model is a predicted waveform of weights that is
generated by the observed waveform of total food intakes. For rat 1, the
weights go as low as 175 grams and as high as 290 grams. Over the times of
interest (after initial deprivation has been compensated for), the range is
even less -- say from 250 grams to 290 grams.

The model fits the observed weight values within about 4 grams RMS. The
correlation of model weight with real weight is in the high 0.9s -- around
0.96 or 0.98 depending on the parameters taken to be the best fit.

But suppose that the model's weight, while following the variations in the
real weight, had been everywhere 100 grams too low or too high. The
correlation would remain exactly the same! This follows because the first
step in calculating a correlation is to remove the mean values from the data
arrays being correlated. If the model's output is systematically too high or
too low by a constant amount, the correlation will be unchanged.

The same holds true if the model variations are all too great or too small
by some constant factor. The correlation process normalizes the data, so
that any constant factors disappear from the result.
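
The invariance described above can be checked numerically. What follows is a minimal Python sketch, not the analysis actually run on the rat data: the weight values are made up for illustration, and `pearson_r` and `rms_error` are hypothetical helper names.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: remove the means, then normalize by the spreads."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

def rms_error(x, y):
    """Root-mean-square difference between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Illustrative numbers only (grams); NOT data from the experiment.
real = [252.0, 260.0, 271.0, 268.0, 280.0, 290.0, 284.0]
model = [255.0, 258.0, 268.0, 272.0, 277.0, 287.0, 288.0]

# Same model output, but everywhere 100 grams too low.
offset = [m - 100.0 for m in model]

# Same model output, but with variations about its mean 3 times too large.
mean_model = sum(model) / len(model)
stretched = [mean_model + 3.0 * (m - mean_model) for m in model]

for name, series in [("model", model), ("offset", offset), ("stretched", stretched)]:
    r = pearson_r(real, series)
    err = rms_error(real, series)
    print(f"{name:10s} r = {r:.4f}   RMS error = {err:.1f} g")
```

All three series give the same correlation with the real weights, while the RMS error goes from a few grams for the unmodified model to about 100 grams for the offset one.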

The upshot is that a model can appear to fit the data extremely well, in
terms of correlations, when it is actually in error by a huge amount.

The opposite problem also occurs. In matching the weight control model to
the data, one of the things we want it to predict is daily food intake in
the home and experimental cages. The average intake is something like 15
grams per day, with variations above and below that level as conditions
change (for example, absence of food in the home cage). The model's total
food intake correlates with the actual total food intake only about 0.8 to
0.9. But in calculating this correlation, no account is taken of the fact
that the model predicts the _mean_ food intake better than it predicts the
variations in food intake. The mean values are removed by the correlation
calculation, so the correlation reflects only how well the _variations_ in
model intake match _variations_ in the real intake. The model gets no credit
for predicting the mean values correctly. In fact, _too much_ credit is
given for matching the variations in food intake, because the model's
variations are generally visibly smaller than those of the real intake,
although proportional to them, a fact that the correlation calculation
doesn't pick up. And too little credit -- none at all -- is given for the
fact that the model predicts the 13-gram mean value of intake very
accurately, instead of predicting, for example, 2 grams or 200 grams.

I used this analogy in discussing this with the rat group. Suppose a model
predicts that a car will go 20,000 feet and then stop. We observe the real
car, on successive trials, travelling 20,000 feet plus or minus 20 feet.
Since the model's predictions are the same every time, when we calculate the
correlation of the model's predictions against reality, we get a correlation
of ZERO. Yet the model has predicted the real distance of travel within 0.1%
of the observed value over all the trials.
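
The car analogy can be put in code as a rough sketch. The trial distances below are invented, and `covariance_and_r` is a hypothetical helper. One caveat: when the predictions never vary, their variance is zero, so the correlation formula divides zero by zero and the coefficient is strictly undefined. The covariance is exactly zero, which is the sense in which the model gets no credit. The sketch returns `None` for the undefined case.

```python
import math

def covariance_and_r(x, y):
    """Return (covariance, Pearson r); r is None when either series is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    r = cov / (sx * sy) if sx > 0 and sy > 0 else None
    return cov, r

# Illustrative trials only: the model always predicts 20,000 feet,
# and the made-up observations stay within 20 feet of that.
predicted = [20000.0] * 6
observed = [20012.0, 19985.0, 20020.0, 19991.0, 20003.0, 19980.0]

cov, r = covariance_and_r(predicted, observed)

# Largest miss, expressed as a percentage of the predicted distance.
worst_pct = max(abs(o - p) / p for o, p in zip(observed, predicted)) * 100.0

print("covariance:", cov)
print("correlation:", r)
print(f"largest error: {worst_pct:.2f}% of the predicted distance")
```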

So my question is simple. Is there any standard way, akin to a correlation,
of comparing two data sets for goodness of fit that takes into account both
constant offsets and proportionality factors as measures of error?

Best,

Bill P.

[from Jeff Vancouver 961127.09:30 EST]

[From Bill Powers (961126.1200 MST)]

I have a problem that people on CSGnet might be able to help me solve. How
do we objectively evaluate the goodness of fit between a model's behavior
and real behavior?

:

So my question is simple. Is there any standard way, akin to a correlation,
of comparing two data sets for goodness of fit that takes into account both
constant offsets and proportionality factors as measures of error?

It does seem some of us more conventional types should know. I have two
suggestions, one easy, one hard. The easy one is to use D**2 (see
Cronbach and Gleser, 1953, Psych Bull, 50, 456-473). D**2 is the squared
difference between corresponding values in the two data sets.
However, instead of averaging it, they sum it. Does not matter. The
issue is that there is no set metric (it depends on the scale and, if
summing, the number of observations). But this is easy to solve, it seems
to me. The issue is to determine the range of possible values. Perfect
fit would be zero. A perfectly horrible fit might depend on the
nature of the study. It could be based on a comparison to random
fluctuations in the scale, or to no fluctuations at all. Once that is
determined, you could use your scaling (and # of obs) to determine the
worst-fit end of the range. Then simply take a proportion, putting the
result on the same scale as a correlation. There are probably more Psych
Bull articles on these issues (concerned, I would guess, with the nature
of the distributions involved) - I am not up to speed.
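
One possible reading of this D**2 suggestion, as a minimal Python sketch. The function names `d_squared` and `fit_proportion`, the weight values, and the choice of worst-case reference are all assumptions made for illustration: here "no fluctuations" is taken to mean a flat series held at the observed mean. With that particular reference the index works out to the familiar coefficient of determination, but any other reference series could be passed in instead.

```python
def d_squared(observed, predicted):
    """D**2: squared differences between corresponding values, summed (not averaged)."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

def fit_proportion(observed, predicted, worst_case):
    """Rescale D**2 against a worst-case reference series.

    1.0 means a perfect fit, 0.0 means no better than the reference,
    and negative values mean the model is worse than the reference.
    """
    d2_model = d_squared(observed, predicted)
    d2_worst = d_squared(observed, worst_case)
    if d2_worst == 0:
        raise ValueError("reference series fits perfectly; index is undefined")
    return 1.0 - d2_model / d2_worst

# Illustrative numbers only (grams); NOT data from the experiment.
real = [252.0, 260.0, 271.0, 268.0, 280.0, 290.0, 284.0]
model = [255.0, 258.0, 268.0, 272.0, 277.0, 287.0, 288.0]
offset = [m - 100.0 for m in model]  # same shape, 100 grams too low

# Assumed worst case: a series with no fluctuations, held at the observed mean.
flat = [sum(real) / len(real)] * len(real)

print("model :", round(fit_proportion(real, model, flat), 3))
print("offset:", round(fit_proportion(real, offset, flat), 3))
```

Unlike a correlation, this index penalizes the constant 100-gram offset: the offset series scores far below zero, while the unmodified model stays close to 1.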

The hard way, but much more informative, is to use time series analysis
techniques (e.g., ARIMA). These provide goodness of fit indices and
other niceties. There is a huge literature on how to do these and
several of the better statistical packages provide the analysis
capacity. My knowledge is of the procedures; I have not actually used
any of them. Warning: they try to assess and control certain types of
assumptions, but fail to solve others.

I hope this helps.

Jeff

P.S. I am planning to run more spiral today and then get back to you on that
experiment.