Modelling Figure of Merit

[Martin Taylor 2001.10.16.15:09]

I raised this question a few years ago, but it was never resolved effectively.

I am trying to fit a simple control model to real data--it doesn't
matter what; this question is better treated in the abstract. The
"classical" approach is to adjust the model parameters until the
modelled data correlate maximally with the experimental data.

On the face of it, this seems fine, until one realizes two things:
(1) that good correlation can occur because both the human and the
model track well, even if they happen to use different control
structures, and (2) one can get 1.0 correlation between two waveforms
that differ greatly in amplitude.
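
To make point (2) concrete, here is a minimal numeric sketch (assuming
numpy is available; the waveforms are purely illustrative):

    import numpy as np

    t = np.linspace(0.0, 10.0, 1000)
    human = np.sin(t)            # stand-in for the human track
    model = 3.0 * np.sin(t)      # same shape, three times the amplitude

    r = np.corrcoef(human, model)[0, 1]
    rms_diff = np.sqrt(np.mean((human - model) ** 2))
    print(r, rms_diff)           # r = 1.0, yet the RMS difference is large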

Checking this out in my study, I have been using six measures. There
are three waveforms in question, the target to be tracked, the human
track and the model track. I determine the three pairwise RMS
differences and the three pairwise correlations. For a model to be a
believable representation of the human, it seems to me that the model
track must be closer to the human track than to the target track. This
"closeness" can be measured both by comparing the model-to-target RMS
with the model-to-human RMS and by comparing the same two
correlations.
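
In code, the six measures might be computed along the following lines.
This is only a sketch, assuming numpy and equal-length, time-aligned
waveforms; the function names are mine, not from any existing analysis
program:

    import numpy as np
    from itertools import combinations

    def rms_difference(a, b):
        # root-mean-square difference between two equal-length waveforms
        return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

    def pairwise_measures(tracks):
        # tracks: dict mapping "target", "human", "model" to a waveform;
        # returns the three pairwise RMS differences and three correlations
        out = {}
        for (name_a, a), (name_b, b) in combinations(tracks.items(), 2):
            out[(name_a, name_b)] = {
                "rms": rms_difference(a, b),
                "r": np.corrcoef(a, b)[0, 1],
            }
        return out

Calling pairwise_measures({"target": target, "human": human, "model":
model}) then gives the six numbers for one run.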

Ideally, if the human-to-target correlation is, say, 0.93, we would
want the model-to-human correlation to be greater than 0.93 and also
greater than the model-to-target correlation. We also want similar
relationships among the RMS difference values. But I find in my
modelling attempts that these do not go together very well. By
varying the model parameters, I can get the usual high correlations
between model and human, but very often for those parameter values
the RMS difference between human and model is large. Parameter values
that improve the RMS ratio (human-model)/(target-model) often give
poor values of the correlation comparison (human-model
correlation)/(target-model correlation).

It's easy enough to see geometrically what is happening here. But it
does not inspire confidence in the use of correlation alone as a
figure of merit for fitting control models to real data. My question
really points in two directions: (1) What is a better figure of merit
than simple correlation to optimize when adjusting parameters for a
model of fixed structure, and (2) is there any background literature
or understanding that would suggest structural modification based on
the way the fitting surfaces in RMS and correlation relate to one
another.

Clearly (to me) a good figure of merit must involve not only the
closeness of the model track to the human track, but also the degree
to which that closeness might be due simply to the fact that both are
good controllers, rather than being the same controller. In other
words, a correlation of 0.95 between model and human tells very
little if the correlation between human and target tracks is 0.97,
unless the RMS distance between human and model is substantially less
than between target and model (which would indicate that human and
model make the same kinds of error).

Technical or conceptual suggestions are welcome.

Martin

[From Rick Marken (2001.10.16.1410)]

Martin Taylor (2001.10.16.15:09)

I raised this question a few years ago, but it was never resolved effectively.

Clearly (to me) a good figure of merit must involve not only the
closeness of the model track to the human track, but also the degree
to which that closeness might be due simply to the fact that both are
good controllers, rather than being the same controller. In other
words, a correlation of 0.95 between model and human tells very
little if the correlation between human and target tracks is 0.97,
unless the RMS distance between human and model is substantially less
than between target and model (which would indicate that human and
model make the same kinds of error).

I agree. My suggestion is to collect more data. The "figure of merit" would be
measured in terms of the model that fits _all_ the data with least required
changes in parameters (parameter changes across individuals would not count). I
think this is the way Kepler and Newton did it.

Best regards

Rick

···

--
Richard S. Marken, Ph.D.
The RAND Corporation
PO Box 2138
1700 Main Street
Santa Monica, CA 90407-2138
Tel: 310-393-0411 x7971
Fax: 310-451-7018
E-mail: rmarken@rand.org

[From Rick Marken (2001.10.16.1430)]

Me to Martin --

I agree. My suggestion is to collect more data. The "figure of merit" would be
measured in terms of the model that fits _all_ the data with least required
changes in parameters (parameter changes across individuals would not count).
I think this is the way Kepler and Newton did it.

I should say that I mean collecting more data under _different_ experimental
conditions. I don't think getting more of the same kind of data (more degrees of
freedom for model fitting) is the answer. I think a pretty good example of what
I mean is provided by the "Levels of Intention...", "Perceptual organization..."
and "Degrees of freedom..." papers in _Mind Readings_. All three papers show how
one model, with no parameter changes, explains the behavior of subjects in two
or three different experimental situations.

Best regards

Rick

···


[From Bill Powers (2001.10.17.0704 MDT)]

Martin Taylor 2001.10.16.15:09 --

Welcome back, Martin.

I raised this question a few years ago, but it was never resolved effectively.

I am trying to fit a simple control model to real data--it doesn't
matter what; this question is better treated in the abstract. The
"classical" approach is to adjust the model parameters until the
modelled data correlate maximally with the experimental data.

Ideally, if the human-to-target correlation is, say, 0.93, we would
want the model-to-human correlation to be greater than 0.93 and also
greater than the model-to-target correlation. We also want similar
relationships among the RMS difference values. But I find in my
modelling attempts that these do not go together very well. By
varying the model parameters, I can get the usual high correlations
between model and human, but very often for those parameter values
the RMS difference between human and model is large. Parameter values
that improve the RMS ratio (human-model)/(target-model) often give
poor values of the correlation comparison (human-model
correlation)/(target-model correlation).

I noted the same problem some years ago. As you say, we want the model to
behave like a human being, but it's difficult to distinguish such a model
from a perfect controller, because human beings come so close to being
perfect controllers (within a frequency band from zero to one hertz).

I originally used the correlation measure because I was presenting data to
psychologists, and they were used to seeing that sort of measure. For my
own purposes, I have always thought of RMS error as more appropriate, that
being a direct measure of how well the model predicts the actual behavior.
The best measure is a moment-by-moment comparison because it shows in
detail how the model's behavior differs from that of the human being. My
models, for example, typically show less high-frequency tracking error than
a human being does. Adding a perceptual lag to the model partly remedies
that shortcoming. I suspect that adding nonlinearities will also help, but
have never experimented with that. In some cases, differences between model
and person were due to the person's still learning to control better -- the
model assumes constant parameters during a run, but while learning is still
incomplete there can be significant changes even during a one-minute run.
When runs are too long, fatigue and inattention can also cause performance
changes during the run; the model, of course, does not fatigue nor can its
attention wander. (Attention is another question for research, of course).
Of course some differences are just due to noise in the person that is not
in the model (see below).
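
For anyone who wants something runnable to experiment with, here is a
minimal sketch of a tracking model of the general kind under discussion:
a leaky-integrator output function with a pure transport lag on the
perception. The structure and all parameter values are illustrative
assumptions, not anyone's fitted model:

    import numpy as np

    def run_model(target, disturbance, gain=8.0, slowing=2.5,
                  lag_samples=8, dt=1.0 / 60.0):
        # target, disturbance: arrays sampled every dt seconds
        # gain, slowing, lag_samples: the parameters one would fit to a subject
        n = len(target)
        output = 0.0
        cursor = np.zeros(n)
        for i in range(1, n):
            j = max(0, i - lag_samples)        # delayed perception
            error = target[j] - cursor[j]      # perceived target-cursor gap
            output += dt * (gain * error - output) / slowing   # leaky integrator
            cursor[i] = output + disturbance[i]   # handle output plus disturbance
        return cursor

With lag_samples set to zero the model tracks more smoothly than a person
does; a lag on the order of a hundred milliseconds makes its
high-frequency errors more human-like, as described above.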

A sensitivity analysis would reveal the true nature of the problem. The
fact is that the performance of a control system with reasonably high gain
is simply insensitive to variations in many system parameters, particularly
the parameters of the output function. This is one reason why the control
strategy has a large evolutionary advantage over any plan-and-execute
strategy. But it makes determination of the real parameters difficult when
accuracy of control is used as the measure of performance. For a control
system with a loop gain of 1000, a 50% drop in output gain would make the
error double -- but the output would change from 0.999 of the disturbance
to 0.998 of the disturbance, which would be experimentally undetectable.
Visual-motor tracking is actually at least that accurate at low
frequencies, being limited mainly by visual acuity when disturbances change
only slowly. This says, by the way, that matching the _errors_ would be
much more sensitive to parameter changes than matching the _outputs_.
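
A quick numeric check of that claim, using the usual steady-state
relations for a proportional negative-feedback loop (output fraction
G/(1+G), error fraction 1/(1+G)):

    for loop_gain in (1000.0, 500.0):       # a 50% drop in loop gain
        output_fraction = loop_gain / (1.0 + loop_gain)
        error_fraction = 1.0 / (1.0 + loop_gain)
        print(loop_gain, round(output_fraction, 4), round(error_fraction, 4))
    # 1000 -> output 0.999 of the disturbance, error 0.001
    #  500 -> output 0.998 of the disturbance, error 0.002 (the error doubles)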

On the other hand, small changes in performance can reflect quite large
changes in system parameters, so if you're investigating, say, the effect
of drugs or fatigue on performance, any small performance change you can
reliably see indicates a large change in the behaving system. Just ask how
much a model parameter would have to change to have the same magnitude of
effect.

It's easy enough to see geometrically what is happening here. But it
does not inspire confidence in the use of correlation alone as a
figure of merit for fitting control models to real data. My question
really points in two directions: (1) What is a better figure of merit
than simple correlation to optimize when adjusting parameters for a
model of fixed structure, and (2) is there any background literature
or understanding that would suggest structural modification based on
the way the fitting surfaces in RMS and correlation relate to one
another.

I strongly recommend against using correlation as a figure of merit. It's
mainly useful as a way of contrasting the results of control experiments
with the kinds of experimental results usually obtained in the behavioral
sciences (at the St. Louis meeting, we were regaled with a report of an
explanation that accounted for 7% (and I typed that very carefully -- seven
per cent) of the variance).

You might find it useful to measure RMS errors as a function of frequency
(in the analysis, not during the experiment). Just compute the prediction
error as a function of time and then do a Fourier analysis of it. I would
predict a curve that starts close to zero and rises with frequency.
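
One way to do that, assuming numpy and evenly sampled tracks (the
function name is just for illustration):

    import numpy as np

    def error_spectrum(human, model, dt):
        # amplitude spectrum of the prediction error (human minus model);
        # dt is the sampling interval in seconds
        err = np.asarray(human) - np.asarray(model)
        spectrum = np.fft.rfft(err - err.mean())
        freqs = np.fft.rfftfreq(len(err), d=dt)   # frequencies in Hz
        return freqs, np.abs(spectrum) / len(err)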

Another profitable analysis would be to test the RMS error for randomness.
If it's completely random, this would say that the person simply contains a
noise source that is absent from the model. I know that my own error traces
contain a periodic component (see the paper in Hershberger's book), but
that probably reflects a small genetic tremor I have always had. If the
errors are random, there wouldn't be much hope of making a model generate
the _same_ random variations. You could always add a random noise source,
of course, but not the same one.
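
One simple check for randomness (certainly not the only one) is the
autocorrelation of the prediction error: a purely random residual should
show near-zero correlation at every nonzero lag, while a periodic
component shows up as a peak at its period. A sketch, again assuming
numpy:

    import numpy as np

    def error_autocorrelation(human, model, max_lag):
        # normalized autocorrelation of the prediction error at lags 1..max_lag
        err = np.asarray(human) - np.asarray(model)
        err = err - err.mean()
        denom = np.sum(err ** 2)
        return np.array([np.sum(err[:-k] * err[k:]) / denom
                         for k in range(1, max_lag + 1)])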

In other
words, a correlation of 0.95 between model and human tells very
little if the correlation between human and target tracks is 0.97,
unless the RMS distance between human and model is substantially less
than between target and model (which would indicate that human and
model make the same kinds of error).

This was one point of the original "Models and their worlds" paper, which
is going to be published as part of Phil Runkel's forthcoming (some day)
book. If you're testing whether you have the right _kind_ of model, this
sort of match is highly significant -- when alternative models show
correlations of 0.2 or 0.5 with the measured output. When fits are very
bad, RMS error tends toward a limit of 100%, and the correlation measure is
probably more sensitive!

I'm a little dubious about trying to devise a single "figure of merit."
Control is a multidimensional phenomenon, and while a model may predict
very well in one dimension it may not do so well in another. It's sort of like
trying to find a single figure of merit with which to compare Einstein and
Horowitz.

Best,

Bill P.

[Martin Taylor 2001.10.19.00:50]

[From Rick Marken (2001.10.16.1430)]
...I should say that I mean collecting more data under _different_
experimental conditions. I don't think getting more of the same kind
of data (more degrees of freedom for model fitting) is the answer. I
think a pretty good example of what I mean is provided by the "Levels
of Intention...", "Perceptual organization..." and "Degrees of
freedom..." papers in _Mind Readings_. All three papers show how one
model, with no parameter changes, explains the behavior of subjects
in two or three different experimental situations.

Actually, I have six experimental conditions, with 32 subjects. The
problem is to find a model that does fit believably well to any of
them.

···

------------------------------------------

[From Bill Powers (2001.10.17.0704 MDT)]

Thanks for the careful answer.

A sensitivity analysis would reveal the true nature of the problem.

It was doing an empirical sensitivity analysis that alerted me to the
fact that the problem existed, but it hasn't helped much in resolving
it.

You might find it useful to measure RMS errors as a function of frequency
(in the analysis, not during the experiment). Just compute the prediction
error as a function of time and then do a Fourier analysis of it. I would
predict a curve that starts close to zero and rises with frequency.

Yes, I have done a little checking by frequency band, and have found
what looks like a peak in the error in the region of 1/3 or 1/5 Hz
(if I remember correctly, since the data are not here with me).

The problem with the RMS error figure of merit is that it does
incorporate the human high-frequency noise, and it is not clear where
it is legitimate to set the cut-off of a low-pass filter. I have
settled (?) on a low-pass filter with cut-off around 3 Hz, and try to
fit the filtered data.
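
For what it is worth, that pre-filtering step might look like the
following, assuming scipy is available and the tracks are sampled at
60 Hz; the cut-off frequency and filter order are illustrative, not
settled choices:

    import numpy as np
    from scipy.signal import butter, filtfilt

    def lowpass(track, cutoff_hz=3.0, fs=60.0, order=4):
        # zero-phase low-pass filter: keep the band below cutoff_hz, where the
        # control behaviour lives, and drop the higher-frequency noise
        b, a = butter(order, cutoff_hz, btype="low", fs=fs)
        return filtfilt(b, a, np.asarray(track, dtype=float))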

I'm a little dubious about trying to devise a single "figure of merit."
Control is a multidimensional phenomenon, and while a model may predict
very well in one dimension it may not do so well in another.

If one is trying to find a best fit set of parameters to define a
model, one would think there should be a single figure of merit to
fix the optimum model. I hear what you are saying, though.

Since posting my original message, I've been given a little paid
programming help, which will start in a couple of weeks. The
intention is to make a more flexible, graphically oriented, program
that will allow me to try out various things a lot less laboriously
than I have been doing using my own programming skills. I'll let you
know what happens.

Martin

[From Bill Powers (2001.10.19.0713 MDT)]

Martin Taylor 2001.10.19.00:50--

Yes, I have done a little checking by frequency band, and have found
what looks like a peak in the error in the region of 1/3 or 1/5 Hz
(if I remember correctly, since the data are not here with me).

Probably most important is the value of the error at zero (or the lowest)
frequency. If it's very small, this means that the basic model is right.
The higher-frequency error has to do with matching details of the
performance, which is naturally less accurate than getting the basic
controlled variable right and so on.

The problem with the RMS error figure of merit is that it does
incorporate the human high-frequency noise, and it is not clear where
it is legitimate to set the cut-off of a low-pass filter. I have
settled (?) on a low-pass filter with cut-off around 3 Hz, and try to
fit the filtered data.

That's probably a good number -- in the 50s, the people at Wright-Patterson
AFB measured a peak in the loop gain for manual tracking systems at about
2.5 Hz. A good name to look up would be Elkind, if I remember it right,
under the subject name of "human operator."

I think it may be a good idea to review your objectives in this project.
It's possible that you're trying to do something that can't be done -- pick
out well-defined values of the system parameters, for example, in a system
with a performance that simply isn't sensitive to even fairly large changes
in the parameters (some of them). Negative feedback has that effect.

Best,

Bill P.

[From Rick Marken (2001.10.19.0850)]

Me:

...I should say that I mean collecting more data under _different_
experimental conditions.

Martin Taylor (2001.10.19.00:50) --

Actually, I have six experimental conditions, with 32 subjects.
The problem is to find a model that does fit believably well
to any of them.

That's great. If the subjects are skillfully controlling in all experimental
conditions then I think your "figure of merit" would be the model that gives
the best prediction (measured by RMS error) for all 32 subjects in all six
conditions, with allowance for subject-wise parameter adjustments. If,
however, some (or all) subjects were not controlling particularly well in
some (or all) conditions then there will be severe limits to the merit that
any figure can show for a control model. I mention this because I don't want
you to waste too much of your time trying to make sense of what may be random
noise.
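
As a sketch of what such an aggregate figure of merit might look like
(the function names and data layout are hypothetical, and the
per-subject fitting routine is whatever optimizer one prefers):

    import numpy as np

    def overall_rms(model_fn, data, fit_subject_params):
        # data: dict of subject -> condition -> (target, disturbance, human_track)
        # fit_subject_params: fits ONE parameter set per subject across all of
        # that subject's conditions (parameters may differ between subjects,
        # but not between conditions for the same subject)
        squared_errors = []
        for subject, conditions in data.items():
            params = fit_subject_params(model_fn, conditions)
            for target, disturbance, human in conditions.values():
                model_track = model_fn(target, disturbance, **params)
                squared_errors.append(np.mean((np.asarray(human) - model_track) ** 2))
        return np.sqrt(np.mean(squared_errors))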

Best regards

Rick

···
