Re: Fitting models to data
[Martin Taylor 2004.11.21.11.40]
In response to
[From Bill Powers (2004.11.19.0653 MST)]
Rick Marken (2004.11.19.1500)
and
[From Rick Marken (2004.11.20.2200)]
I think both Bill and Rick completely missed the point of the
posting to which they both responded (Martin Taylor 2004.11.18.17.51).
In my turn, I may be missing their points, but I guess that’s
unavoidable in this kind of interchange. I hope I won’t be unfair in
this message.
I’ll take the last first. In response to Marc Abrams’s comment that “Powers’ response to Martin yesterday was a shame and a challenge,” Rick said:

“It sounded like a very reasonable and measured explanation of how to go about testing models when fitting different models to the same data has been pushed to the limit and resulted in a tie.”
I strongly disagree with Marc, but for different reasons. At no
point in my original message did I suggest that a situation existed in
which “fitting different models to the same data has been
pushed to the limit and resulted in a tie.” Nor would I
ordinarily be interested in such a situation.
What I presented was a situation in which comparing different models showed that an apparently reasonable method of data fitting has statistically undesirable characteristics, and that those characteristics are likely to generalize to other PCT-based experiments. I attempted to lay out the reasons why these problems might generalize to a range of experiments that attempt to go beyond the simple question of whether people actually control when they are tracking a variable.
To repeat the observation that led me to this conclusion: When
model simulation runs are done many times to discover an optimum set
of parameters for each individual model, the average optimum fit for
Model A is better than that for Model B, but the best of these optima
is usually better for Model B than for Model A. This suggests that it
is harder for the optimization technique (a refinement of e-coli) to
find the optimum parameter values for Model B, but that Model B is
actually the better model.
Now for Rick’s other message, and then I’ll get to Bill’s because
both have similar points, but Bill’s deserves more specific
responses.
(Bill) I still maintain that the problem to be solved is not what model to use, but what experiment to do that will not include so many unknown factors as to make the results meaningless.

(Rick) I agree. I think finding the right experiment is usually the best way to find the best fitting model.
(Which isn’t exactly what Bill said, but let that pass).
I think that finding the best model of behavior is not just a matter of varying model parameters to get the best fit to a particular data set. It’s also (and, perhaps, more importantly) a matter of varying experimental conditions in order to produce the data that will most clearly distinguish contending models.
The gist of these comments is that if you want to study the
differential effects of sleep loss and complexity on perceptual
control, you should not use sleepy subjects, and you should not use
complex perceptual conditions. But, on the other hand, if you want to
compare the effects of sleep loss and complexity on perceptual
control, you should use subjects who get sleepy and you should use
both simple and complex perceptual signals. Did I get that
right?
That may sound like an unfair paraphrase, but given the question
and the boundary conditions, I find a different interpretation hard to
discover.
If it is an unfair paraphrase, I’m sure Rick or Bill will let me
know.
Well, actually, I can think of a different interpretation, but
maybe that one is even less fair – it is that the PCT studies of even
straightforward tracking are impossible to do and to interpret except
under the idealized conditions of maximally alert subjects and
trivially simple presentation conditions. I really don’t think that
Rick and Bill mean this (and I hope I’m right).
Now Bill P.
Martin Taylor 2004.11.18.17.51–
Often, on CSGnet, it is claimed that the validity of a model fit in a tracking task is attested by the high correlation usually observed between the real data and the data simulated by the model. These correlations are often in the high 0.9’s. In my view, such correlations may indicate no more than that both the model and the subject are acting as control systems capable of countering disturbances in the task at hand. If the subject controls well and the disturbance excursions substantially exceed the noise, high correlations between subject and model are inevitable. But that doesn’t imply that the model is the right model.
“Right model” has several meanings. Is it the right kind of
model, and is the particular implementation of that kind of model the
right one? The main question I have tried to answer is the first; the
second ranges from hard to answer without doing detailed quantitative
experiments to impossible to answer without knowing the detailed
neuroanatomy and neural functions.
That’s a valid comment, in that it expresses what Bill has tried
to answer. I, on the other hand, was trying to tease out a small part
of the second question. In particular, I asked the question:
“What aspects of control change when the task is simple or
complex, and are those aspects differentially affected by loss of
sleep?” Having 32 subjects, each doing 42 separate tracks, under
conditions from fully alert through the loss of a night’s sleep to
restored wakefulness the next day, seemed to offer an opportunity to
do more than simply say “They seem to be controlling except when
they seem not to be”.
There is, in fact, no reason to think
that if two people control the same thing in the same way at the gross
level, their nervous systems implement that control in the same way.
There is always more than one way to accomplish a given function, so
there is no reason to think that all nervous systems accomplish the
same ends by using the same circuitry – at any level of detail.
But they can still all be control systems in the PCT sense of the
term.
Quite so, which is why all my analyses deal with the control
processes modelled for each individual (and each track) separately,
before any comparisons are made. All comparisons are among the
parameters modelled. That offers the potential to show up individual
differences in the ways subjects might be controlling.
The high correlations between model and
real behavior that we find with the simplest control model are, of
course, so high as to be uninformative. I prefer looking at RMS
differences between model and real measures, and even better simply
examining the differences in detail. Correlations mask
differences – how much difference is there between a correlation of
0.95 and 0.97? Then ask what the difference in prediction errors is,
and it will be far larger as a proportion. I prefer measures that
magnify differences rather than minimizing them.
Quite so. That’s one of the points I was trying to get across. But I went a bit further, to point out that RMS differences are not always appropriate. I have, in fact, been looking at the differences in detail, which raises yet other questions that I didn’t mention in my initial message and won’t raise here. I might, in a later message, once the points at issue in my first message have been cleared up.
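As a small numerical illustration of the point about correlations masking differences (my own arithmetic, not part of the original exchange, and assuming the model prediction behaves essentially like a regression on the data):

$$
\frac{\text{RMS error}}{\sigma_{\text{data}}} \approx \sqrt{1-r^{2}}, \qquad
\sqrt{1-0.95^{2}} \approx 0.31, \qquad
\sqrt{1-0.97^{2}} \approx 0.24 .
$$

A difference of about 2% in the correlation thus corresponds to roughly a 22% difference in relative prediction error, and to nearly a 40% difference in unexplained variance.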
Even without fitting a model to the data
– just using a linear integrating control model with very high loop
gain and zero lags – we can show that human control behavior in a
simple tracking task is the same as that of the model within five to
ten percent RMS…
As far as I’m concerned, that made our case for this type of task. Tom
and I had hoped that others with more resources, seeing how we went
about testing the theories, would extend this approach to other realms
of behavior and ultimately to all of it, drawing the boundaries of the
theory of living control systems. Then it would be logical to start
trying to improve the fit of the model to real behavior,
Nevertheless, you raise what seem to be objections in principle
to my attempting to follow that programme.
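For concreteness, here is a minimal sketch of the kind of “linear integrating control model with very high loop gain and zero lags” referred to in the quote above. The function name, parameter names, and gain value are illustrative placeholders, not the values used in the demonstrations Bill describes.

```python
import numpy as np

def simulate_tracking(reference, disturbance, dt=0.01, gain=100.0):
    """Compensatory loop: perception = output + disturbance,
    error = reference - perception, and the output integrates gain * error.
    High loop gain, no transport lag, as in the quoted description."""
    n = len(reference)
    output = np.zeros(n)
    perception = np.zeros(n)
    for i in range(1, n):
        perception[i - 1] = output[i - 1] + disturbance[i - 1]  # controlled perception
        error = reference[i - 1] - perception[i - 1]
        output[i] = output[i - 1] + gain * error * dt           # pure integrating output
    perception[-1] = output[-1] + disturbance[-1]
    return perception
```

The RMS mismatch between such a model and a human record is then np.sqrt(np.mean((model_cursor - human_cursor) ** 2)), the five-to-ten-percent figure presumably being that quantity expressed relative to the excursions of the tracked variable.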
In 1994, I was involved in an experiment in which people worked for three days and two nights continuously except for a 10 or 15 minute break every 2 hours in which they could chat, eat, or whatever, but not sleep. There were quite a few different tasks, including some tracking tasks programmed by Bill P. In that study, there were hints of a counterintuitive result, that the sleep loss had less effect on complex tasks than on simple ones.
As you know, I took exception to the interpretations of the results,
since the behavior was entirely too complex for a simple model to
handle. The fact that subjects could be asleep with the mouse in their
hands for seconds at a time made the model fits to the tracking data
meaningless. Even a slight lapse like a sneeze can increase the RMS
fit error by a substantial factor when tracking is being done very
consistently throughout a run. What would be the effects of simply
ceasing to track for an unknown length of time? Huge. So how do you
salvage meaningful results when there are large unpredictable changes
in the higher-order organization of the system? The answer is, you
don’t.
Wrong. The answer is that you develop algorithms that
conservatively eliminate from the modelling those stretches of data
that clearly represent micro-sleep, and model the rest. In the 1994
study, the algorithm was very simple: stretches of longer than X
seconds in which the target moved but the cursor didn’t were
considered to be periods when the subject was not tracking (I forget
what I used for “X”). And to avoid the obvious picky
comment, yes, I included a “guard-band” around the
non-tracking interval.
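A sketch of that kind of exclusion rule, in case a concrete version helps. The value of X, the guard-band width, and the movement thresholds below are placeholders, not the values used in the 1994 analysis.

```python
import numpy as np

def mark_non_tracking(target, cursor, dt, x_seconds=2.0, guard_seconds=0.5,
                      cursor_eps=1e-6, target_eps=1e-6):
    """Flag samples inside stretches longer than x_seconds in which the
    target moved but the cursor did not, plus a guard band on each side."""
    n = len(cursor)
    still = (np.abs(np.diff(cursor, prepend=cursor[0])) < cursor_eps) & \
            (np.abs(np.diff(target, prepend=target[0])) > target_eps)
    min_len = int(round(x_seconds / dt))
    guard = int(round(guard_seconds / dt))
    exclude = np.zeros(n, dtype=bool)
    i = 0
    while i < n:
        if still[i]:
            j = i
            while j < n and still[j]:
                j += 1
            if j - i >= min_len:            # long enough to count as a micro-sleep
                exclude[max(0, i - guard):min(n, j + guard)] = True
            i = j
        else:
            i += 1
    return exclude                          # True = drop from the model fit
```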
You abort the mission and go back to the
drawing board, and try not to regret the expense. Naturally, that
course was politically impossible to take, I quite understand. But
scientifically, that is what should have been done.
Actually, it’s rather difficult to study the effects of sleep loss if
you eliminate, a priori, conditions in which subjects might be
expected to be sleepy. I’d say that is a scientific problem, not a
political one.
In 2002, another experiment was run. This time it was over a 26-hour span, meaning one sleepless night. I devised a tracking experiment that incorporated three different levels of perceptual complexity and two modes, one involving perception of visual magnitude, the other of numerical magnitude. Each of 32 subjects was asked to do 42 separate 50-second tracks. So I have a lot of data. The question is to model it, something that has been a problem for several reasons.
I still maintain that the problem to be solved is not what model to
use, but what experiment to do that will not include so many unknown
factors as to make the results meaningless. How do you know that you
had three different levels of perception?
OK. I didn’t want to extend an already overlong message, but since
you need to be reminded, here are the tasks. Each was a pursuit
tracking task. In one set of three, the target to be tracked was a
varying number and the subject’s mouse varied a number displayed on
the screen. In the other set, the target was the length of a line, and
the subject’s mouse varied the length of a line displayed horizontally
on the screen.
The three levels of complexity were (1) the subject matched a varying target, either a displayed number or the displayed length of a line (presented vertically on the screen to avoid the possibility that the subject might simply compare positions rather than line lengths); (2) the subject matched the same varying target plus or minus a fixed amount, the amount being indicated before the start of the trial run – for example, the subject might be asked to make the controlled number be 5 less than the varying target number; (3) the subject matched the same target plus another number or line that varied more slowly on both sides of zero (a downgoing secondary line indicated subtraction).
I don’t think that this really matters to the points I was making
in the thread-initiating message, which was about questions relating
to model-fitting. Those questions were intended to be general to many
kinds of PCT-based experiments. The problems were exposed by the
sensitivity of the experiment, which made it seem appropriate to
mention how they arose.
The issue I want to address now is model-fitting. The model I fit to the data from the 1994 study was a simple “classical PCT” control model with the addition of the “Marken prediction” element that makes the reference signal become not the target, but the target advanced by adding an amount proportional to the target velocity.
I have no idea what you’re talking about here. What is the
“Marken prediction element?” I have never seen a PCT model
in which the target is advanced by adding an amount proportional to
the target velocity. Would you describe this model in more
detail?
Ask Rick. He was quite pleased with the improvements of model fit
it gave, and I simply copied it from him (with credit in the
publication).
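In outline, the element makes the reference signal follow the target plus a term proportional to the target’s velocity. A sketch of that structure (the gain names, the crude velocity estimate, and the loop details are illustrative only, not the model that was actually fitted):

```python
import numpy as np

def run_prediction_model(target, disturbance, dt=1.0 / 60.0,
                         k_output=8.0, k_predict=0.2):
    """Simple PCT tracking loop whose reference is the target advanced
    in proportion to the target velocity."""
    n = len(target)
    output = np.zeros(n)
    cursor = np.zeros(n)
    for i in range(1, n):
        target_velocity = (target[i] - target[i - 1]) / dt   # crude velocity estimate
        reference = target[i] + k_predict * target_velocity  # "advanced" target
        error = reference - cursor[i - 1]                    # perception is the cursor position
        output[i] = output[i - 1] + k_output * error * dt    # integrating output function
        cursor[i] = output[i] + disturbance[i]
    return cursor
```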
Actually, it might be relevant for me to mention that the
original reason for trying several models was that I did not have an a
priori reason to prefer one competent controller over another, but I
thought that if the relation between direct observation of the target
and prediction of its future magnitude was important, that relation
should show up in a variety of structurally different models. The
important thing was to find structurally different models that
provided nearly equally good fits. If one model turned out to be a
unique way to fit the data best, so much the better, but that wasn’t
required for an answer to the experimental question.
… Perhaps I should just say that I’m
expressing a preference for clear results even if this limits me, for
now, to simple experiments.
In other words, given that I have a question raised by earlier
experiments: “There are hints that complex tasks are less
affected by sleep loss than are simple tasks; is this true?”, I
should not attempt to look at it from a PCT perspective? Or are you
saying that when a major sleep loss study is being conducted, I should
not attempt to introduce any kind of PCT-based study?
Having a criterion allows one to look for optimum parameter values for a model. I am using a modified e-coli method, not knowing any computationally efficient way of finding an optimum in a surface of unknown roughness. Here’s the actual method: I choose an arbitrary set of parameter values that allows the model to track the target. From that point in 5-parameter space I choose an arbitrary direction and find the best fit parameter set along that line. This provides a new starting point.
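In code, the procedure looks roughly like this; fit_error() stands in for whatever model-versus-data criterion is being minimized, and the step size and iteration counts are placeholders rather than the settings actually used.

```python
import numpy as np

def random_direction_search(fit_error, params0, step=0.05,
                            n_directions=200, n_line_points=40, rng=None):
    """Repeatedly pick a random direction in parameter space and keep the
    best-fitting point found along that line as the new starting point."""
    rng = np.random.default_rng() if rng is None else rng
    best = np.asarray(params0, dtype=float)
    best_err = fit_error(best)
    for _ in range(n_directions):
        start = best.copy()
        direction = rng.standard_normal(start.size)
        direction /= np.linalg.norm(direction)               # unit direction
        for t in np.linspace(-n_line_points * step, n_line_points * step,
                             2 * n_line_points + 1):
            candidate = start + t * direction
            err = fit_error(candidate)
            if err < best_err:                               # best point on this line so far
                best, best_err = candidate, err
    return best, best_err
```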
I’ve thought of doing this, but haven’t actually tried it. It is definitely a good idea, especially for handling the local-minimum problem. I’m not sure how real the “narrow valley” phenomenon is.
Actually, you can see it very easily, by looking at the
correlations among the pairs of parameter values that lead to
near-optimum fits. If you get the same near-optimum fit with (say)
high gain and low threshold as with low gain and high threshold, but
the true optimum is with mid gain and mid threshold, and raising or
lowering both together makes a big difference, then you have an
observable narrow valley. That is what I see in my data.
(I should emphasise that gain and threshold are not necessarily a
pair that exhibits high correlations in the actual data. I use them as
abstract example parameters. There are several pairs for which the
problem shows up. Actually, it was such a tight trade-off between gain
and exponent that led me to abandon the use of the exponent in fitting
models to data. Gain and exponent were almost surrogates for each
other in the fits.)
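One concrete way to see the trade-off is to collect the parameter sets from many near-optimum fits of the same track (or from repeated optimization runs) and correlate them pairwise; the function name and printout below are illustrative.

```python
import numpy as np

def parameter_correlations(near_optimum_sets, names):
    """near_optimum_sets: (n_fits, n_params) array of parameter vectors
    whose fit error lies within some tolerance of the best fit found."""
    corr = np.corrcoef(np.asarray(near_optimum_sets, dtype=float), rowvar=False)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:+.2f}")
    return corr
```

A strongly negative correlation between, say, gain and threshold (high gain with low threshold fitting about as well as low gain with high threshold) is the signature of a narrow diagonal valley in the error surface.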
The result of this trade-off is that the further analysis of trends in either of the two correlated parameters is very noisy, even if the underlying real trends are smooth. The optimization for one track may wind up with high gain and low threshold, and for the next with low gain and high threshold, when the hard-to-find true underlying optimum would have shown (say) a slight increase in threshold and a slight reduction in gain. The trade-off that shows up as correlation adds spurious noise that obscures the trends that are the subject of the experimental question. This, in turn, means that it is hard to tease out individual differences. If the data were rescaled to be near spherical, this “narrow valley” problem would go away.
Quite separately, there is an artificial kind of narrow valley
caused by a less than optimal scaling of the individual variables in
the different dimensions. That’s a question of the choice you make in
setting up what you treat as equal sized steps in the different
directions, but it’s just as real in the results you get as is the
intrinsic “narrow valley” shown by the inter-parameter
correlations you get from multiple optimizations for the same track.
The difference is that the noise induced by inappropriate scaling
affects only one of the variables at a time. It’s more a question of
making the search for the true optimum slower than necessary (perhaps
greatly slower) than it is a question of providing potentially
misleading results.
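One possible reading of “rescaled to be near spherical” (my reading, not a procedure from the experiment) is to estimate the covariance of the near-optimum parameter sets and search in whitened coordinates, which makes the valley roughly round and, as a side effect, equalizes the step sizes along the individual axes:

```python
import numpy as np

def whitening_maps(near_optimum_sets):
    """Return maps between raw model parameters and a roughly spherical
    search space, based on the spread of near-optimum parameter sets."""
    X = np.asarray(near_optimum_sets, dtype=float)
    mean = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    scale = np.sqrt(np.maximum(evals, 1e-12))                 # guard against degenerate axes
    to_search = lambda p: (evecs.T @ (np.asarray(p) - mean)) / scale
    from_search = lambda z: mean + evecs @ (np.asarray(z) * scale)
    return to_search, from_search
```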
The biggest problem, I find, is how to
avoid getting so immersed in ever-more-complex analysis that flaws in
the basic approach are overlooked. Is the experiment itself really
telling us what we want to know? Are we trying to squeeze blood from a
turnip?
I’m well aware of that issue. Much of the reason that these
questions are coming up so long after the actual experiment is that
I’ve spent a lot of time going over the basic issues involved. Every
time I do this, I find more (unfortunately, science is like
that).
The power function is another good idea.
I think an error-detection threshold may be warranted, too – I seem
to have one, although it may be a sign of age that wouldn’t show up in
a younger subject.
I thought it might show up as subjects got sleepy; perhaps they wouldn’t care about an error that they would have tried to correct when they were alert.
The problem here is that if you
have the right kind of model to begin with, you’re already accounting
for 95% or so of the variance,
Variance in what? My problem with this kind of statement is that
ANY system that actually controls will account for almost all of the
variation that matches the variation in the target. That tells you
nothing. What interests me is the deviation between the target
and what the subject actually does. For that, it simply isn’t true
that “if you have the right kind of model to begin with, you’re
already accounting for 95% or so of the variance.” The problem is
exactly to find “the right kind of model.” That’s the end
product of the experiment, not the a priori starting point.
Well, I wish you luck with your modeling
of the sleep effects. Actually, I think you will probably come up with
results that are much better than the usual run of psychological
“findings,” so we can hope that your work will bring a few
more onlookers into the fold. The very subject of research through
modeling – in the new book I call it “investigative modeling”
– is pretty new to mainstream psychology in the large, and your
investigations might well create some wider interest in this approach.
Hope you can get it published.
Thank you. I hope so, too. But first I have to resolve the issues I brought up in my initial post.
Martin