Falsifiability

[From Bill Powers (941006.0645 MDT)]

Martin Taylor (941006.1345) --

Now that you are working with a slow link, I got your post within no
more than 6 hours after your time stamp. It could have been less; I
haven't looked at my mail for 6 hours.

I agree with you about the "no counterexamples" criterion of Popper;
taken literally it rules out all theories. But I have a positive
attitude toward his falsifiability criterion, at least as I interpret
it: that a test should actually be a test. A test that can't be failed
is no test of a theory.

A lot depends on how a theory is stated. If you say "I propose that
grade performance in college depends on ACT scores," you're proposing a
"theory" that is next to unfalsifiable. If you don't get a significant
correlation on your first try, just keep doubling the size of the test
population until you do get significance. If that doesn't work, accept a
less stringent significance criterion. You can hardly lose, whichever
way the correlation goes.
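
To make that concrete, here is a minimal sketch in Python of the doubling
strategy. The true correlation of 0.1 is an assumption standing in for
whatever weak ACT-grades relationship exists; everything here is
illustrative, not data from any real study.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_r = 0.1                      # assumed weak relationship: 1% of variance
n = 25
while True:
    act = rng.normal(size=n)                               # simulated ACT scores
    grades = true_r * act + np.sqrt(1 - true_r**2) * rng.normal(size=n)
    r, p = stats.pearsonr(act, grades)
    print(f"n = {n:5d}   r = {r:+.3f}   p = {p:.4f}")
    if p < 0.05:                  # the conventional significance criterion
        break
    n *= 2                        # "just keep doubling the size"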

Then look at this theory: "A person in a tracking task behaves as a
negative feedback control system does." That alone leaves a lot of room
(you could be talking about heat dissipation), but when you add that a
model based on control theory should duplicate a subject's performance
within a few percent, with 1800 samples of data in an experiment, the
prediction suddenly becomes far more risky. Even allowing for bandwidth
limitations and consequent coherence within the data, you're talking
about a probability of producing this behavior by chance that is
trillions to one against -- and a correspondingly large chance that any
important change in the details of the model would lead to a wrong
prediction. And instead of relying on large numbers of people to reveal
the effect, you're claiming that it will be seen in a single individual.
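
To show the shape of such a test, here is a rough sketch in Python. Since I
am not reproducing any real data here, the "subject" is itself simulated by
a control system plus a little noise, and every constant (gain, sampling
rate, disturbance smoothing) is an invented illustration, not the actual
experimental setup.

import numpy as np

def control_model(disturbance, gain, dt=1/60):
    """One-parameter tracking model: the output integrates gain * error,
    where error = reference (zero) minus perceived cursor position, and
    the cursor is output plus disturbance."""
    out = np.zeros_like(disturbance)
    o = 0.0
    for i, d in enumerate(disturbance):
        error = 0.0 - (o + d)
        o += gain * error * dt
        out[i] = o
    return out

rng = np.random.default_rng(1)
# Smoothed random disturbance: 1800 samples (30 seconds at 60 Hz, invented)
dist = np.convolve(rng.normal(size=1900), np.ones(100) / 100, mode="valid")[:1800]

# Stand-in for a real subject: the same kind of system plus a little noise
subject = control_model(dist, gain=8.0) + rng.normal(scale=0.01, size=1800)

# Fit the model's single parameter by brute-force search over plausible gains
gains = np.linspace(1, 20, 200)
rms = [np.sqrt(np.mean((control_model(dist, g) - subject) ** 2)) for g in gains]
best = gains[int(np.argmin(rms))]

handle_rms = np.sqrt(np.mean(subject ** 2))
print(f"best-fit gain           : {best:.2f}")
print(f"RMS model error         : {min(rms):.4f}")
print(f"error as % of handle RMS: {100 * min(rms) / handle_rms:.1f}%")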

Even putting aside the question of cooperation (any other theory would
have the same problem), we can work out what other kinds of models would
predict, and even test behavior under conditions requiring other models,
to show that there are in fact ways in which the control-theoretic model
could be falsified. For example, I have set up open-loop tracking
experiments in which the subject can see the target but must move the
cursor without being able to see it, relying only on the sensed position
of the arm or mouse. Even in easy situations, the RMS tracking errors are 10
times as large, meaning about 10 standard deviations larger than the errors in
a normal control experiment. Choosing between these models is not a
matter of statistical significance any more: it is a matter of picking
the most probable explanation where the chance of mistaking one type of
behavior for the other is about 10^9 to one against. There is no reason
in principle why the behavior could not be that of an open-loop system,
so we can definitely say that the control model is falsifiable. But with
such odds we can make the claim that it simply is not falsified, whereas
the rival model is, beyond any reasonable doubt, wrong.
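
A minimal sketch of why breaking the loop is so diagnostic (again with
invented constants, not the actual experiment): with the cursor visible, the
loop cancels most of the disturbance; with the cursor hidden, the output
cannot be corrected and the uncancelled disturbance shows up directly as
tracking error.

import numpy as np

rng = np.random.default_rng(2)
dist = np.convolve(rng.normal(size=1900), np.ones(100) / 100, mode="valid")[:1800]
dt, gain = 1 / 60, 8.0            # invented constants, as before

# Closed loop: the output continuously corrects the visible cursor error
o, err_closed = 0.0, []
for d in dist:
    e = 0.0 - (o + d)             # cursor = output + disturbance, target at zero
    o += gain * e * dt
    err_closed.append(o + d)

# Open loop: the cursor is invisible, so the disturbance goes uncorrected
# and only motor noise is added to it
err_open = dist + rng.normal(scale=0.02, size=dist.size)

rms = lambda v: float(np.sqrt(np.mean(np.square(v))))
print(f"closed-loop RMS error: {rms(err_closed):.3f}")
print(f"open-loop RMS error  : {rms(err_open):.3f}")
print(f"ratio                : {rms(err_open) / rms(err_closed):.1f} x")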

Furthermore, we can say (after considerable experience) that if in the
normal tracking experiment a subject were to show errors as much as two
or three times the normal tracking errors for that person, something
must be seriously wrong with the person. Falsification of the model
would almost certainly be diagnostic of the person rather than the
model. In this case, if we knew that something was seriously wrong with
the person, yet the same model as before with the same parameters
continued to predict that person's behavior just as accurately as
before, we would take this "failure of the model to fail" as showing
something wrong with the model!

It seems to me that when we start talking about quantitative model-based
predictions of behavior, we move into a different type of theorizing
altogether. All the old statistical manipulations are predicated on the
assumption that the phenomenon under investigation will be swamped by
noise, so it can't be seen accurately or at all without statistical
analysis. It's assumed that we will do well to prove the _existence_ of
a relationship; determining its form is normally impossible.

What is meant by falsifiability is very different in the two kinds of
theories.

You say " All, and I mean ALL, theories actually proposed are false in
some respect, and we know that as solidly as we know any fact." This is
a technically true statement, but there is a vast difference between a
theory that is falsified by one in 20 of the data sets used to test
it, and one that is falsified by one in 10^9 data sets. There is surely
something very different going on when we speak of the truth of Newton's
laws of dynamics and the truth of Bandura's laws of self-efficacy. Yes,
Newton's laws are no longer "true," but one has to ask about the
practical significance of that fact when under most conditions our most
sensitive instruments still can't detect any deviation of predictions
from observations. It's only in imagination that "true" and "false" have
values of 1 and 0.

Between differences from prediction of 2 standard deviations and
differences of 5 standard deviations, we go from a probable occurrence
of 5% to a probable occurrence of 5 x 10^-5%. By reducing the standard
deviation of prediction error by a factor of only 2.5, we achieve an
increase in predictability of 100,000 times. It seems to me that this
region of predictive error represents a gulf between two approaches to
understanding phenomena, and a giant step separating two ways of doing
science. And to get from the one region to the other, we only have to
reduce our predictive errors to 40% of their previous values!
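
For anyone who wants to check that arithmetic, here is a short calculation
assuming normally distributed prediction errors (two-sided tails; the exact
ratio comes out near 80,000, which rounds to the 100,000 figure above).

from math import erfc, sqrt

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal deviate."""
    return erfc(z / sqrt(2))

p2, p5 = two_sided_tail(2), two_sided_tail(5)
print(f"P(|Z| > 2) = {p2:.4f}      (about 5%)")
print(f"P(|Z| > 5) = {p5:.2e}   (about 5 x 10^-5 %)")
print(f"ratio      = {p2 / p5:,.0f}")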

Come on, you statistics guys. What am I doing wrong here?


--------------------------------------------------------------------
Best,

Bill P.

[Martin Taylor 941013 12:20]

Bill Powers (941006.0645 MDT) (received three minutes ago)

Now that you are working with a slow link, I got your post within no
more than 6 hours after your time stamp. It could have been less; I
haven't looked at my mail for 6 hours.

Now we are working with our "fast" link again (I think), and it has taken
a week for your message to reach here. But I'm glad it has, because you
put the argument for choosing one theory over another very well. It is my
Occam's razor argument put into concrete terms.

If you don't get a significant
correlation on your first try, just keep doubling the size of the test
population until you do get significance. If that doesn't work, accept a
less stringent significance criterion. You can hardly lose, whichever
way the correlation goes.

Exactly so, for ALL uses of significance tests as a mechanism of falsification.

Then look at this theory: "A person in a tracking task behaves as a
negative feedback control system does." That alone leaves a lot of room
(you could be talking about heat dissipation), but when you add that a
model based on control theory should duplicate a subject's performance
within a few percent, with 1800 samples of data in an experiment, the
prediction suddenly becomes far more risky.

Right. But the theory (as stated) is actually an amalgam of an infinity
of descriptions, descriptions that differ in the structure of the specific
negative feedback control system, in the kinds of data taken, and in the
parameters that describe the individual elements of the control system.

Some member of this infinite set provides the best description of the 1800
samples (which probably represent far fewer independent degrees of
freedom), but it is probable that no member of the set will provide an EXACT
description of the data unless the model includes the possible actions of a
whole hierarchy, including conflicting control systems, all with the optimum
set of freely fitted parameter values. And when you fit as many parameters
as the data has independent degrees of freedom, your theory has become no
better than a presentation of the raw data. The question
is always "how well can you describe the data with how much parametric
information."
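
A trivial illustration of that last point, with invented data: give a model
as many free parameters as there are data points and its "fit" becomes
perfect and worthless at the same time.

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 12)                       # 12 data points
data = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

for n_params in (2, 4, 12):                      # 12 parameters = 12 data points
    coeffs = np.polyfit(x, data, deg=n_params - 1)
    resid = data - np.polyval(coeffs, x)
    print(f"{n_params:2d} free parameters -> RMS residual = "
          f"{np.sqrt(np.mean(resid ** 2)):.2e}")
# With 12 parameters the residual is numerically zero: the "model" has
# simply restated the 12 numbers it was given.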

Choosing between these models is not a
matter of statistical significance any more: it is a matter of picking
the most probable explanation where the chance of mistaking one type of
behavior for the other is about 10^9 to one against. There is no reason
in principle why the behavior could not be that of an open-loop system,
so we can definitely say that the control model is falsifiable. But with
such odds we can make the claim that it simply is not falsified, whereas
the rival model is, beyond any reasonable doubt, wrong.

Yes, I agree. It is exactly the argument in the Occam's Razor working paper.
However, to amplify the argument, think of using an open-loop model
with a huge number of parameters that specify all of the changes in output
so as to produce an EXACT replica of the data. It can be done, for sure,
with enough independently selected parameters. The difference between the
models is not in that one can describe the data exactly and the other cannot.
It is in how closely the data can be described with how few (and how precise)
parameter values. The difference between a "good" theory and a "false" one
can be depicted graphically:

          100% |*************
               | +  ************** bad theory
               |  +               ***************
     % of data |   + good theory                 ************
unaccounted for|    +                                        ***...
               |     +
               |       +
            0% |__________________________________________________________
                  Number of "free" parameter values used to fit data

("Number of" is too simplistic, but it will do as an illustration).

It seems to me that when we start talking about quantitative model-based
predictions of behavior, we move into a different type of theorizing
altogether.

Right. The issue is how accurately the model describes what happens, and how
little you have to change the parameter values or the structural choices
within the theory (e.g., is the control system adaptive or not?) when you
get new data.

Furthermore, we can say (after considerable experience) that if in the
normal tracking experiment a subject were to show errors as much as two
or three times the normal tracking errors for that person, something
must be seriously wrong with the person. Falsification of the model
would almost certainly be diagnostic of the person rather than the
model. In this case, if we knew that something was seriously wrong with
the person, yet the same model as before with the same parameters
continued to predict that person's behavior just as accurately as
before, we would take this "failure of the model to fail" as showing
something wrong with the model!

Yes, but not necessarily wrong with the theory. Of the infinite set of models
that fall within the theory, you might be able to find one that fits the
new data, and if this new one fitted the old data as well as the model
you used earlier, you have probably found something significant about the
way people can fail. But not if it took too many new ad-hoc choices of
fitting parameters to do it.

there is a vast difference between a
theory that is falsified by one in 20 of the data sets used to test
it, and one that is falsified by one in 10^9 data sets.

Again, the questions should concern the range of conditions over which the
range of descriptive error is thus-and-so. There is indeed a vast difference
between the two kinds of theory you mention, IF the range of claim is the
same and the magnitude of error labelled "falsification" is the same. But
the range of claim is important. Most theories are silent about most data.
You wouldn't expect PCT to make any claim about how often a particular kind
of diverging set of curved tracks will be observed when some piece of lead
is exposed to a relativistic proton beam inside a bubble chamber. If you
include ALL randomly chosen data sets, PCT fails to describe even 1 in 10^9
of them in any detail. A THEORY describes data only within its range of claim.

It's only in imagination that "true" and "false" have
values of 1 and 0.

Oh, would that more people understood this!!!!

Come on, you statistics guys. What am I doing wrong here?

Nothing that I can see. A beautiful posting. But that's not unusual.
Hardly worth mentioning.


===============================
Given my problems with delays in receiving postings, I imagine that I will
have left for a month before your response to this arrives. I leave on
Oct 20, back Nov 15 for a week, and then gone again.

Martin