[Richard Kennaway, 970419.1730 BST]
Sittin' on the dock of CSGNET watching the threads come and go away is
rather different to jumping in and swimming with the sharks. I've got my
stuff together now, so here is a mass reply to the correspondence about
my article on correlations.
Everyone who wants a copy seems to have got one they can read by now, but
I'll just apologize for the Unix-ism in the URL I gave -- {dvi,tex} meant
that there are two files, one named ....dvi and the other ....tex. I've
added a postscript version now. The full URL again:
ftp://ftp.sys.uea.ac.uk/pub/kennaway/drafts/correlationinfo.ps. I can't
provide a web version until someone produces something like a dvi or
PostScript viewer which can run as a Java applet on a web page (which
someone told me is or will soon be a possibility).
Bruce Abbott (970416.1140 EST):
> The question is, meaningless for what purpose? For predicting a particular
> individual's performance (Y) based on that person's value of X, yes,
> correlations must be very high indeed to be of any value. But your
> conclusion above assumes that those published correlations were published
> based on their supposed utility in making such predictions. This is incorrect.
At some point I'll get around to looking at some of these papers. I have
Runkel's book and I'll get hold of the paper by Daniel Brown that Dag
Forssell mentioned.
> Where a
> number of variables are involved, all free to vary at least to some extent,
> the correlation between any one of them and the response measure may be
> poor. Here is an example I worked up involving four predictor variables and
> a response variable. Each predictor variable is orthogonal to the others in
> the population (though not necessarily in the sample), and each is normally
> distributed. The correlations of predictors with the response variable were
> as follows:
>
>            P1      P2      P3      P4
>    r       0.519   0.544   0.527   0.468
>    r-sq    0.269   0.296   0.278   0.219
>
> All of these correlations are within the range of values declared as
> "meaningless" in your paper. It may therefore come as something of a
> surprise to some readers that the four predictors used together perfectly
> predict the value of the response measure.
Let me guess: P1...P4 are independently normally distributed variables and
the response variable (which I'll call R) is their sum? Actually, I
didn't guess, I peeked ahead to your later message explaining how they
were generated. [...scribble on back of envelope...] Yes, that's right.
Four such variables will each correlate 0.5 with their sum, while being
together (of course) a perfect predictor of the sum.
If you let the variables be correlated, you can do the same trick with
just two. I can construct P1, P2, and R such that each pair of them
correlates -0.5, while P1 and P2 together perfectly predict R.
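One way to realise this (a sketch; all variables taken to have unit
variance):

    \mathrm{Cov}(P_1,P_2) = -\tfrac{1}{2}, \qquad R = -(P_1 + P_2),

whence

    \mathrm{Var}(R) = 2 + 2\,\mathrm{Cov}(P_1,P_2) = 1, \qquad
    \mathrm{Cov}(P_1,R) = -(1 + \mathrm{Cov}(P_1,P_2)) = -\tfrac{1}{2},

and likewise for P2, so every pair correlates -1/2 while P1 + P2
determines R exactly.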
If we depart from normal distributions, I can show you N random variables,
each having zero correlation with R and with each other, yet together they
perfectly predict R. For N=2, I can even make it physically realistic.
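For N=2, here is one such construction (a numpy sketch; the physical
reading is instantaneous power as the product of voltage and current):

    import numpy as np

    rng = np.random.default_rng(1)
    samples = 1_000_000

    # Two independent standard normal variables; R is their product.
    P1 = rng.standard_normal(samples)
    P2 = rng.standard_normal(samples)
    R = P1 * P2

    # All three pairwise correlations vanish (up to sampling error)...
    print(f"corr(P1, P2) = {np.corrcoef(P1, P2)[0, 1]:.4f}")
    print(f"corr(P1, R)  = {np.corrcoef(P1, R)[0, 1]:.4f}")
    print(f"corr(P2, R)  = {np.corrcoef(P2, R)[0, 1]:.4f}")
    # ...yet R is an exact function of (P1, P2): knowing both predicts
    # it perfectly, while either alone leaves its conditional mean at zero.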
In general, given variables P1...Pn, and the correlation of each with R
when the others are unconstrained, the degree to which they can jointly
predict R can be anything from no better than the best of them, to
perfectly.
What is the moral of this? What these examples tell me is that low
correlation does not imply the absence of yet-unobserved mechanisms
linking the apparently uncorrelated variables. But they do not tell me
that low correlation implies the presence of anything but itself.
Martin Taylor 970419 13:50:
> Now suppose you were an experimenter, looking to see what was affecting this
> variable that interested you, and you guessed that Bruce's influence P1
> might be doing so. You might vary P1 and see what effect it had on the
> variable V.
...and so on, measuring each of the others, then -- presto! --
discovering that all of them together predict the variable very well.
Well, yes, that might happen. It doesn't seem realistic to base one's
programme of research on the possibility, though. But I'm not a
psychologist.
The next version of my paper might have something to say about the
mathematics of combining many poor predictors into a good predictor.
Back of an envelope calculation tells me that for a multivariate normal
distribution of variables P1...Pn and R, such that the bivariate
correlations of P1...Pn with each other are zero and the correlation of
each with R is 0.5, the explained variances simply add up: each
predictor contributes 0.5^2 = 0.25 of the variance of R, so jointly
"explaining" 90% of it requires n to be at least 0.9/0.25 = 3.6, i.e. at
least 4. That is just Bruce's example, and 4 is also the most such
predictors that can coexist, since the contributions cannot sum past 1.
Real predictors are rarely orthogonal, though: if P1...Pn are positively
correlated with each other, each new one explains less of what is left,
n can be much higher, and past a certain degree of intercorrelation the
90% cannot be reached at all. There must be better ways to spend money
than looking for more and more poor predictors of the things you want to
predict.
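To put a number on the correlated case (a sketch in numpy; the
equicorrelated pattern and the values of rho are an illustration of
mine, not data from anywhere):

    import numpy as np

    def joint_r_squared(n, rho, r=0.5):
        """Population R^2 of regressing R on n unit-variance predictors,
        pairwise correlated rho, each correlating r with R:
        c' Sigma^{-1} c, which reduces to n r^2 / (1 + (n - 1) rho)."""
        sigma = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
        c = np.full(n, r)
        return float(c @ np.linalg.solve(sigma, c))

    # rho = 0: contributions add, 0.25 each, so four suffice.
    # rho = 0.1: six are needed to pass 90%.
    # rho = 0.3: the limit is 0.25/0.3 = 0.83, so 90% is unreachable
    # however many predictors are piled on.  (Values above 1 flag
    # correlation patterns that cannot actually occur together.)
    for rho in (0.0, 0.1, 0.3):
        print(f"rho = {rho}:",
              [round(joint_r_squared(n, rho), 2) for n in (1, 2, 4, 6, 16)])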
Hans Blom, 970417:
> Much depends on what the correlation info is _used for_. Even low
> correlations may be important to indicate that two variables are
> somehow related
I omitted to indicate in the abstract that the concern of the paper is
primarily with the validity of making individual predictions rather than
aggregate ones. I think everyone on CSGNET agrees that aggregate
statistics can be used to make aggregate predictions.
> The questions of statistics and physics are different, I think. Much
> simplified: Statistics wants to know _whether_ there is a relation,
> physics what the _form_ of the relation is.
No, someone doing statistics is concerned with the form of the
relationship, someone doing physics or any other science is concerned
with the mechanism that gives rise to that relationship. Knowing the
empirical form of the relationship between two things is useful data, but
a scientist who discovers such a relationship should want to then find an
explanation for it. "F = ma" is more than just an observed relationship
between measurements. (Certain philosophers of science have claimed that
a scientific theory is no more than a compact representation of the mass
of experimental data; I disagree, and as far as I know, the view has
little currency nowadays.)
But "practical" might be a better word to use in the title than
"physical".
> Yet we may know that the large scale study actually
> demonstrated that X was better for 50.1% of the patients and worse
> for 49.9%. _Lacking any other knowledge_, would the medical decision
> be different from the logical one?
The practitioner faced with a patient to treat might as well go with the
drug that, statistically, has been more efficacious in tests (with costs
and side-effects also taken into consideration). However, the medical
researchers should try to find out what those drugs are actually doing,
how they are influencing the body's mechanisms. It may be that for some
patients, drug X is definitely worse than drug Y. The practitioner may
get a fractionally higher success rate by switching from Y to X, and
there's nothing wrong with that, but the response of an individual
patient may be completely different.
It must also be remembered that statistical significance is just one test
of the meaningfulness of a result. All tests of statistical significance
make assumptions -- assumptions of independence of various
observations from each other, assumed distributions of random variables,
etc. The meaningfulness of the results depends also on the accuracy of
those assumptions. If a study gave your example result of 50.1% X /
49.9% Y, I would not consider it as demonstrating any superiority of X at
all, regardless of how high the statistical "confidence level" was,
simply because such a tiny effect is likely to be swamped by errors in
those assumptions. In general, the smaller the effect detected, the more
those assumptions must be scrutinised. This is routine in physics: when
someone claims to have detected a tiny but surprising phenomenon, the
first thing they and their colleagues do is think up all sorts of
artefacts that might be responsible, and find out whether they are.
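An illustration (a sketch in numpy/scipy, with invented numbers: a
nominal 5% t-test applied to autocorrelated observations that have no
true effect at all):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    trials, n, phi = 2000, 200, 0.5   # phi: AR(1) autocorrelation

    false_positives = 0
    for _ in range(trials):
        # True mean is exactly zero, but successive observations are
        # correlated, violating the independence assumption of the test.
        e = rng.standard_normal(n)
        x = np.empty(n)
        x[0] = e[0]
        for t in range(1, n):
            x[t] = phi * x[t - 1] + e[t]
        if stats.ttest_1samp(x, 0.0).pvalue < 0.05:
            false_positives += 1

    # Should be about 5%; with phi = 0.5 it comes out around 25-30%,
    # plenty to manufacture a "significant" but spurious tiny effect.
    print(f"rejections: {100 * false_positives / trials:.1f}%")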
On the question of the number of bits of information given by a
correlation:
Bruce Abbott (970417.2200 EST):
[describing a situation where Y = cubic function of X + random noise]
> Now, computing Pearson r for the set of points you observed in this process,
> you discover that it is 0.886. Only one bit of information about the
> position of Y given X, and yet you have established that the underlying
> function is, say, cubic, with an extremely high level of confidence. Go figure.
My calculation of mutual information is only valid for bivariate normal
distributions. I had hoped to get some sort of mathematical result for
arbitrary distributions, but it didn't work out.
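For reference, the bivariate normal formula in question is the standard
one:

    I(X;Y) = -\tfrac{1}{2}\log_2(1 - r^2) \quad\text{bits},

so r = 0.886 gives -(1/2) log2(0.215), about 1.1 bits, which I take to
be where the "only one bit" comes from.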
That aside, one must distinguish in your example the problem of
predicting the underlying value of Y from X, from predicting the actual
value of Y from X. In your example, you can do the former, but not
necessarily (with much precision) the latter. (The precision you get for
the latter will depend on how wiggly the cubic is relative to the size of
the random disturbance.)
As a scientist, if all I had was a statistically observed cubic-like
drift in the mean of Y with X, that might suggest to me what sort of
mechanisms to look for that might be causing it. If, in individual
cases, it is swamped by the random component, then the basic point of my
paper applies: no valid individual predictions.
I'd also quibble with the terminology of "underlying". The observed
value of Y simply is what it is. It may be expressible as a sum of Y1 (a
cubic function of X) and Y2 (a random variable apparently unrelated to
X). This decomposition may have some physically observable counterpart.
But to call Y1 the "underlying" value of Y seems to me to be a piece of
verbalising with no physical counterpart.
(Abbott, same message, replying to Powers:)
> Are you saying that a 75% reduction in noise
> is valueless? I hope you weren't in the business of designing radio
> receivers!
In the more usual units for measuring these things, a 75% reduction in
noise power is an (additive) increase in signal-to-noise ratio of 6 dB.
It is certainly useful to improve the design of an amplifier to get an
extra 6 dB of S/N ratio. However, we're then talking about population
predictions. Individual predictions -- the precise value of the output
of an amplifier for a given input at an instant of time -- are not
required by radio engineers. They want to know the expected value for
each possible input (they want that to be as linear as possible), and the
noise level (they want that to be as low as possible). Given those, the
value of the noise at any instant is unimportant, except for non-Gaussian
excursions due to lightning strikes and the like.
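The arithmetic, for the record: noise power reduced to a quarter gives

    \Delta\mathrm{SNR} = 10\log_{10}(1/0.25) \approx 6.02\,\mathrm{dB}.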
Bruce Abbott (970417.1025 EST):
> The only point at which I take issue is that Richard's declaration that low
> correlations (0.50 or below) are "useless" may be misunderstood. What he
> means by that is that they are useless _individually_ for the _purpose of
> predicting the Y-value of a point from its X-value_ (or vice versa).
That's right.
> As I
> have noted, this does not mean that they are without scientific merit for
> other purposes, which may account for such correlations being frequently
> reported in the scientific literature.
I'll reserve judgement on that. Hmm..."The practical meaning of the
correlation coefficient for multivariate normal distributions, part II:
aggregate predictions". I don't promise to write it though.
Thanks to everyone for all the responses.
__
\/__ Richard Kennaway, jrk@sys.uea.ac.uk, http://www.sys.uea.ac.uk/~jrk/
\/ School of Information Systems, Univ. of East Anglia, Norwich, U.K.