[From Richard Kennaway (970421.1605 BST)]
Bruce Abbott (970419.1740 EST):
> Yes, and despite their low values, such correlations may provide useful
> information concerning the linkages among variables. Richard, have you ever
> heard of factor analysis? Path analysis?
Factor analysis yes, path analysis no.
Useful for what? Real examples would be illuminating.
> Richard Kennaway, 970419.1730 BST --
>> But they do not tell me
>> that low correlation implies the presence of anything but itself.
> I'm not sure what you mean to convey here; could you be more explicit?
What I mean is that the presence of a low non-zero correlation of P with R
does not imply that P is likely to constitute part of a good predictor or
explanation of R. It just tells you that there's a low, non-zero
correlation of P with R. Maybe you decide to see what might be combined
with P to better predict or explain R. Maybe you decide P isn't useful.
The low correlation is an input to your decision as an experimenter about
where to look next to explain R, but cannot be regarded as being on its own
a partial explanation of R.
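The point can be made concrete with a small simulation (a sketch in Python with numpy; the variables and numbers are invented for illustration). Two candidate predictors with identical marginal correlations to R can differ completely in what they add jointly: one is redundant, the other completes the explanation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def r_squared(X, y):
    """Fraction of the variance of y explained by a least-squares fit on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.var(y - X @ beta) / np.var(y)

P1 = rng.normal(size=N)
noise = rng.normal(size=N)
R = P1 + noise                      # corr(P1, R) is about 0.71

# Candidate A: a noisy copy of P1.  It correlates with R about as well
# as P1 does, yet adds essentially nothing to the joint fit.
A = P1 + 0.1 * rng.normal(size=N)

# Candidate B: the remaining noise term itself.  Its marginal correlation
# with R is the same as P1's, but jointly with P1 it explains everything.
B = noise

print(r_squared(P1[:, None], R))                # about 0.5
print(r_squared(np.column_stack([P1, A]), R))   # still about 0.5
print(r_squared(np.column_stack([P1, B]), R))   # essentially 1.0
```

So the marginal correlation alone cannot say which of these roles a candidate will play; that is exactly the sense in which it is only an input to the decision about where to look next.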
You gave a fictitious example where four poor correlates of the variable to
be predicted did together give a good -- perfect -- prediction. Does this
ever happen with real data? Can you cite any experiments in which some
number of factors, each poorly correlating (c < 0.6, say) with the variable
to be predicted, together gave an excellent prediction (c > 0.98) on new
data? If I got a result like that in my research, my reaction wouldn't be
"oh yeah, four variables each correlating 0.5, not surprising that together
they account for everything", it would be "Wow! Eureka! Four pieces of
junk produce this pearl of knowledge! At last, a solid prediction that I
can base further research on!"
I stipulate that they must predict well on new data, as it is very easy to
fit curves as accurately as you like to any given data. The accuracy of that
fit is no guarantee that it will fit data that were not used to construct it.
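That stipulation is easy to illustrate (a sketch in Python with numpy; the data are simulated for the purpose): a polynomial with as many coefficients as data points passes through every training point exactly, and still tells you nothing about fresh draws from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    """Simulated process: R depends only weakly on P, plus noise."""
    p = rng.uniform(-1, 1, n)
    return p, 0.3 * p + rng.normal(0, 1, n)

# Fit a degree-7 polynomial through 8 training points: with as many
# coefficients as points, the curve reproduces the training data exactly
# (up to floating-point error).
p_train, r_train = draw(8)
coeffs = np.polyfit(p_train, r_train, deg=7)
train_err = np.max(np.abs(np.polyval(coeffs, p_train) - r_train))

# On fresh data from the same process, the fitted curve does worse than
# simply predicting the mean of R.
p_test, r_test = draw(10_000)
mse_fit = np.mean((np.polyval(coeffs, p_test) - r_test) ** 2)
mse_mean = np.mean((r_test - np.mean(r_test)) ** 2)

print(f"max training error: {train_err:.1e}")
print(f"test MSE: {mse_fit:.2f}  (variance of R: {mse_mean:.2f})")
```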
>> Back of an envelope calculation tells me that for a multivariate normal
>> distribution of variables P1...Pn and R, such that the bivariate
>> correlations of each of P1...Pn with each other are zero and of each with
>> R is 0.5, the fraction of the variance of R that P1...Pn jointly
>> "explain" is the sum of the squared correlations, n * 0.5^2 = n/4. For
>> this to reach 90% requires n to be at least 0.9/0.25 = 3.6. This is a
>> tad over 3 and a half, so at least 4 of those observations must be made;
>> and since n/4 cannot exceed 1, no more than 4 mutually uncorrelated
>> 0.5-correlates of R can exist at all. If there is any correlation among
>> P1...Pn, the arithmetic changes.
> This is a bit confusing in that you have used n to stand both for the number
> of independent predictors and the number of observations. For a
> multivariate analysis, you must have at least as many observations as
> predictors, and in practice you would want far more.
n is only the number of predictors. I didn't mention the number of
observations (call it N), but I assumed it to be large enough to neglect
the inaccuracy in the measures of correlations -- far more than n, as you
say.
This topic reminds me of a method of mathematical proof called "career
induction". You want to prove a theorem, say, that all Grossmayer syzygies
are noetherian (don't worry, the words are just flimflam). You prove a
special case -- all Grossmayer syzygies of degree at most 1 are noetherian.
Then you prove another case: all semi-flat Grossmayer syzygies of degree 2
are noetherian. Then another, and another...slicing off ever thinner
pieces of salami from the goal. If each piece is thick enough to make a
publication, you can base a career on it, hence the name. As a
mathematician, when I notice I'm engaging in career induction, it's a
wake-up call to me that I'm on the wrong track.
My gut feeling (which I admit is uncontaminated by any experience of
experimental research) is that "career factor analysis" -- accumulating
more and more partial correlates of the thing to be predicted --
constitutes a similar sort of aimless groping in the dark. I am open to
hearing of examples where it panned out.