Combining poor correlates

[From Richard Kennaway (970421.1605 BST)]

Bruce Abbott (970419.1740 EST):

Yes, and despite their low values, such correlations may provide useful
information concerning the linkages among variables. Richard, have you ever
heard of factor analysis? Path analysis?

Factor analysis yes, path analysis no.

Useful for what? Real examples would be illuminating.

Richard Kennaway, 970419.1730 BST --

But they do not tell me
that low correlation implies the presence of anything but itself.

I'm not sure what you mean to convey here; could you be more explicit?

What I mean is that the presence of a low non-zero correlation of P with R
does not imply that P is likely to constitute part of a good predictor or
explanation of R. It just tells you that there's a low, non-zero
correlation of P with R. Maybe you decide to see what might be combined
with P to better predict or explain R. Maybe you decide P isn't useful.
The low correlation is an input to your decision as an experimenter about
where to look next to explain R, but cannot be regarded as being on its own
a partial explanation of R.

You gave a fictitious example where four poor correlates of the variable to
be predicted did together give a good -- perfect -- prediction. Does this
ever happen with real data? Can you cite any experiments in which some
number of factors, each poorly correlating (c < 0.6, say) with the variable
to be predicted, together gave an excellent prediction (c > 0.98) on new
data? If I got a result like that in my research, my reaction wouldn't be
"oh yeah, four variables each correlating 0.5, not surprising that together
they account for everything", it would be "Wow! Eureka! Four pieces of
junk produce this pearl of knowledge! At last, a solid prediction that I
can base further research on!"
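A fictitious example of this kind is easy to construct with simulated data. The sketch below is only an assumption about how such an example might be built, not the one originally posted: four independent standard-normal predictors, with R defined as exactly half their sum. The names `pearson` and `simulate` are made up for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(num_samples=20000, seed=0):
    """Return (individual correlations, joint correlation) for simulated data."""
    rng = random.Random(seed)
    # Assumption: four mutually independent standard-normal predictors.
    preds = [[rng.gauss(0, 1) for _ in range(num_samples)] for _ in range(4)]
    # R is half the sum of the predictors, so it has unit variance and
    # is fully determined by them (no noise term).
    r = [0.5 * sum(p[i] for p in preds) for i in range(num_samples)]
    single = [pearson(p, r) for p in preds]
    combined = [sum(p[i] for p in preds) for i in range(num_samples)]
    joint = pearson(combined, r)
    return single, joint

single, joint = simulate()
print([round(c, 2) for c in single], round(joint, 2))
```

Each predictor on its own should correlate about 0.5 with R, while their sum correlates essentially perfectly, because R was built from nothing else. Whether real data ever behave like this is the question being asked.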

I stipulate that they must predict well on new data, as it's very easy to
fit curves as accurate as you like to any given data. The accuracy of that
fit is no guarantee that it will fit data that were not used to construct
it.
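The point about fitting can be illustrated with a small sketch. It assumes a made-up data source (a straight line plus Gaussian noise) and uses Lagrange interpolation as one arbitrary way to get a curve that passes through every fitted point; `lagrange_eval`, `rms_error`, and `noisy_line` are hypothetical helper names.

```python
import math
import random

def lagrange_eval(xs, ys, x):
    """Evaluate at x the polynomial passing through all points (xs[k], ys[k])."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != k:
                weight *= (x - xm) / (xk - xm)
        total += yk * weight
    return total

def rms_error(xs, ys, fit_xs, fit_ys):
    """Root-mean-square error of the fitted curve on the points (xs, ys)."""
    errs = [(lagrange_eval(fit_xs, fit_ys, x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errs) / len(errs))

rng = random.Random(1)

def noisy_line(x):
    # Assumed data source: y = 2x plus unit-variance Gaussian noise.
    return 2.0 * x + rng.gauss(0, 1)

train_x = [float(i) for i in range(8)]
train_y = [noisy_line(x) for x in train_x]
new_x = [i + 0.5 for i in range(7)]        # points not used for the fit
new_y = [noisy_line(x) for x in new_x]

train_rms = rms_error(train_x, train_y, train_x, train_y)
new_rms = rms_error(new_x, new_y, train_x, train_y)
print(train_rms, new_rms)
```

The curve reproduces the fitted points exactly, so its error there is zero, but that says nothing about its error on the new points, which is what the stipulation is about.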

Back of an envelope calculation tells me that for a multivariate normal
distribution of variables P1...Pn and R, such that the bivariate
correlations of each of P1...Pn with each other are zero and of each with
R is 0.5, then for P1...Pn to jointly "explain" 90% of the variance of R
requires n to be at least 2 log 0.1/log 0.75 (0.1 is 1 - 0.9, and 0.75 is
1 - 0.5^2). This is a tad over 16, and it doesn't guarantee that when
those 16 observations are made, 90% of the variance will be explained,
only that it cannot happen with fewer. If there is any correlation among
P1...Pn, n might be higher or lower.
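The arithmetic of the quoted expression can be checked directly. This sketch only evaluates 2 log 0.1/log 0.75 as written; it does not derive or validate the bound, and the variable names are illustrative.

```python
import math

unexplained = 1 - 0.9          # fraction of variance left unexplained
per_predictor = 1 - 0.5 ** 2   # 1 minus the squared bivariate correlation

# The ratio of logarithms is the same in any base; natural logs are used here.
bound = 2 * math.log(unexplained) / math.log(per_predictor)
print(round(bound, 2))
```

The value comes out a little over 16, matching the figure given.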

This is a bit confusing in that you have used n to stand both for the number
of independent predictors and the number of observations. For a
multivariate analysis, you must have at least as many observations as
predictors, and in practice you would want far more.

n is only the number of predictors. I didn't mention the number of
observations (call it N), but I assumed it to be large enough to neglect
the inaccuracy in the measures of correlations -- far more than n, as you
say.

This topic reminds me of a method of mathematical proof called "career
induction". You want to prove a theorem, say, that all Grossmayer syzygies
are noetherian (don't worry, the words are just flimflam). You prove a
special case -- all Grossmayer syzygies of degree at most 1 are noetherian.
Then you prove another case: all semi-flat Grossmayer syzygies of degree 2
are noetherian. Then another, and another...slicing off ever thinner
pieces of salami from the goal. If each piece is thick enough to make a
publication, you can base a career on it, hence the name. As a
mathematician, when I notice I'm engaging in career induction, it's a
wake-up call to me that I'm on the wrong track.

My gut feeling (which I admit is uncontaminated by any experience of
experimental research) is that "career factor analysis" -- accumulating
more and more partial correlates of the thing to be predicted --
constitutes a similar sort of aimless groping in the dark. I am open to
hearing of examples where it panned out.

__
\/__ Richard Kennaway, jrk@sys.uea.ac.uk, http://www.sys.uea.ac.uk/~jrk/
  \/ School of Information Systems, Univ. of East Anglia, Norwich, U.K.

[From Bruce Abbott (970425.1355 EST)]

Richard Kennaway (970421.1605 BST) --

You gave a fictitious example where four poor correlates of the variable to
be predicted did together give a good -- perfect -- prediction. Does this
ever happen with real data? Can you cite any experiments in which some
number of factors, each poorly correlating (c < 0.6, say) with the variable
to be predicted, together gave an excellent prediction (c > 0.98) on new
data? If I got a result like that in my research, my reaction wouldn't be
"oh yeah, four variables each correlating 0.5, not surprising that together
they account for everything", it would be "Wow! Eureka! Four pieces of
junk produce this pearl of knowledge! At last, a solid prediction that I
can base further research on!"

I stipulate that they must predict well on new data, as it's very easy to
fit curves as accurate as you like to any given data. The accuracy of that
fit is no guarantee that it will fit data that were not used to construct
it.

This sort of thing really isn't my area (mostly I do single-subject
experimental studies that do not involve computing correlations), so I've
simply reached into my old files to retrieve an example I learned about in
grad school when studying human judgment and decision making. The results
to be described are from Keeley, S. M., & Doherty, M. E. (1972). Bayesian
and Regression Modeling of Graduate Admission Policy. _Organizational
Behavior and Human Performance_, _8_, 297-323.

Keeley and Doherty (1972) wanted to know how experienced judges (four
biology profs who had served on an admissions committee) determine whom to
accept or reject for admission to a biology master's program. The authors
consulted these judges to determine what information supplied about the
applicant they believed useful in arriving at their decisions. Six sources
of information ("cues") were identified as useful: Overall GPA (A), Quality
of School (S), GRE Verbal (V), GRE Quantitative (Q), Background in Physical
Sciences (B), and Grades in Physical Sciences (G). The judges were then
supplied this information ("profiles") about 528 hypothetical applicants;
the information supplied about an applicant varied from profile to profile,
and consisted of either 1, 2, 4, or all six cues. A number of the profiles
were duplicated in order to assess the reliability of the judgements.
Judges were asked to judge the probability that each applicant would obtain
the MA if admitted.

Here are the added proportions of variance in the judgements accounted for
by each cue and the total variance accounted for, for the 6-cue profiles,
for each judge:

Cue   WDB   WJB   RC    WH
A     .65   .72   .70   .54
S     .07   .00   .05   .02
V     .12   .09   .09   .04
Q     .08   .02   .07   .11
B     .00   .00   .00   .00
G     .00   .00   .00   .00
Tot   .88   .83   .92   .73
AQ    .02   .04   .01   .15
Tot   .90   .87   .93   .88

("AQ" is a "configural cue," the cross-product A*Q. The "added proportion"
is the additional proportion of variance accounted for by a cue that is not
already accounted for by previous cues.)

Cues B and G accounted for essentially none of the variance in the judges'
decisions. Overall GPA accounted for most of the variance, although adding
S, V, Q, and (in the case of judge WH) the A*Q product brought variance
accounted for up to the .87 to .93 range, equivalent to a range of
correlations from 0.93 to 0.96.
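The conversion from proportion of variance accounted for to a correlation is just a square root. A minimal sketch, using the final "Tot" row of the table and an illustrative dictionary name:

```python
import math

# Final "Tot" row of the table: proportion of variance accounted for
# by all cues, including the A*Q product, for each judge.
total_variance = {"WDB": 0.90, "WJB": 0.87, "RC": 0.93, "WH": 0.88}

# The multiple correlation is the square root of that proportion.
correlations = {judge: round(math.sqrt(r2), 2)
                for judge, r2 in total_variance.items()}
print(correlations)
```

The smallest and largest of these are the 0.93 and 0.96 quoted above.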

There is more information presented in this study than I've given here
(e.g., the judges' reliabilities), but I think this is enough to make the
point. Also, I know that Doherty normally includes a "hold-out" sample,
which he uses to validate the regression by having it predict responses on
a portion of the data not used to establish the cue weighting. I do not
find mention of this procedure in this paper, perhaps because the purpose
was to _compare_ regression and Bayesian results on the same data. Another
work, in which such a validation was conducted, produced a model for hiring
decisions for insurance salespersons, but I don't have a copy of that study
and thus cannot provide a report of its results.

Regards,

Bruce