correlations

[Richard Kennaway, 970419.1730 BST]

Sittin' on the dock of CSGNET watching the threads come and go away is
rather different to jumping in and swimming with the sharks. I've got my
stuff together now, so here is a mass reply to the correspondence about
my article on correlations.

Everyone who wants to seems to have got a copy they can read by now, but
I'll just apologize for the Unix-ism in the URL I gave -- {dvi,tex} meant
that there are two files, one named ....dvi and the other ....tex. I've
added a postscript version now. The full URL again:
ftp://ftp.sys.uea.ac.uk/pub/kennaway/drafts/correlationinfo.ps. I can't
provide a web version until someone produces something like a dvi or
PostScript viewer which can run as a Java applet on a web page (which
someone told me is or will soon be a possibility).

Bruce Abbott (970416.1140 EST):

The question is, meaningless for what purpose? For predicting a particular
individual's performance (Y) based on that person's value of X, yes,
correlations must be very high indeed to be of any value. But your
conclusion above assumes that those published correlations were published
based on their supposed utility in making such predictions. This is incorrect.

At some point I'll get around to looking at some of these papers. I have
Runkel's book and I'll get hold of the paper by Daniel Brown that Dag
Forssell mentioned.

Where a
number of variables are involved, all free to vary at least to some extent,
the correlation between any one of them and the response measure may be
poor. Here is an example I worked up involving four predictor variables and
a response variable. Each predictor variable is orthogonal to the others in
the population (though not necessarily in the sample), and each is normally
distributed. The correlations of predictors with the response variable were
as follows:

            P1      P2      P3      P4
    r       0.519   0.544   0.527   0.468
    r-sq    0.269   0.296   0.278   0.219

All of these correlations are within the range of values declared as
"meaningless" in your paper. It may therefore come as something of a
surprise to some readers that the four predictors used together perfectly
predict the value of the response measure.

Let me guess: P1...P4 are independently normally distributed variables and
the response variable (which I'll call R) is their sum? Actually, I
didn't guess, I peeked ahead to your later message explaining how they
were generated. [...scribble on back of envelope...] Yes, that's right.
Four such variables will each correlate 0.5 with their sum, while being
together (of course) a perfect predictor of the sum.
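
Here, for anyone who wants to check the arithmetic, is a minimal
numerical sketch in Python (my own illustration, not Bruce's actual
generating code):

    # Four independent standard normal predictors and their sum.  Each
    # should correlate about 0.5 with the sum; jointly they determine it.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    P = rng.standard_normal((n, 4))      # columns are P1..P4
    R = P.sum(axis=1)                    # the response is simply their sum

    for i in range(4):
        print(np.corrcoef(P[:, i], R)[0, 1])    # each about 0.5

    # Joint prediction: least squares of R on P1..P4 leaves no residual.
    coef, *_ = np.linalg.lstsq(P, R, rcond=None)
    print(np.var(R - P @ coef))          # essentially zero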

If you let the variables be correlated, you can do the same trick with
just two. I can construct P1, P2, and R, such that each correlates -0.5
with each other, while P1 and P2 together perfectly predict R.

If we depart from normal distributions, I can show you N random variables,
each having zero correlation with R and with each other, yet together they
perfectly predict R. For N=2, I can even make it physically realistic.
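
One standard construction of that kind, purely as an illustration (not
necessarily the example I have in mind for the paper): let P1 and P2 be
independent fair coin flips valued +1 or -1, and let R = P1*P2. A
sketch in Python:

    # Two variables with zero pairwise correlation (with each other and
    # with R) that nonetheless determine R exactly: R = P1 * P2.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    P1 = rng.choice([-1.0, 1.0], size=n)
    P2 = rng.choice([-1.0, 1.0], size=n)
    R = P1 * P2

    print(np.corrcoef([P1, P2, R]))   # off-diagonal entries all near 0
    # Knowing both P1 and P2 pins R down exactly, so the joint rule
    # R = P1*P2 predicts with zero error.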

In general, given variables P1...Pn, and the correlation of each with R
when the others are unconstrained, the degree to which they can jointly
predict R can be anything from no better than the best of them, to
perfectly.

What is the moral of this? What these examples tell me is that low
correlation does not imply the absence of yet-unobserved mechanisms
linking the apparently uncorrelated variables. But they do not tell me
that low correlation implies the presence of anything but itself.

Martin Taylor 970419 13:50:

Now suppose you were an experimenter, looking to see what was affecting this
variable that interested you, and you guessed that Bruce's influence P1
might be doing so. You might vary P1 and see what effect it had on the
variable V.

...and so on, measuring each of the others, then -- presto! --
discovering that all of them together predict R very well. Well, yes,
that might happen. It doesn't seem realistic to base one's programme of
research on the possibility, though. But I'm not a psychologist.

The next version of my paper might have something to say about the
mathematics of combining many poor predictors into a good predictor.
Back of an envelope calculation tells me that for a multivariate normal
distribution of variables P1...Pn and R, such that the bivariate
correlations of each of P1...Pn with each other are zero and of each with
R is 0.5, then for P1...Pn to jointly "explain" 90% of the variance of R
requires n to be at least 2 log 0.1/log 0.75 (0.1 is 1 - 0.9, and 0.75 is
1 - 0.5^2). This is a tad over 16, and it doesn't guarantee that when
those 16 observations are made, that 90% of the variance will be explained,
only that it cannot happen with fewer. If there is any correlation among
P1...Pn, n might be higher or lower. There must be better ways to spend
money than looking for more and more poor predictors of the things you
want to predict.

Hans Blom, 970417:

Much depends on what the correlation info is _used for_. Even low
correlations may be important to indicate that two variables are
somehow related

I omitted to indicate in the abstract, that the concern of the paper is
primarily with the validity of making individual predictions rather than
aggregate ones. I think everyone on CSGNET agrees that aggregate
statistics can be used to make aggregate predictions.

The questions of statistics and physics are different, I think. Much
simplified: Statistics wants to know _whether_ there is a relation,
physics what the _form_ of the relation is.

No, someone doing statistics is concerned with the form of the
relationship, someone doing physics or any other science is concerned
with the mechanism that gives rise to that relationship. Knowing the
empirical form of the relationship between two things is useful data, but
a scientist who discovers such a relationship should want to then find an
explanation for it. "F = ma" is more than just an observed relationship
between measurements. (Certain philosophers of science have claimed that
a scientific theory is no more than a compact representation of the mass
of experimental data; I disagree, and as far as I know, the view has
little currency nowadays.)

But "practical" might be a better word to use in the title than
"physical".

Yet we may know that the large scale study actually
demonstrated that X was better for 50.1% of the patients and worse
for 49.9%. _Lacking any other knowledge_, would the medical decision
be different from the logical one?

The practitioner faced with a patient to treat might as well go with the
drug that, statistically, has been more efficacious in tests (with costs
and side-effects also taken into consideration). However, the medical
researchers should try to find out what those drugs are actually doing,
how they are influencing the body's mechanisms. It may be that for some
patients, drug X is definitely worse than drug Y. The practitioner may
get a fractionally higher success rate by switching from Y to X, and
there's nothing wrong with that, but the response of an individual
patient may be completely different.

It must also be remembered that statistical significance is just one test
of the meaningfulness of a result. All tests of statistical significance
make assumptions -- assumptions of independence of various
observations from each other, assumed distributions of random variables,
etc. The meaningfulness of the results depends also on the accuracy of
those assumptions. If a study gave your example result of 50.1% X /
49.9% Y, I would not consider it as demonstrating any superiority of X at
all, regardless of how high the statistical "confidence level" was,
simply because such a tiny effect is likely to be swamped by errors in
those assumptions. In general, the smaller the effect detected, the more
those assumptions must be scrutinised. This is routine in physics: when
someone claims to have detected a tiny, but surprising phenomenon, the
first thing they and their colleagues do is think up all sorts of
artefacts that might be responsible, and find out if they are.

On the question of the number of bits of information given by a
correlation:
Bruce Abbott (970417.2200 EST):
[describing a situation where Y = cubic function of X + random noise]

Now, computing Pearson r for the set of points you observed in this process,
you discover that it is 0.886. Only one bit of information about the
position of Y given X, and yet you have established that the underlying
function is, say, cubic, with an extremely high level of confidence. Go figure.

My calculation of mutual information is only valid for bivariate normal
distributions. I had hoped to get some sort of mathematical result for
arbitrary distributions, but it didn't work out.
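
For reference, the standard closed form for the mutual information of a
bivariate normal pair with correlation r is I = -(1/2)*log2(1 - r^2)
bits, and it does reproduce the figure quoted above. A quick check:

    # Mutual information of a bivariate normal with correlation r,
    # assuming the standard closed form I = -0.5 * log2(1 - r^2) bits.
    from math import log2

    def gaussian_mi_bits(r):
        return -0.5 * log2(1.0 - r * r)

    print(gaussian_mi_bits(0.886))   # about 1.1 bits -- the "one bit" above
    print(gaussian_mi_bits(0.5))     # about 0.21 bits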

That aside, one must distinguish in your example the problem of
predicting the underlying value of Y from X, from predicting the actual
value of Y from X. In your example, you can do the former, but not
necessarily (with much precision) the latter. (The precision you get for
the latter will depend on how wiggly the cubic is relative to the size of
the random disturbance.)

As a scientist, if all I had was a statistically observed cubic-like
drift in the mean of Y with X, that might suggest to me what sort of
mechanisms to look for that might be causing it. If, in individual
cases, it is swamped by the random component, then the basic point of my
paper applies: no valid individual predictions.

I'd also quibble with the terminology of "underlying". The observed
value of Y simply is what it is. It may be expressible as a sum of Y1 (a
cubic function of X) and Y2 (a random variable apparently unrelated to
X). This decomposition may have some physically observable counterpart.
But to call Y1 the "underlying" value of Y seems to me to be a piece of
verbalising with no physical counterpart.

(Abbott, same message, replying to Powers:)

Are you saying that a 75% reduction in noise
is valueless? I hope you weren't in the business of designing radio
receivers!

In the more usual units for measuring these things, a 75% reduction in
noise is an (additive) increase in signal to noise ratio by 6dB. It is
certainly useful to improve the design of an amplifier to get an extra
6dB of SN ratio. However, we're then talking about population
predictions. Individual predictions -- the precise value of the output
of an amplifier for a given input at an instant of time -- are not
required by radio engineers. They want to know the expected value for
each possible input (they want that to be as linear as possible), and the
noise level (they want that to be as low as possible). Given those, the
value of the noise at any instant is unimportant, except for non-Gaussian
excursions due to lightning strikes and the like.
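
(The arithmetic behind that figure: cutting the noise power to 25% of
its former value multiplies the signal-to-noise power ratio by 4, and
10*log10(4) = 6.02 dB.)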

Bruce Abbott (970417.1025 EST):

The only point at which I take issue is that Richard's declaration that low
correlations (0.50 or below) are "useless" may be misunderstood. What he
means by that is that they are useless _individually_ for the _purpose of
predicting the Y-value of a point from its X-value_ (or vice versa).

That's right.

As I
have noted, this does not mean that they are without scientific merit for
other purposes, which may account for such correlations being frequently
reported in the scientific literature.

I'll reserve judgement on that. Hmm..."The practical meaning of the
correlation coefficient for multivariate normal distributions, part II:
aggregate predictions". I don't promise to write it though.

Thanks to everyone for all the responses.

···

__
\/__ Richard Kennaway, jrk@sys.uea.ac.uk, http://www.sys.uea.ac.uk/~jrk/
  \/ School of Information Systems, Univ. of East Anglia, Norwich, U.K.

[Richard Kennaway, 970419.2125 BST]

Just after I posted a long message, the following caught my eye:

Bruce Abbott (970416.1140 EST):

Correlations are most often computed in order to determine whether there is
any linear association between the variables in question. Such associations
may arise through a variety of mechanisms; one rather well-known case in
this forum is through control-system activity when variations in a
disturbance to the CV are negatively correlated with variations in the
action of the system, as in the rubber band demo.

According to what I've read here, that isn't a good example in a discussion
of low correlations. It is said that the correlations one finds in those
experiments are almost invariably in the high 90s. 98 or 99% is not unusual.

···

__
\/__ Richard Kennaway, jrk@sys.uea.ac.uk, http://www.sys.uea.ac.uk/~jrk/
  \/ School of Information Systems, Univ. of East Anglia, Norwich, U.K.

[From Bruce Abbott (970419.1740 EST)]

Richard Kennaway, 970419.1730 BST --

Me (various):

Thanks, Richard, for your thoughtful reply. In matters mathematical, I
defer to your expertise.

It may therefore come as something of a
surprise to some readers that the four predictors used together perfectly
predict the value of the response measure.

Let me guess: P1...P4 are independently normally distributed variables and
the response variable (which I'll call R) is their sum? Actually, I
didn't guess, I peeked ahead to your later message explaining how they
were generated. [...scribble on back of envelope...] Yes, that's right.
Four such variables will each correlate 0.5 with their sum, while being
together (of course) a perfect predictor of the sum.

If you let the variables be correlated, you can do the same trick with
just two. I can construct P1, P2, and R, such that each correlates -0.5
with each other, while P1 and P2 together perfectly predict R.

Yes, the shared variance of the predictors can only count once toward
predicting R. If the shared variance were 1.0, then the second predictor
would contribute nothing over that provided by the first. I provided the
simplest case (independent predictors) so as to offer the clearest example
of the principle.

In general, given variables P1...Pn, and the correlation of each with R
when the others are unconstrained, the degree to which they can jointly
predict R can be anything from no better than the best of them, to
perfectly.

Yes.

What is the moral of this? What these examples tell me is that low
correlation does not imply the absence of yet-unobserved mechanisms
linking the apparently uncorrelated variables.

Yes, and despite their low values, such correlations may provide useful
information concerning the linkages among variables. Richard, have you ever
heard of factor analysis? Path analysis?

But they do not tell me
that low correlation implies the presence of anything but itself.

I'm not sure what you mean to convey here; could you be more explicit?

Martin Taylor 970419 13:50:

Now suppose you were an experimenter, looking to see what was affecting this
variable that interested you, and you guessed that Bruce's influence P1
might be doing so. You might vary P1 and see what effect it had on the
variable V.

...and so on, measuring each of the others, then -- presto! --
discovering that all of them together predict R very well. Well, yes,
that might happen. It doesn't seem realistic to base one's programme of
research on the possibility, though. But I'm not a psychologist.

I think physical scientists often fail to appreciate just how difficult it
is to obtain clear measures of _anything_ going on in the nervous system, or
to replicate observations. Unlike electrons or the chemical elements, no
two organisms are identical, nor do they behave exactly alike when exposed
to what would seem to be identical test conditions. One is not free to take
measures of physical signals in the brains of human beings, and even if one
could, we are talking about a closed-loop, massively parallel system of
enormous complexity whose wiring diagram is understood, if at all, in only
the vaguest way. Getting a clear measure of anything in this system,
uncontaminated by other influences, is nearly impossible in most cases. For
this reason it is often necessary to take a number of indirect measures,
each of which may be subject to many uncontrollable influences, and which
themselves are composites of other variables, and from the pattern of
relationship revealed attempt to deduce something about the underlying
organization responsible for that pattern. It's a tough problem to cope with.

Back of an envelope calculation tells me that for a multivariate normal
distribution of variables P1...Pn and R, such that the bivariate
correlations of each of P1...Pn with each other are zero and of each with
R is 0.5, then for P1...Pn to jointly "explain" 90% of the variance of R
requires n to be at least 2 log 0.1/log 0.75 (0.1 is 1 - 0.9, and 0.75 is
1 - 0.5^2). This is a tad over 16, and it doesn't guarantee that when
those 16 observations are made, that 90% of the variance will be explained,
only that it cannot happen with fewer. If there is any correlation among
P1...Pn, n might be higher or lower.

This is a bit confusing in that you have used n to stand both for the number
of independent predictors and the number of observations. For a
multivariate analysis, you must have at least as many observations as
predictors, and in practice you would want far more.

There must be better ways to spend
money than looking for more and more poor predictors of the things you
want to predict.

The goal is not to look for "more and more poor predictors" but to identify
good predictors. Two or more poor predictors may turn out to be composites,
each of which is measuring the same unitary variable; a correlational
analysis may identify such cases, and one may then devise a more direct
measure of the latter. Of course, no one is satisfied with low
correlations. Not even psychologists.

I omitted to indicate in the abstract, that the concern of the paper is
primarily with the validity of making individual predictions rather than
aggregate ones. I think everyone on CSGNET agrees that aggregate
statistics can be used to make aggregate predictions.

Yes, but "aggregrate prediction" is not necessarily the goal; more often the
goal is to identify what influences exist among a set of variables, and the
nature of their functions. Such information can be used to construct and
test models.

... someone doing statistics is concerned with the form of the
relationship, someone doing physics or any other science is concerned
with the mechanism that gives rise to that relationship. Knowing the
empirical form of the relationship between two things is useful data, but
a scientist who discovers such a relationship should want to then find an
explanation for it.

I agree. In psychology as in other sciences, the explanation will be found
in the physical structure of the system.

It must also be remembered that statistical significance is just one test
of the meaningfulness of a result. All tests of statistical significance
make assumptions -- assumptions of independence of various
observations from each other, assumed distributions of random variables,
etc. The meaningfulness of the results depends also on the accuracy of
those assumptions. If a study gave your example result of 50.1% X /
49.9% Y, I would not consider it as demonstrating any superiority of X at
all, regardless of how high the statistical "confidence level" was,
simply because such a tiny effect is likely to be swamped by errors in
those assumptions. In general, the smaller the effect detected, the more
those assumptions must be scrutinised.

I agree.

This is routine in physics: when
someone claims to have detected a tiny, but surprising phenomenon, the
first thing they and their colleagues do is think up all sorts of
artefacts that might be responsible, and find out if they are.

Tiny, but surprising phenomena can be discovered only when one has
exceedingly tight control over the test situation, or when all the
influences on the phenomenon are accessible for measurement and can be
measured precisely. This is generally not true in behavioral research; by
comparison physicists have it easy.

Now, computing Pearson r for the set of points you observed in this process,
you discover that it is 0.886. Only one bit of information about the
position of Y given X, and yet you have established that the underlying
function is, say, cubic, with an extremely high level of confidence. Go figure.

. . . one must distinguish in your example the problem of
predicting the underlying value of Y from X, from predicting the actual
value of Y from X. In your example, you can do the former, but not
necessarily (with much precision) the latter. (The precision you get for
the latter will depend on how wiggly the cubic is relative to the size of
the random disturbance.)

Yes. The need to distinguish these two problems was the main point of my
discussion.

As a scientist, if all I had was a statistically observed cubic-like
drift in the mean of Y with X, that might suggest to me what sort of
mechanisms to look for that might be causing it.

Yes, exactly so.

If, in individual
cases, it is swamped by the random component, then the basic point of my
paper applies: no valid individual predictions.

Yes, of course!

I'd also quibble with the terminology of "underlying". The observed
value of Y simply is what it is. It may be expressible as a sum of Y1 (a
cubic function of X) and Y2 (a random variable apparently unrelated to
X). This decomposition may have some physically observable counterpart.
But to call Y1 the "underlying" value of Y seems to me to be a piece of
verbalising with no physical counterpart.

O.K., but what I had in mind is the basic measurement model, which assumes
that there is a real physical quantity to be measured, expressible in some
unit of measurement, and an observed quantity (in the same units) that is a
composite of the actual ("underlying") quantity and measurement error. For
that case I think my usage is justified.

In the more usual units for measuring these things, a 75% reduction in
noise is an (additive) increase in signal to noise ratio by 6dB. It is
certainly useful to improve the design of an amplifier to get an extra
6dB of SN ratio. However, we're then talking about population
predictions. Individual predictions -- the precise value of the output
of an amplifier for a given input at an instant of time -- are not
required by radio engineers. They want to know the expected value for
each possible input (they want that to be as linear as possible), and the
noise level (they want that to be as low as possible). Given those, the
value of the noise at any instant is unimportant, except for non-Gaussian
excursions due to lightning strikes and the like.

Yes, this is another manifestation of the issue I wished to highlight, which
is that the usefulness of a moderate correlation depends on the use to which
it will be put.

Because you are more familiar with concepts like the S/N ratio than I, I
wonder if you could tell me what relationship exists between the S/N and the
proportions of variance accounted and unaccounted for. I have a feeling
that a correlation could be reexpressed as a signal-to-noise ratio, although
information about the sign of the correlation would vanish.

As I
have noted, this does not mean that they are without scientific merit for
other purposes, which may account for such correlations being frequently
reported in the scientific literature.

I'll reserve judgement on that.

Fair enough.

Richard Kennaway, 970419.2125 BST]

Bruce Abbott (970416.1140 EST):

Correlations are most often computed in order to determine whether there is
any linear association between the variables in question. Such associations
may arise through a variety of mechanisms; one rather well-known case in
this forum is through control-system activity when variations in a
disturbance to the CV are negatively correlated with variations in the
action of the system, as in the rubber band demo.

According to what I've read here, that isn't a good example in a discussion
of low correlations. It is said that the correlations one finds in those
experiments are almost invariably in the high 90s. 98 or 99% is not unusual.

I was simply providing an example of a mechanism -- familiar to CSGnet
readers -- through which associations can arise. But now that you've
brought it up, these correlations appear because the task involved
(tracking) is one in which the size of the effect (arm/mouse/cursor
movement) is huge in comparison to the level of unknown (and unmeasured)
disturbances to the system, and all the important variables can be measured
directly (e.g., cursor movement) or correlate well with the measured
variables (e.g., cursor position and perceived cursor position, relative to
target). When one must make do with indirect measures of variables that are
not accessible with the available technology and not so well correlated with
their measures, I would not expect to see such impressive matches.

Regards,

Bruce

[From Richard Kennaway (970421.1610 BST)]

Bruce Abbott (970419.1740 EST):

Richard Kennaway, 970419.1730 BST --
Because you are more familiar with concepts like the S/N ratio than I, I
wonder if you could tell me what relationship exists between the S/N and the
proportions of variance accounted and unaccounted for. I have a feeling
that a correlation could be reexpressed as a signal-to-noise ratio, although
information about the sign of the correlation would vanish.

The square of the correlation is the proportion of variation in R
"explained" (I really can't bear to drop the scare quotes from that word in
the context of analysis of variance) by P. So if P = R + N (where N =
noise), and R and N are independent, then Var P = Var R + Var N.

If P, R, and N are normally distributed (maybe if they aren't, I haven't
checked), and c is the correlation between P and R, then c-squared is (I'm
hoping this is true) Var R/Var P.

SN power ratio in dB (call it SNdB) is 10*log-base-10(Var R/Var N).

Therefore c-squared = 1/(1 + 1/SN), where SN = 10^(SNdB/10)

Equivalently, SNdB = -10*log-base-10( 1/c^2 - 1 ).

This is zero when c = 0.707 = 1/sqrt(2). This means that the signal and the
noise have equal amplitude. P is then the sum of two independent variables
of equal variance, and from another calculation I know that for normal
distributions, such a sum correlates 1/sqrt(2) with either component, so
everything looks consistent. Here are a few more values:

    c         SNdB    Var R/Var N
    0.2     -13.80        0.042
    0.5      -4.77        0.33
    0.707     0           1
    0.866     4.77        3
    0.9       6.29        4.26
    0.95      9.66        9.26
    0.995   ~20         ~100
    0.9995  ~30        ~1000

Radio engineers are usually working in the region where signal is vastly
greater than noise. Psychologists -- those that work with low
correlations, at least -- work in the region where Var R is of the same
order of magnitude as Var N.
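
If anyone wants to play with the conversion, here is a small Python
sketch that reproduces the table above:

    # Correlation c -> signal-to-noise ratio, assuming the model above:
    # P = R + N with R and N independent, and c = corr(P, R).
    from math import log10, sqrt

    def sn_ratio(c):
        # Var R / Var N implied by the correlation c
        return c * c / (1.0 - c * c)

    def sn_db(c):
        # the same ratio expressed in decibels
        return 10.0 * log10(sn_ratio(c))

    for c in (0.2, 0.5, 1 / sqrt(2), 0.866, 0.9, 0.95, 0.995, 0.9995):
        print(c, round(sn_db(c), 2), round(sn_ratio(c), 3))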

···

__
\/__ Richard Kennaway, jrk@sys.uea.ac.uk, http://www.sys.uea.ac.uk/~jrk/
  \/ School of Information Systems, Univ. of East Anglia, Norwich, U.K.

[Martin Taylor 970421 12:30]

Richard Kennaway, 970419.1730 BST]

The next version of my paper might have something to say about the
mathematics of combining many poor predictors into a good predictor.
Back of an envelope calculation tells me that for a multivariate normal
distribution of variables P1...Pn and R, such that the bivariate
correlations of each of P1...Pn with each other are zero and of each with
R is 0.5, then for P1...Pn to jointly "explain" 90% of the variance of R
requires n to be at least 2 log 0.1/log 0.75 (0.1 is 1 - 0.9, and 0.75 is
1 - 0.5^2). This is a tad over 16, and it doesn't guarantee that when
those 16 observations are made, that 90% of the variance will be explained,
only that it cannot happen with fewer.

Then I'd suggest you look over the back of your envelope to see whether the
mistake is in an assumption or in an algebraic operation. Bruce Abbott's
demonstration used four uncorrelated variables, each having a correlation
of 0.5 with the variable R. Those four _completely and exactly_ determined
R. No 90% correlations there, just perfect prediction. And 4 < 16.

Sorry about that, but it is so--or maybe I'm misunderstanding you.

Martin

[From Richard Kennaway, 970422.0940 BST]

Martin Taylor 970421 12:30:
[commenting on my calculation that it takes 16 independent 0.5 correlates
with R to "explain" 90% of the variance]

Then I'd suggest you look over the back of your envelope to see whether the
mistake is in an assumption or in an algebraic operation. Bruce Abbott's
demonstration used four uncorrelated variables, each having a correlation
of 0.5 with the variable R. Those four _completely and exactly_ determined
R. No 90% correlations there, just perfect prediction. And 4 < 16.

Sorry about that, but it is so--or maybe I'm misunderstanding you.

Ah. So it is. I think the picture I had in my mind in my calculation was
that each Pi would "explain" a proportion c-squared of the remaining
variance in R, after taking P1...P(i-1) into account, which is not so in
Bruce Abbott's example. Scratch the calculation.
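
(For what it's worth, the identity I should have used is that for
predictors uncorrelated with one another, the squared multiple
correlation is the sum of the individual squared correlations,
R^2 = r1^2 + ... + rn^2. In Bruce's example that is 4 x 0.25 = 1:
perfect prediction with four predictors.)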

···

__
\/__ Richard Kennaway, jrk@sys.uea.ac.uk, http://www.sys.uea.ac.uk/~jrk/
  \/ School of Information Systems, Univ. of East Anglia, Norwich, U.K.

[From Bill Powers (970422.0757 MST)]

Richard Kennaway, 970422.0940 BST] --

Trying to work with two computers via floppies is very inconvenient! I am
reminded, by your reply, of Bruce Abbott's example in which four variables
predict a function of them perfectly, even though their mutual correlations
are very low.

I didn't really understand this example when it was given, but now I do.
You have a function Y = f(X1, X2, X3, X4). There are no other variables,
known or unknown, affecting Y. Therefore Y, obviously, is perfectly
predicted by the function of the four variables, no matter what the function
is and no matter how the individual variables change.

The randomness of the four X variables is a red herring. They could vary in
any way, random or not, correlated or not, in any combination, and Y would
still be predicted perfectly by them. There is no random component in the X
variables: they are perfectly known.

Since the values of the four variables are perfectly known, this is not a
statistical problem. To make it a statistical problem you would have to say

Y(theoretical) = f(X1(actual), X2(actual), X3(actual), X4(actual)),
Y(observed) = f(X1(observed), X2(observed),
                X3(observed), X4(observed)),
  where
   Xn(observed) = Xn(actual) + rn
   rn = random variable

Now the observed value of Y would differ from the theoretically expected
value of Y, and the prediction would not be perfect unless the observations
of Xn were perfect. In the case of Y = X1 + X2 + X3 + X4 as the theoretical
relationship, the prediction error would be the square root of the sum of
squares of the individual observation errors due to rn. The total error of
observation would be larger than any individual error of measurement.
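
A quick numerical check of that error propagation, as a sketch with
illustrative numbers (not drawn from Bruce's example):

    # Y = X1 + X2 + X3 + X4, with each X observed subject to an
    # independent measurement error of s.d. sigma; the error in the
    # predicted Y then has s.d. sqrt(4)*sigma = 2*sigma.
    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma = 100_000, 0.001                 # e.g. 1 mm of error
    X_actual = rng.standard_normal((n, 4))
    X_observed = X_actual + sigma * rng.standard_normal((n, 4))

    Y_actual = X_actual.sum(axis=1)
    Y_predicted = X_observed.sum(axis=1)
    print(np.std(Y_predicted - Y_actual))     # about 2*sigma = 0.002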

So Bruce Abbott's example is spurious.

Best,

Bill P.

[Martin Taylor 970422 14:00]

Bill Powers (970422.0757 MST)]

I'm flabbergasted, agog, amazed...call it what you will. It's awfully hard
to guess what spurious reasons you will come up with not to see what people
are trying to show you unless it conforms with your preconceptions:-(

Richard Kennaway, 970422.0940 BST] --

Trying to work with two computers via floppies is very inconvenient! I am
reminded, by your reply, of Bruce Abbott's example in which four variables
predict a function of them perfectly, even though their mutual correlations
are very low.

I didn't really understand this example when it was given, but now I do.
You have a function Y = f(X1, X2, X3, X4). There are no other variables,
known or unknown, affecting Y. Therefore Y, obviously, is perfectly
predicted by the function of the four variables, no matter what the function
is and no matter how the individual variables change.

The randomness of the four X variables is a red herring. They could vary in
any way, random or not, correlated or not, in any combination, and Y would
still be predicted perfectly by them. There is no random component in the X
variables: they are perfectly known.

Not so. Bruce is talking about an unknown underlying mechanism--unknown
to the experimenter, that is. To the experimenter, the values of X are
indeed random. God the creator of this little world made it so. The
experimenter measures the Xs, NOT knowing them beforehand.

God the creator of that little world knows how Y is generated. As for
the experimenter, he knows three things: (1) when he observes any one, two
or three of the X values, he thereby learns nothing at all about the value
of the fourth X, (2) when he observes all the X values, he predicts Y
exactly, at least to the precision of the measurement of the Xs, and
(3) when he observes any one of the X values, it correlates 0.5 with
his subsequent measure of the Y value.

Since the values of the four variables are perfectly known, ...

to God but not to the experimenter,

...this is not a
statistical problem. To make it a statistical problem you would have to say

Y(theoretical) = f(X1(actual), X2(actual), X3(actual), X4(actual)),
Y(observed) = f(X1(observed), X2(observed),
                X3(observed), X4(observed)),
  where
   Xn(observed) = Xn(actual) + rn
   rn = random variable

Now the observed value of Y would differ from the theoretically expected
value of Y, and the prediction would not be perfect unless the observations
of Xn were perfect. In the case of Y = X1 + X2 + X3 + X4 as the theoretical
relationship, the prediction error would be the square root of the sum of
squares of the individual observation errors due to rn. The total error of
observation would be larger than any individual error of measurement.

Correct.

For example: X1 = 3m +- 1mm, X2 = 2m +- 1mm, X3 = 4m +- 1mm, X4 = 1m +- 1mm

Y = 10 m +- 2mm. The X's are known to within about 1 part in 2.5 thousand
on average. Y is known to one part in 5 thousand. Not bad, and a little
better than the Xs individually. I don't know what the correlation between
Y and the sum of Xs would be in this system (measured with a precision
easily achieved with a ruler), but I'll bet it's over .995. If the Xs
are known to be positive, as is often the case, then Y will always be
known to better relative precision than the Xs individually.

So Bruce Abbott's example is spurious.

No, I think the spuriosity lies elsewhere:-)

Bruce's example is right on the mark.

Martin

[From Bill Powers (970422.1542 MST)]

Martin Taylor 970422 14:00]--

I'm flabbergasted, agog, amazed...call it what you will. It's awfully hard
to guess what spurious reasons you will come up with not to see what
people are trying to show you unless it conforms with your preconceptions:-(

Nice bit of projection.

You have a function Y = f(X1, X2, X3, X4). There are no other variables,
known or unknown, affecting Y. Therefore Y, obviously, is perfectly
predicted by the function of the four variables, no matter what the
function is and no matter how the individual variables change.
The randomness of the four X variables is a red herring. They could vary
in any way, random or not, correlated or not, in any combination, and Y
would still be predicted perfectly by them. There is no random component
in the X variables: they are perfectly known.

Not so. Bruce is talking about an unknown underlying mechanism--unknown
to the experimenter, that is. To the experimenter, the values of X are
indeed random. God the creator of this little world made it so. The
experimenter measures the Xs, NOT knowing them beforehand.

I think Richard Kennaway dealt with this interpretation very well. You're
saying that the experimenter measured the X's, and just happened to combine
them so that Y was predicted perfectly. If the law (known only to God) had
been Y = aX1 + bX2 + cX3 + dX4, the experimenter would have had to guess at
a, b, c, and d in order to get the perfect prediction. His guess that Y =
sum of X's would not predict very well at all.

God the creator of that little world knows how Y is generated. As for
the experimenter, he knows three things: (1) when he observes any one, two
or three of the X values, he thereby learns nothing at all about the value
of the fourth X, (2) when he observes all the X values, he predicts Y
exactly, at least to the precision of the measurement of the Xs, and
(3) when he observes any one of the X values, it correlates 0.5 with
his subsequent measure of the Y value.

Your scenario assumes (1) that the experimenter measures the _true_ values
of the variables, with no noise, and (2) that the experimenter combines them
as God combines them to produce Y. If these conditions hold, then it makes
no difference that the individual variables happen to vary with a normal
distribution, or that the experimenter is unable to predict their values in
advance. If you truly treat these variables as random variables, then they
have a mean and a distribution. If the true value of Y is the sum of the
true values of X, and the X's are simply normally distributed variables with
a mean of 0, then Y = 0 is the correct prediction.

Since the values of the four variables are perfectly known, ...

to God but not to the experimenter,

No, Martin, you just said that the experimenter _measures the values of the
four variables_. The values of the four variables, as soon as they are
measured, are known to the experimenter.

However, if the experimenter does not add these values together with equal
weights of 1, the resulting prediction of Y will be wrong. So the
experimenter must also know how God intended those variables to be combined
to produce Y.

...this is not a
statistical problem. To make it a statistical problem you would have to say ...

Correct.

For example: X1 = 3m +- 1mm, X2 = 2m +- 1mm, X3 = 4m +- 1mm, X4 = 1m +- 1mm

Y = 10 m +- 2mm. The X's are known to within about 1 part in 2.5 thousand
on average. Y is known to one part in 5 thousand. Not bad, and a little
better than the Xs individually. I don't know what the correlation between
Y and the sum of Xs would be in this system (measured with a precision
easily achieved with a ruler), but I'll bet it's over .995. If the Xs
are known to be positive, as is often the case, then Y will always be
known to better relative precision than the Xs individually.

Oh, Martin, you are such a slippery character. Your assumption that the X's
are positive is, of course, necessary in order to make your conclusion true,
but it is in no way required for or implied by Bruce's illustration. Let me
give you a different numerical example, one not selected to make your
conclusion be true:

X1 = 3m +- 1mm
X2 = -2m +- 1mm
X3 = -4m +- 1 mm
X4 = 3m +- 1mm

Sum: Y = 0m +- 2 mm

Now Y is known with a relative precision less than the average relative
precision of the Xs -- considerably less!

It is true that the _relative_ error in Y will be less than the _relative_
error in X _over a large enough sample_. But if we use my analysis, with the
X's correlating 0.25 with Y, we will NOT come up with a "perfect" prediction
of Y, which was the point of Bruce's example that you seem to have wandered
off from.

So Bruce Abbott's example is spurious.

No, I think the spuriosity lies elsewhere:-)
Bruce's example is right on the mark.

Have it your way, Martin.

Best,

Bill P.

[Hans Blom, 970423]

(Richard Kennaway, 970419.1730 BST)

Sittin' on the dock of CSGNET watching the threads come and go away
is rather different to jumping in and swimming with the sharks.

Oh, how I sympathize with that :-).

Much depends on what the correlation info is _used for_. Even low
correlations may be important to indicate that two variables are
somehow related

I omitted to indicate in the abstract, that the concern of the paper
is primarily with the validity of making individual predictions
rather than aggregate ones. I think everyone on CSGNET agrees that
aggregate statistics can be used to make aggregate predictions.

I'm not sure whether that's considered important on CSGNET. One
translation of that statement is that we learn from experience. Yet
the mechanisms underlying learning are deemed of little importance in
this particular locality, it seems. Sorry, private peeve ;-).

The questions of statistics and physics are different, I think.
Much simplified: Statistics wants to know _whether_ there is a
relation, physics what the _form_ of the relation is.

No, someone doing statistics is concerned with the form of the
relationship, someone doing physics or any other science is
concerned with the mechanism that gives rise to that relationship.

You reassure me.

Knowing the empirical form of the relationship between two things is
useful data, but a scientist who discovers such a relationship
should want to then find an explanation for it. "F = ma" is more
than just an observed relationship between measurements. (Certain
philosophers of science have claimed that a scientific theory is no
more than a compact representation of the mass of experimental data;
I disagree, and as far as I know, the view has little currency
nowadays.)

Often, an "explanation" is another relationship, at a deeper level.
Ultimately, nature just is as it is -- and at any particular level
that seems to be the best explanation.

And although I don't agree with the "no more" in the statement that
"a scientific theory is no more than a compact representation of the
mass of experimental data", I do assert that this is the _practical_
value of a scientific theory. But then I'm an engineer...

But "practical" might be a better word to use in the title than
"physical".

Or you could expand on the theme of how statistical considerations
support physics or physics-like relationships in e.g. economics or
psychology. The question often appears to be either "given these
data, what is the best model?" (where "best" covers both how well the
model "explains" the data and the simplicity of the model/form of the
relationship) or "assuming this model, how well do the data support
it?" Both questions are essentially open-ended: both the data and our
imagination can support an infinity of different models. A question
that is especially relevant to me: when should we stop refining our
models because we have exhausted the information that is contained in
the data?

It must also be remembered that statistical significance is just one
test of the meaningfulness of a result. All tests of statistical
significance make assumptions -- assumptions of independence of
various observations from each other, assumed distributions of
random variables, etc. The meaningfulness of the results depends
also on the accuracy of those assumptions. If a study gave your
example result of 50.1% X / 49.9% Y, I would not consider it as
demonstrating any superiority of X at all, regardless of how high
the statistical "confidence level" was, simply because such a tiny
effect is likely to be swamped by errors in those assumptions.

Yet, many medical publications do show little more than that drug X
is "better" than drug Y (p < 0.01). I detest such studies. They
frequently depend on very inaccurate or subjective measurements, and
the result is significant only because the population was large. I
kind of automatically make the translation to the 50.1% vs 49.9%
type, at least if the size of the population is given :-). Pretty bad
science, I think.

Greetings,

Hans