correlations and usefulness

[Martin Taylor 2007.07.16.16.16]

[From Bill Powers (2007.07.16.1352 MDT)]

Anyway, what do you think of that nonlinear relationship I'm proposing between correlation and anticorrelation? I wouldn't be surprised to see I'm inventing another wheel, but it's intriguing to think we can show that a correlation of 0.7071 is no better than a coin-toss for making predictions. Could that really be true?

"No better than a coin toss" could mean a lot of things. If there are two choices, it means you have no information about which one is more probable on the next occurrence of the notionally repeated event. Clearly, if you have known data that correlate 0.7071 with the event stream of interest, and you have the known datum that relates to the next event, you do have more than zero information about which way the event will turn out. So the answer in that case is "No, it could not really be true".

Think of it another way, as the estimation of one continuous variable by measuring another.

I know this was directed at Richard, but I want to point out that the usual approach is to look at the fraction of the variance in one measure that is accounted for by looking at a different measure. A correlation of 0.7071 means that you can reduce the unestimated part of the variance to 50% of what it would have been had you not known the value of the estimator. That's pretty good, since if you had just one other independent variable that was unrelated to the first, and it also correlated 0.7071 with the one you are interested in, you would have accounted for ALL the variance -- in other words, by having these two estimator variables, you could precisely determine the value of the variable of real interest.
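This construction can be checked numerically. The sketch below (plain Python, made-up random data) builds X from two independent Gaussian variables A and B so that each correlates 1/sqrt(2) = 0.7071 with X; knowing A alone leaves about half of X's variance unestimated, while knowing both A and B leaves essentially none:

```python
import math
import random
import statistics

random.seed(0)
n = 50_000

# Hypothetical construction: A and B are independent standard normals,
# and X is built so that each correlates 1/sqrt(2) ~ 0.7071 with X.
A = [random.gauss(0, 1) for _ in range(n)]
B = [random.gauss(0, 1) for _ in range(n)]
X = [(a + b) / math.sqrt(2) for a, b in zip(A, B)]

def corr(u, v):
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    su, sv = statistics.pstdev(u), statistics.pstdev(v)
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / (len(u) * su * sv)

r_ax = corr(A, X)
print(round(r_ax, 2))  # close to 0.71

# Knowing A alone removes r^2 = 50% of X's variance ...
resid = [x - r_ax * a for a, x in zip(A, X)]  # regression slope ~ r (unit variances)
print(round(statistics.pvariance(resid) / statistics.pvariance(X), 2))  # ~0.5

# ... while knowing both A and B accounts for all of it.
resid2 = [x - (a + b) / math.sqrt(2) for a, b, x in zip(A, B, X)]
print(max(abs(e) for e in resid2))  # essentially zero
```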

If that makes no sense to you in the abstract, look at it from the vector point of view. If variables A and X are correlated 0.7071, they are at 45 degrees. If B and X are also correlated 0.7071 AND B and A are correlated zero, then A, B, and X define a plane in which X is the hypotenuse of the right-angled triangle whose sides are A and B.

If A and B are correlated, then of course A B and X don't lie in a plane, and there is still variance unaccounted for; you would need at least one more estimator to give you a full description of X. The length of the projection of X onto the direction orthogonal to both A and B tells you how much you still don't know about X if you know A and B.
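The geometric picture above can be made concrete: the correlation between two data series is exactly the cosine of the angle between their mean-centered vectors, so r = 0.7071 corresponds to a 45-degree angle. A small sketch (the numbers are toy data, purely illustrative):

```python
import math

def centered(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def cosine(u, v):
    # cos(angle) = u.v / (|u| |v|)
    dot = sum(ui * vi for ui, vi in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Correlation of two toy series = cosine of the angle between
# their centered vectors (the data values here are made up).
a = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.1, 2.9, 4.2, 4.8, 6.0]
r = cosine(centered(a), centered(x))
print(round(math.degrees(math.acos(r)), 1))  # small angle: a and x track closely

# And a correlation of 1/sqrt(2) is exactly a 45-degree angle:
print(round(math.degrees(math.acos(1 / math.sqrt(2))), 6))  # 45.0
```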

"Proportion of variance accounted for" is, in my book, a pretty good index of how useful one variable is for estimating another.

Martin

[From Bill Powers (2007.07.16.1805 MDT)]

Martin Taylor 2007.07.16.16.16 –

I’m definitely not defending my proposed view; just trying to encourage a serious look at it.

Your interpretation involves a particular verbal way of interpreting the
results. I’m offering a different one. Or am I?

If that makes no
sense to you in the abstract, look at it from the vector point of view.
If variable A and X are correlated 0.7071, they are at 45 degrees. If B
and X are also correlated 0.7071 AND B and A are correlated zero, then A,
B, and X define a plane in which X is the hypotenuse of the right-angled
triangle whose sides are A and B.

OK, that’s a place to start. Introduce a vector orthogonal to A, called
A’. If the angle between A and B is R, then the angle between A’ and B is
90 - R in degrees. This says that when r(A,B) < 0.7071, then r(A’,B) > 0.7071, so A predicts B less well than any vector at right angles to (unrelated to) A does.

Well, perhaps that’s it. If r(A,B) is less than 0.7071, this says that
factors independent of A correlate more highly with B than A does. But
that just says they contribute more to the variance than A does.

“Proportion of
variance accounted for” is, in my book, a pretty good index of how
useful one variable is for estimating another.

Except for this strange use of the term “accounted for.”

I’m about ready to give up my new “predictive value”. But this
brings us back to a question I’ve raised before and still don’t know how
to find the answer to. Given that A correlates with B, what are the
chances that an observation of A will correctly predict B within n
standard deviations of B? I think there must be a clear answer in the
basic definitions of statistical quantities, but I just don’t know enough
to find it. Maybe it’s the standard error of the estimate that Rick
suggested. If so, that’s my “predictive value” of a
correlation. Perhaps Rick will add it to his spread sheet. Incidentally,
in that attached pdf file I sent, there is a discussion of
“shrinking” correlations taken from samples so as to get a
better estimate of the true population correlation. In an example it
makes a huge difference.

Best,

Bill P.

[Martin Taylor 2007.07.16.22.49]

[From Bill Powers (2007.07.16.1805 MDT)]

Martin Taylor 2007.07.16.16.16 --

If that makes no sense to you in the abstract, look at it from the vector point of view. If variable A and X are correlated 0.7071, they are at 45 degrees. If B and X are also correlated 0.7071 AND B and A are correlated zero, then A, B, and X define a plane in which X is the hypotenuse of the right-angled triangle whose sides are A and B.

OK, that's a place to start. Introduce a vector orthogonal to A, called A'. If the angle between A and B is R, then the angle between A' and B is 90 - R in degrees. This says that when r(A,B) < 0.7071, then r(A',B) > 0.7071, so A predicts B less well than any vector at right angles to (unrelated to) A does.

Not quite. Only if A' is in the plane defined by A and B, and if B is between A and A' will that be true. However, that said, you are on the right track. If A is correlated less than 0.7071 with B, then knowing A tells you about less than half the variance of B. The rest of the world independent of A is represented by all vectors orthogonal to A, and that means there would potentially be some measure(s) independent of A that could tell you more about B than does A. It doesn't say that you can find those measures.

What usually lies behind the use of these statistics is an idea -- hypothesis, if you like -- that something about A and B is causally related. Perhaps A influences B, in the way that a reference signal (A) would influence its perceptual signal (B). In a low-gain control system the correlation between reference and perception might well be less than that between disturbance and perception; but you wouldn't say that knowing the reference value was of less than no value in guessing the value of the perceptual signal.

Well, perhaps that's it. If r(A,B) is less than 0.7071, this says that factors independent of A correlate more highly with B than A does. But that just says they contribute more to the variance than A does.

Yes, that's it.

The way I look at it is that usually in our world, interesting variables are affected by many influences. If we can find any of these influences, then we have a chance to affect the "interesting variable". Correlation doesn't demonstrate causation, but lack of correlation usually is pretty good evidence that neither variable influences the other. If you have a reliable correlation AND a plausible mechanism for the influence to act on the "interesting variable", then you can try controlling the influence variable and see if you can alter the statistics of the "interesting variable". It's like "The Test", applied without presuming the existence of a control system.

This works even if the correlation is 0.1 (meaning that the influence might affect only 1% of the variance of the "interesting variable"). Usually 1% is uninteresting, and it's not worthwhile to pursue such possible relationships further, but sometimes even such small relationships do matter if they are reliable.

"Proportion of variance accounted for" is, in my book, a pretty good index of how useful one variable is for estimating another.

Except for this strange use of the term "accounted for."

Why is that strange? I think it's precise, and pretty well in accord with everyday usage. As in financial bookkeeping, you want to be able to account for all the variance, but using any particular estimator you can account for only a proportion. Wouldn't you say the same if your books showed you could account for only $50 of every $100 that you knew had passed through your cashier's hands?

I'm about ready to give up my new "predictive value". But this brings us back to a question I've raised before and still don't know how to find the answer to. Given that A correlates with B, what are the chances that an observation of A will correctly predict B within n standard deviations of B?

Assuming Gaussian statistics and linear relationships, isn't this just the question of how much of the variance of B is accounted for by measurement of A? Or do I misunderstand the question? If A accounts for 40% of the variance of B, then the variance of B knowing A is 60% of what it is when you don't know A. You just have that much tighter a Gaussian probability distribution of B, and you can look up in tables what the probability is that B will be within n standard deviations of the expected value given the measured value of A.

I think that's what you are asking, isn't it? Get the regression of B on A, which gives you the expected value of B if you know a value of A, and use the tightened probability distribution (the variance of B not accounted for by A) to determine the probability density for any desired value of B.

Martin

[From Bill Powers (2007.07.16.2150 MDT)]

Martin Taylor 2007.07.16.22.49 –

That sounds like it, but I don’t know how to do that. Can you (or Rick or
Richard K. or Gary C.) spell it out in equations of one
syllable?

Probability density is not a self-explanatory term. What I need to know
is what fraction of predictions of B from A will be wrong by more than x%
of the value given by the regression line, for a given correlation
between B and A. Is it sufficient to know the correlation, or is other
information, like the sigmas, required? Do measurement errors figure into
the equations, too? I really need cookbook procedures here.

If we rely on a statistical fact to make predictions, just how wrong are
we likely to be? That’s the question.

I can do this Monte-Carlo-wise, if necessary. But really, the relationships we need must be well-known.
relationships we need must be well-known.

Best,

Bill P.

[Martin Taylor 2007.07.17.11.08]

[From Bill Powers (2007.07.16.2150 MDT)]

Martin Taylor 2007.07.16.22.49 --

I think that's what you are asking, isn't it? Get the regression of B on A, which gives you the expected value of B if you know a value of A, and use the tightened probability distribution (the variance of B not accounted for by A) to determine the probability density for any desired value of B.

That sounds like it, but I don't know how to do that. Can you (or Rick or Richard K. or Gary C.) spell it out in equations of one syllable?

Probability density is not a self-explanatory term.

It's what you are plotting when you draw the typical bell curve: the relative probability that the result is within epsilon of any chosen value.

What I need to know is what fraction of predictions of B from A will be wrong by more than x% of the value given by the regression line, for a given correlation between B and A. Is it sufficient to know the correlation, or is other information, like the sigmas, required? Do measurement errors figure into the equations, too? I really need cookbook procedures here.

Rather than equations, which you can look up on Wikipedia or in any statistics book, let's try to make it intuitive what you are doing, with pictures. Then you can work out for yourself what needs to be done for any situation, not just the linear and gaussian situations for which correlation and linear regression are designed.

-------Linear Gaussian situation-----------

Call the interesting variable "X". X has been measured on a lot of occasions, and the measured values have been found to be distributed in a Gaussian way (standard bell curve) around some mean value:

                   x
                 x   x
                x     x
              x        x
           x x          x x
____________________________________

This is the raw probability density distribution of X. The curve is always scaled so that its integral from plus to minus infinity is 1.0.

Your question asks "what is the probability that if I measure X once more, the value I get will be greater than C?". The answer is the integral of the part of the curve above C, which you can look up in any table of the normal distribution. In "significance" statistics, it's called the "P-value", and people quote P < .01 or P < .001, and stuff like that.

                   x
                 x   x
                x    |x
              x      |xxx
           x x       |xxxxx
_____________________C______________
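That tail integral is just the normal survival function. A short sketch (the cutoff and distribution values are illustrative):

```python
import math

def p_above(c, mean, sd):
    """Probability that a Gaussian measurement of X exceeds the cutoff C:
    the integral of the bell curve above C."""
    z = (c - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

print(round(p_above(0.0, 0.0, 1.0), 3))    # 0.5: half the curve is above the mean
print(round(p_above(1.645, 0.0, 1.0), 3))  # about 0.05, the familiar P < .05 cutoff
```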

Now you have happened, for reasons known only to yourself, to measure another variable A at the same time as you measured these many values of X. Funnily enough, you find that if A has a high value, X more often than not also has a high value, and conversely.

High value of A:

                         x
                       x   x
                     |xxxxxxx
                    x|xxxxxxxxx
                  x x|xxxxxxxxxxx
_____________________C______________

More of the distribution is above C. Remember that the curve is scaled so that the total area under it is 1.0, and the integral of the part above C is the probability you are asking about.

Low value of A:

             x
           x   x
          x     x    |
        x        x   |
     x x          x x|
_____________________C______________

The peaks of the bell curves (of the X probability density) for different values of A define some curve on a graph of X against A. We are assuming for simplicity that everything is linear, meaning the bell curves are Gaussian and the relation between A and X is a straight line. So, if we plot the graph of the peaks of the distributions of X for different values of A, we get something like:

                                x
                            x

C------------------------------------

···

                    x
                x
            x
        x

______________________________________
                A values

Each of the "x" marks on the plot actually represents an entire bell curve, which is very hard to show in ASCII.

The line in the graph is the regression line described by the y = ax + b expression you get when you compute a regression with the usual formulae.

Now you want to know what is the probability that the next measured value of X will be above C, given that we know the corresponding value of A. From the regression equation, you find the value of X corresponding to the known value of A.

From your earlier measurements you know the variance (and thus the standard deviation) of X. From the correlation, you know what the variance is around any given value of A, and thus the standard deviation: since r^2 is the proportion of variance accounted for, (1-r^2) is the proportion of variance remaining about any fixed value of A.

You know how far C is above or below the value of X expected for the given value of A, in units of the standard deviation sqrt( Var(X)*(1-r^2) ) (graph 3 or 4 for high and low values of A). You just use this value in the same way you would use it to find the probability that a raw measurement of X will be above C (the second graph).

If we rely on a statistical fact to make predictions, just how wrong are we likely to be? That's the question.

That's the question the whole field of statistics was developed to answer. One might say that at heart it's pretty well all there is to statistics.

-----------non-linear non-gaussian---------------

I can do this Monte-Carlo-wise, if necessary. But really, the relationships we need must be well-known.

Yes. They are all in tables (for linear and Gaussian conditions). It's when you go into nonlinear relationships and asymmetric and non-monotonic distributions that things get messy. Sometimes Monte Carlo methods are the best approach. But the principles I illustrated above are still appropriate.

Consider, for example, a case in which X is non-linearly related to A, in that X is likely to be high when A is high or low, but low when A is medium:

x                             x
  x                         x
     x                   x
        x             x
            x  x  x
________________________________

You can still find the distribution of probability density of X for any particular value of A, if only by measuring lots of A-X pairs; and using those distributions, you can determine how likely it is that the next measurement of X will be above C, given that the corresponding value of A is known.

You can do it if the distribution is wildly irregular. Let's imagine a checkerboard distribution: if A has a value between an odd integer and the next even integer (e.g. 3.7), then X is equally likely to be between 0 and 1, between 2 and 3, or between 4 and 5, and will never take on any other value; but if A has any other value (such as 2.5), then X is only found between 1 and 2 or between 3 and 4.

It's hard to set up equations for that kind of thing, but using the concepts above, you can still get your answer. For example, if A is measured to be 2.4 and you want to know how likely it is that X > 3.5, the procedure is to look at the distribution of X for A between 2 and 3 (X will be equally likely to be found anywhere in the ranges 1 to 2 and 3 to 4), to get the answer 0.25.
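That checkerboard logic is easy to mechanize even without equations. A sketch (assumes A >= 0; `p_x_above` is just an illustrative name):

```python
def p_x_above(a, c):
    """Checkerboard distribution described above: which unit bands X can
    occupy depends on whether A lies in an (odd, even) or (even, odd)
    integer interval; X is uniform over its allowed bands. Assumes a >= 0."""
    in_odd_even_band = int(a) % 2 == 1   # e.g. A = 3.7 lies between 3 and 4
    bands = [(0, 1), (2, 3), (4, 5)] if in_odd_even_band else [(1, 2), (3, 4)]
    # Each band carries equal probability 1/len(bands), spread uniformly.
    p = 0.0
    for lo, hi in bands:
        overlap = max(0.0, hi - max(lo, c))   # length of the band above c
        p += overlap / (hi - lo) / len(bands)
    return p

print(p_x_above(2.4, 3.5))   # the worked example in the text: 0.25
```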

-----summary--------

To recap, the procedure is to look at the distribution of X values for all the different A values, and then see how much of that distribution is above your chosen C (cut-off) value for the measured value of A associated with the next X.

When you actually have equations that can reasonably accurately describe both how X is expected to relate to A (regression), and what the variance of X is when you know A (correlation), you can use them to make your assessments. But you don't need the equations if you have other ways to estimate the distributions of X for different A values.

I hope this will help, despite not including actual equations, which I think would be more likely to obscure than to illuminate the underlying concepts.

Martin