[Martin Taylor 2007.07.17.11.08]
[From Bill Powers (2007.07.16.2150 MDT)]
Martin Taylor 2007.07.16.22.49 --
I think that's what you are asking, isn't it? Get the regression of B on A, which gives you the expected value of B if you know a value of A, and use the tightened probability distribution (the variance of B not accounted for by A) to determine the probability density for any desired value of B.
That sounds like it, but I don't know how to do that. Can you (or Rick or Richard K. or Gary C.) spell it out in equations of one syllable?
Probability density is not a self-explanatory term.
It's what you are plotting when you draw the typical bell curve: the relative probability that the result is within epsilon of any chosen value.
What I need to know is what fraction of predictions of B from A will be wrong by more than x% of the value given by the regression line, for a given correlation between B and A. Is it sufficient to know the correlation, or is other information, like the sigmas, required? Do measurement errors figure into the equations, too? I really need cookbook procedures here.
Rather than equations, which you can look up on Wikipedia or in any statistics book, let's try to make it intuitive what you are doing, with pictures. Then you can work out for yourself what needs to be done for any situation, not just the linear and gaussian situations for which correlation and linear regression are designed.
-------Linear Gaussian situation-----------
Call the interesting variable "X". X has been measured on a lot of occasions, and the measured values have been found to be distributed in a Gaussian way (standard bell curve) around some mean value:
              x
           x     x
         x         x
       x             x
     x                 x
____________________________________
This is the raw probability density distribution of X. The curve is always scaled so that its integral from minus infinity to plus infinity is 1.0.
Your question asks "what is the probability that if I measure X once more, the value I get will be greater than C?" The answer is the integral of the part of the curve above C, which you can look up in any table of the normal distribution. In "significance" statistics, it's called the "P-value", and people quote P < .01 or P < .001, and stuff like that.
              x
           x     x
         x       |x
       x         |xxx
     x           |xxxxx
_________________C__________________
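This upper-tail area can also be computed directly instead of looked up in a table. Here is a minimal Python sketch; the function name and the example numbers are mine, not from the discussion:

```python
import math

def p_above(c, mean, sd):
    """Probability that a Gaussian variable with the given mean and
    standard deviation exceeds the cut-off c -- the upper-tail area
    you would otherwise look up in a normal table."""
    z = (c - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# With mean 0 and sd 1, a cut-off of 1.96 leaves about 0.025 above it,
# the familiar two-tailed 5% point.
print(round(p_above(1.96, 0.0, 1.0), 3))
```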
Now you have happened, for reasons known only to yourself, to measure another variable A at the same time as you measured these many values of X. Funnily enough, you find that if A has a high value, X more often than not also has a high value, and conversely.
High value of A:
                          x
                       x     x
                    x|xxxxxxxxxxx
                 x   |xxxxxxxxxxxxxx
              x      |xxxxxxxxxxxxxxxxx
_____________________C__________________
More of the distribution is above C. Remember that the curve is scaled so that the total area under it is 1.0, and the integral of the part above C is the probability you are asking about.
Low value of A:
         x
      x     x
    x         x      |
  x             x    |
x                 x  |
_____________________C_________________
The peaks of the bell curves (of the X probability density) for different values of A define some curve on a graph of X against A. We are assuming for simplicity that everything is linear, meaning the bell curves are Gaussian and the relation between A and X is a straight line. So, if we plot the graph of the peaks of the distributions of X for different values of A, we get something like:
                                       x
                                  x
C------------------------------------------
                             ···
                        x
                   x
              x
         x
__________________________________________
                  A values
Each of the "x" marks on the plot actually represents an entire bell curve, which is very hard to show in ASCII.
The line in the graph is the regression line described by the y = ax + b expression you get when you compute a regression with the usual formulae.
Now you want to know what is the probability that the next measured value of X will be above C, given that we know the corresponding value of A. From the regression equation, you find the value of X corresponding to the known value of A.
From your earlier measurements you know the variance (and thus the standard deviation) of X. From the correlation, you know the variance of X around the regression line for any given value of A, and thus the standard deviation: since r^2 is the proportion of variance accounted for, (1-r^2) is the fraction of the variance of X that remains about any fixed value of A.
You know how far C is above or below the value of X expected for the given value of A, in units of the standard deviation sqrt( Var(X)*(1-r^2) ) (the third and fourth graphs, for high and low values of A). You just use this value in the same way you would use it to find the probability that a raw measurement of X will be above C (the second graph).
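Those two steps can be put together in a short Python sketch; the function and parameter names are mine, and it assumes the linear, Gaussian case described above:

```python
import math

def p_x_above_c_given_a(c, a, slope, intercept, sd_x, r):
    """P(X > c) given A = a, for a linear relation with Gaussian scatter.
    The regression line gives the expected X; the scatter about that
    line has standard deviation sqrt(Var(X) * (1 - r^2))."""
    expected_x = slope * a + intercept
    sd_cond = sd_x * math.sqrt(1.0 - r * r)
    z = (c - expected_x) / sd_cond
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Here the regression predicts X = 1.0 exactly at the cut-off, so the
# probability is 0.5 regardless of r.
print(p_x_above_c_given_a(c=1.0, a=2.0, slope=0.5, intercept=0.0,
                          sd_x=1.0, r=0.9))   # 0.5
```

Note how r enters only through the conditional standard deviation: with r = 0.9 the scatter shrinks to sqrt(1 - 0.81), about 0.44 of sd_x, which is how knowing A sharpens the prediction.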
If we rely on a statistical fact to make predictions, just how wrong are we likely to be? That's the question.
That's the question the whole field of statistics was developed to answer. One might say that at heart it's pretty well all there is to statistics.
-----------non-linear non-gaussian---------------
I can do this Monte-Carlo-wise, if necessary. But really, the relationships we need must be well-known.
Yes. They are all in tables (for linear and gaussian conditions). It's when you go into nonlinear relationships and asymmetric and non-monotonic distributions that things get messy. Sometimes Monte Carlo methods are the best approach. But the principles I illustrated above still are appropriate.
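Where no table applies, a Monte Carlo estimate can stand in for the integral: draw X many times for the given A and count the exceedances. This is only a sketch; the generating function is an invented nonlinear example, not anything from the discussion:

```python
import random

def mc_p_above(c, a, draw_x_given_a, n=200_000):
    """Estimate P(X > c | A = a) by drawing X many times from whatever
    process produces X for the given A, and counting how often it
    exceeds the cut-off c."""
    hits = sum(1 for _ in range(n) if draw_x_given_a(a) > c)
    return hits / n

# Invented example: X is high when A is far from 0.5, low when A is
# near it, with Gaussian scatter of sd 0.2 around that trend.
def draw_x(a):
    return 4.0 * (a - 0.5) ** 2 + random.gauss(0.0, 0.2)

random.seed(1)
# At A = 0 the trend value is 1.0, so X > 0.5 is a -2.5-sigma tail:
# the estimate comes out close to 0.994.
print(mc_p_above(0.5, 0.0, draw_x))
```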
Consider, for example, a case in which X is non-linearly related to A, in that X is likely to be high when A is high or low, but low when A is medium:
x                             x
  x                         x
    x                     x
        x    x    x    x
________________________________
You can still find the distribution of probability density of X for any particular value of A, if only by measuring lots of A-X pairs; and using those distributions, you can determine how likely it is that the next measurement of X will be above C, given that the corresponding value of A is known.
You can do it even if the distribution is wildly irregular. Let's imagine a checkerboard distribution: if A has a value between an odd integer and the next even integer (e.g. 3.7), then X is equally likely to be between 0 and 1, between 2 and 3, or between 4 and 5, and will never take on any other value; but if A has any other value (such as 2.5), then X is only found between 1 and 2 or between 3 and 4.
It's hard to set up equations for that kind of thing, but using the concepts above, you can still get your answer. For example, if A is measured to be 2.4 and you want to know how likely it is that X > 3.5, the procedure is to look at the distribution of X for A between 2 and 3 (X will be equally likely to be found anywhere in the ranges 1 to 2 and 3 to 4), to get the answer 0.25.
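That checkerboard answer is easy to check by brute force. This sketch (function name mine) just samples the distribution as described and counts:

```python
import random

def draw_x_checkerboard(a):
    """Sample X given A for the checkerboard distribution: A between an
    odd integer and the next even one puts X uniformly in 0-1, 2-3, or
    4-5; any other (positive) A puts X uniformly in 1-2 or 3-4."""
    if int(a) % 2 == 1:
        band = random.choice([(0, 1), (2, 3), (4, 5)])
    else:
        band = random.choice([(1, 2), (3, 4)])
    return random.uniform(*band)

random.seed(2)
n = 200_000
# P(X > 3.5 | A = 2.4): only the 3-4 band can exceed 3.5, and only its
# upper half does, so the exact answer is 0.5 * 0.5 = 0.25.
estimate = sum(1 for _ in range(n) if draw_x_checkerboard(2.4) > 3.5) / n
print(estimate)   # close to 0.25
```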
-----summary--------
To recap, the procedure is to look at the distribution of X values for all the different A values, and then see how much of that distribution is above your chosen C (cut-off) value for the measured value of A associated with the next X.
When you actually have equations that can reasonably accurately describe both how X is expected to relate to A (regression), and what the variance of X is when you know A (correlation), you can use them to make your assessments. But you don't need the equations if you have other ways to estimate the distributions of X for different A values.
I hope this will help, despite not including actual equations, which I think would be more likely to obscure than to illuminate the underlying concepts.
Martin