[From Bruce Abbott (2009.06.24.0835 EST)]
Richard Kennaway (2009.06.23.1507 BST)
Bruce Abbott (2009.06.23.0955 EST)
BA: I may be sticking my neck out here as I'm not sure I've completely
understood Kennaway's paper, but I believe that the prediction of sign
was based on distributions of normalized variables; their mean values
would be zero. In the limiting case of zero correlation, there would be
a 50% probability that the prediction of sign would be correct.
This would not be the case for a variable whose mean was non-zero.
RK: For the case of non-zero means, the table in my paper applies to
predicting whether Y is above or below its mean from whether X is above or
below its mean.
O.K.
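The zero-correlation limiting case can be made concrete. For jointly
normal X and Y with correlation c, the probability that Y falls on the
same side of its mean as X does is 1/2 + arcsin(c)/pi (a standard
bivariate-normal result; the sketch below is a minimal illustration,
not taken from Kennaway's table):

```python
import math

def sign_agreement_prob(c):
    """P(Y lies on the same side of its mean as X), for jointly
    normal X and Y with correlation c (standard orthant result)."""
    return 0.5 + math.asin(c) / math.pi

print(sign_agreement_prob(0.0))  # 0.5: a coin flip at zero correlation
print(sign_agreement_prob(0.5))  # about 0.67
print(sign_agreement_prob(1.0))  # 1.0: sign predicted perfectly
```

Note how slowly this climbs: even c = 0.5 gets the sign right only
about two times in three.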
Imagine a scatter plot (Y as a function of X) relating human height
versus weight. Both variables are always positive, and they tend to be
strongly positively correlated. At each height along the X-axis, there
would be a corresponding distribution of weights along the Y-axis. The
regression line fitted to these data would give the predicted value of
Y at a given X value. For each observed value of Y, we could compute
the difference between that observed value of weight and the predicted
value. This difference is called a residual. Residuals can be positive,
negative, or zero depending on whether the observed value is above,
below, or on the predicted value, respectively. The residuals tend to
get smaller as the correlation between X and Y increases, reaching zero
when the correlation is plus or minus 1.0.
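The residual computation described above can be sketched in a few lines
of plain Python (the height/weight pairs here are made up purely for
illustration):

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical height (inches) / weight (pounds) pairs.
heights = [60, 63, 66, 69, 72, 75]
weights = [115, 130, 148, 155, 175, 192]

slope, intercept = fit_line(heights, weights)
residuals = [y - (intercept + slope * x)
             for x, y in zip(heights, weights)]
# Positive residuals lie above the line, negative below;
# by construction they sum to (essentially) zero.
```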
The task here is to "correctly" predict the value of Y for a given
individual, based on that individual's height. How successful you are
depends on how you define a correct prediction. For a theoretical
distribution containing an infinite number of possible Y-values, the
chances of making a perfect prediction to an infinite number of decimal
places are essentially zero. To make the task practical, you have to
specify the range of values around the predicted value that would be
considered close enough to qualify as correct.
As the correlation between X and Y increases, the width of the
distribution of Y-values at a given value of X becomes smaller.
Consequently, any given prediction is more likely to fall within some
acceptable margin of error.
I mention all this because I'm worried that Kennaway's
prediction-of-sign result might be misinterpreted to mean that one
almost cannot predict the sign of Y from X, let alone get close to the
actual value of Y, unless the correlation is extremely high.
RK: No worries -- that is the correct interpretation, understanding sign as
the sign of the difference between Y and its mean. And you really cannot
get close to the actual value of Y, unless the correlation is extremely
high.
Yes, of course: The linear transformation doesn't change things. It might be
worth emphasizing here that we are discussing predicting an individual value
of Y, one of many values of Y that may be paired with a given value of X.
For example, if we plotted the heights and weights of many individuals on a
scatter plot, at a given height we may find many individuals of various
weights. Based on the regression line relating weight to height, we can
predict the weight of a person given his or her height, but the predicted
weight is an estimate of the mean weight of the population of individuals of
that height. If other factors besides height influence weight, then the
individual's actual weight will differ from this predicted value, more or
less, depending on the effect of those other factors. Those other, unknown
or unmeasured factors reduce the correlation between height and weight.
As you have shown so clearly, predicting an individual value is a rather
dicey affair unless the correlation between X and Y is very high. To improve
that prediction, one can seek to identify and measure those other factors
and include them in a multiple regression equation. This will raise the
correlation (in this case, the multiple R), and R-sq, the percentage of
variance in a response variable that is accounted for by variance in the
predictor variables.
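For the two-predictor case, the gain in R-sq can be computed directly
from the pairwise correlations using the standard two-predictor
formula (the correlations below are illustrative numbers, not real
height/weight data):

```python
def r_squared_two_predictors(ry1, ry2, r12):
    """Squared multiple correlation of Y on X1 and X2, from the
    pairwise correlations (standard two-predictor formula)."""
    return (ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2)

# Suppose X1 correlates 0.6 with Y, X2 correlates 0.5 with Y,
# and the two predictors correlate 0.3 with each other.
r2_one = 0.6 ** 2                                 # 0.36 from X1 alone
r2_two = r_squared_two_predictors(0.6, 0.5, 0.3)  # about 0.47
```

Adding the second predictor raises the explained variance from 36% to
about 47%; the gain shrinks as the two predictors become more
correlated with each other.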
Most of the work involving correlation is aimed at identifying factors that
can account for the variance in some response variable. For example, one can
seek to identify factors that contribute to cardio-vascular disease. Diet,
exercise, level of stored body fat, distribution of body fat, and levels of
cholesterol have been investigated and found to have some predictive value.
However, without a detailed model of how these and other factors have their
influences (directly or indirectly) within the body, one cannot say whether,
for a given individual, reducing cholesterol (or changing the HDL/LDL ratio)
will help or hurt. The regression equations only tell us that making such
changes in the whole population will improve the cardio-vascular health of
the population on average. Still, such research may help to identify factors
that need to be included in a detailed physiological model, as Phil Runkel
noted in "Casting Nets."
It would probably be clearer to speak instead of the confidence
interval (e.g., the 95% CI) around the predicted value (strictly, a
prediction interval, since the target is an individual Y). This is the
interval within which the true value of Y is expected to lie on 95% of
occasions.
RK: This is also covered in the paper. The best estimate of the value of Y
(rather than of its rank in its own distribution) is cX. To determine the
spread of errors, one must look at the standard deviation of Y, conditional
on knowing the value of X. If the unconditional standard deviation of Y is
s, the conditional s.d. is s*sqrt(1-c*c). The ratio of the former to the
latter is called the "improvement ratio", 1/sqrt(1-c*c). This is tabulated
in Table 1, and may be dispiriting reading for someone hoping to predict Y
from X from a scatterplot. For example, a correlation of 0.866 reduces the
standard deviation by a factor of only 2.
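The arithmetic behind that last figure, as a quick sketch:

```python
import math

def improvement_ratio(c):
    """Unconditional s.d. of Y divided by its s.d. conditional on
    knowing X: 1/sqrt(1 - c*c)."""
    return 1.0 / math.sqrt(1.0 - c * c)

for c in (0.5, 0.707, 0.866, 0.95, 0.99):
    print(f"c = {c}: s.d. shrinks by a factor of {improvement_ratio(c):.2f}")
```

Even a correlation of 0.95 narrows the spread by only a factor of
about 3.2.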
I'm picturing a scatter plot relating weight to height, with the regression
line fitted to the data. With a large enough data set, the points would tend
to form an oval cloud around the line. The larger the correlation, the more
the points will tend to stay close to the line, reducing the minor axis of
the oval relative to the major axis. Thus, at any given height, the
distribution of body weights will be narrower (lower standard
deviation). An average of these conditional distributions is provided
by the standard deviation of the residuals (deviations of the points
from the regression line, along the Y-axis). This standard deviation
can be compared to the
standard deviation of body weight overall, ignoring height. With a
correlation of 0.866, the standard deviation of the residuals is half
that of the body weights overall, so the weight predicted from height
is, on average, closer to the actual weight than simply predicting that
each individual has the average weight of the population as a whole.
However, unless the correlation is high, the ability to predict an
individual Y from X will remain poor.
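That factor-of-two claim is easy to check by simulation (pure Python;
the variables below are constructed to be jointly normal with
correlation 0.866, so the numbers are synthetic):

```python
import math
import random
import statistics

random.seed(1)
c = 0.866                       # target correlation
n = 20000
xs = [random.gauss(0, 1) for _ in range(n)]
# Construct Y with correlation c to X and unit variance overall.
ys = [c * x + math.sqrt(1 - c * c) * random.gauss(0, 1) for x in xs]

# Regression through the data (means are ~0, so slope alone suffices).
slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
resid_sd = statistics.pstdev([y - slope * x for x, y in zip(xs, ys)])
total_sd = statistics.pstdev(ys)
print(total_sd / resid_sd)      # close to 2, matching 1/sqrt(1 - c*c)
```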
RK: One might compare this to the effect of genuine measurements. My weight
varies over a range of about 5 pounds -- suppose a standard deviation of 2
pounds. When I weigh myself, my scales have a resolution of 0.2 pounds. If
I assume they're accurate, then that's an improvement ratio of roughly 2/0.2
= 10. Equivalent correlation: way off the end of Table 1.
Genuine measurements? I don't understand the distinction you are making
between "genuine measurements" (weighing yourself) and predictions based on
regression. The measurements entered into regression analysis can be just as
"genuine" as any.
RK: Should I dust this paper off and send it to a journal? Which one?
Because the focus of the paper is on making predictions of individual values
and not on using correlation to identify predictive variables, I would
suggest that a journal on tests and measures (psychometrics) would be
appropriate.
Bruce A.