[From Bill Powers (2007.08.06.0630 MDT)]
Rick Marken (2007.08.05.2215) –
Great. Now let’s see what can be said about predictions for groups versus
predictions for individuals at 0.9 correlation between predictor and
predicted variables. We will probably find that Richard Kennaway has
already said what we’re finding, but let’s go ahead and find it
anyway.
At 0.9 correlation, it turns out that the sign of the change can be
predicted correctly 75% of the time when we divide the data into 122/9 ≈
13 groups, and 80% of the time when we use 5 groups (such as totally
unlike, somewhat unlike, don’t know, somewhat like, and exactly like). If
we use just two groups, the number of correct predictions goes up to
somewhere between 97% and 100%. This reflects what Phil Runkel said about
“fine slicing” – the uncertainties go up as the number of
slices increases. Kennaway said that at 0.9 correlation, the probability
of correct prediction of sign was about 86%, but I don’t know how that
relates to the numbers here. It looks similar.
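Kennaway's 86% looks like the quadrant probability for a bivariate normal pair with correlation r, namely P = 1/2 + arcsin(r)/pi. A quick check of that formula (assuming bivariate normality, which the data here may not satisfy; the function name is mine, not from the original analysis):

```python
import math

def sign_agreement_probability(r):
    """P(two standardized bivariate-normal variables fall on the same
    side of their means), via the quadrant formula 1/2 + arcsin(r)/pi."""
    return 0.5 + math.asin(r) / math.pi

print(round(sign_agreement_probability(0.9), 3))  # 0.856, i.e. about 86%
```

That 86% sits between the 80% for 5 groups and the 97-100% for 2 groups above, so the numbers look mutually consistent.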
A better estimate might come from aggregating the data into groups,
taking averages, and then looking at the predictions.
The log difference in income for two levels of resolution interpolates to
0.885. This says that if one country has an income 7.7 times that of
another country, the regression line predicting infant mortality is
almost certain to predict lower mortality for the first country (between
97% and 100% accuracy).
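The 7.7 ratio follows from the 0.885 figure if the log is taken base 10 (an assumption on my part; the original doesn't state the base):

```python
# A log10 income difference of 0.885 corresponds to an income ratio
# of 10**0.885.
ratio = 10 ** 0.885
print(round(ratio, 1))  # 7.7
```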
The sign can be estimated correctly 80% of the time when the predictor is
divided into 5 levels. The chances of guessing correctly three times in a
row are 51%, and five times in a row, about 33%. Or to use a context I brought
up a long time ago, if we have five facts of this same quality, and a
deduction depends on all five of them, the chances of the deduction being
factually correct are about 1 in 3. Since scientific deductions usually
depend on a lot more than five facts being true at once, our 0.9
correlation would not be very useful for reaching complex logical
conclusions.
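The compounding arithmetic above is just independent 80% chances multiplied together (assuming the five facts really are independent):

```python
# Chance that n independent predictions, each 80% likely to be right,
# all come out right at once.
p = 0.80
chain3 = p ** 3
chain5 = p ** 5
print(round(chain3, 2), round(chain5, 2))  # 0.51 0.33
```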
All these are very rough calculations and could be done better.
Would it be possible to generate artificial data sets with known
correlations and do this procedure for various levels of correlation? Can
you try this with Martin’s data sets to see what happens? I suppose
you’re getting bored with all this nitpicking of details, but I think
it’s interesting. This is sort of like checking your long division by
multiplying the divisor by the quotient to see how close you come to the
original dividend. We’re taking the regression line we get from a
statistical analysis and seeing how well it does in predicting individual
items from the original data set. We can actually count the number of
wrong predictions for each way of interpreting the results. This probably
gives us an approximate upper limit on the accuracy we can expect when we
use the same regression line and method of prediction to predict items
from new data sets.
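The artificial-data check proposed above can be sketched directly: draw pairs with a known correlation, fit a least-squares line, and count how often the line predicts the right sign. Everything here (function name, sample size, seed) is illustrative, not from the original post:

```python
import math
import random

def simulate(r, n=10000, seed=1):
    """Fraction of points whose sign the fitted regression line predicts
    correctly, for artificial data with correlation r by construction."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # y correlates with x at level r by construction
        y = r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    # least-squares slope and means
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    # count cases where the predicted and actual values agree in sign
    correct = sum(1 for x, y in zip(xs, ys)
                  if (my + slope * (x - mx)) * y > 0)
    return correct / n

print(round(simulate(0.9), 2))  # close to the theoretical 0.856
```

Running this at various correlation levels would give the curve of sign-prediction accuracy versus r that the paragraph above asks for.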
I know that this is a long way from real grown-up statistical analysis,
but it seems to tell me something I want to know.
Best.
Bill P.