Comparing correlation with measurement

I recently posted the following story to the LessWrong blog. It might be of interest here, although it will be mostly preaching to the converted.

Prof. Sagredo has assigned a problem to his two students Simplicio and Salviati: "X is difficult to measure accurately. Predict it in some other way."

Simplicio collects some experimental data consisting of a great many pairs (X,Y) and with high confidence finds a correlation of 0.6 between X and Y. So given the value y of Y, his best prediction for the value of X is 0.6(a/b)y, where a and b are the standard deviations of X and Y respectively. A correlation of 0.6 is generally considered pretty high in psychology and social science, especially if it's established with p=0.001 to be above, say, 0.5. So Simplicio is quite pleased with himself.
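Simplicio's prediction rule can be checked by simulation. The sketch below (variable names are mine, not from the story) generates data whose true correlation is 0.6 and applies the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X has sd a = 1; adding noise with sd 4/3 makes corr(X, Y) exactly
# 0.6 in theory, since 1/sqrt(1 + (4/3)^2) = 3/5.
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 4.0 / 3.0, n)

a, b = x.std(), y.std()
c = np.corrcoef(x, y)[0, 1]

# Best linear prediction of X from an observed value of Y:
xhat = c * (a / b) * y

print(round(c, 2))  # ≈ 0.6
```

The mean squared error of this predictor comes out near 1 - c^2 = 0.64, i.e. most of the variance of X is left unexplained even by the "best" prediction.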

Salviati instead tries to measure X, and finds a variable Z which is experimentally found to have a good chance of lying close to X. Let us suppose that the standard deviation of Z-X is 10% that of X. A measurement whose range of error is 10% of the range of the thing measured is about as bad as it could be and still be called a measurement. (One might argue that any sort of entanglement whatever is a measurement, but one would be wrong.) It's a rubber tape measure. By that standard, Salviati is doing rather badly.

In effect, Simplicio is trying to predict someone's weight from their height, while Salviati is putting them on a (rather poor) weighing machine (and both, presumably, are putting their subjects on a very expensive and accurate weighing machine to obtain their true weights).

So we are comparing a good correlation with a bad measurement. But how do they compare with each other, rather than with other correlations or other measurements? Let us suppose that the underlying reality is that Y = X + D1 and Z = X + D2, where X, D1, and D2 are normally distributed and uncorrelated (and causally unrelated, which is a stronger condition). I'm choosing the normal distribution because it's easy to calculate exact numbers, but I don't believe the conclusions would be substantially different for other distributions.

For convenience, assume the variables are normalised to all have mean zero, and let X, D1, and D2 have standard deviations 1, d1, and d2 respectively.

Z-X is D2, so d2 = 0.1. The correlation between Z and X is c(X,Z) = cov(X,Z)/(sd(X)sd(Z)) = 1/sqrt(1+d2^2) = 0.995.

The correlation between X and Y is c(X,Y) = 1/sqrt(1+d1^2) = 0.6, so d1 = 1.333.
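Both correlations follow from the same identity, applied in opposite directions. A quick sketch (helper names are mine):

```python
import math

def corr_with_noise(d):
    # corr(X, X + D) with sd(X) = 1, sd(D) = d, X and D uncorrelated:
    # cov(X, X + D) = 1 and sd(X + D) = sqrt(1 + d^2).
    return 1.0 / math.sqrt(1.0 + d * d)

def noise_sd(c):
    # Inverse: the noise sd implied by a given correlation c.
    return math.sqrt(1.0 / (c * c) - 1.0)

print(round(corr_with_noise(0.1), 3))  # 0.995 (Salviati)
print(round(noise_sd(0.6), 3))         # 1.333 (Simplicio)
```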

We immediately see something suspicious here. Even a terrible measurement yields a sky-high correlation. Or put the other way round, if you're bothering to measure correlations, your data are rubbish. Even this "good" correlation gives a signal-to-noise ratio sd(X)/sd(D1) = 1/1.333 = 0.75, less than 1. But let us proceed to calculate the mutual informations. How much do Y and Z tell you about X, separately or together?

For the bivariate normal distribution, the mutual information between variables A and B with correlation c is lg(I), where lg is the binary logarithm and I = sd(A)/sd(A|B). (The denominator here -- the standard deviation of A conditional on the value of B -- happens to be independent of the particular value of B for this distribution.) The ratio I works out to 1/sqrt(1-c^2), so the mutual information is -lg(sqrt(1-c^2)) = -0.5 lg(1-c^2).

            corr.    mut. inf.
Simplicio    0.6       0.3219
Salviati     0.995     3.3291
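Both figures in the table can be reproduced directly from the formula above (the function name is mine):

```python
import math

def mutual_info_bits(c):
    # Mutual information of a bivariate normal with correlation c,
    # in bits: -lg(sqrt(1 - c^2)) = -0.5 * lg(1 - c^2).
    return -0.5 * math.log2(1.0 - c * c)

print(round(mutual_info_bits(0.6), 4))                  # 0.3219
print(round(mutual_info_bits(1 / math.sqrt(1.01)), 4))  # 3.3291
```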

What can Simplicio do with his one third of a bit? If he tries to predict just the sign of X from the sign of Y, he will be right only 70% of the time (i.e. arccos(-c(X,Y))/pi). Salviati will be right 96.8% of the time. Salviati's estimate will even be in the right decile 89% of the time, while on that task Simplicio can hardly do better than chance. So even a "good" correlation is useless as a measurement.
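The sign-agreement figures come from the standard orthant probability for a bivariate normal. A sketch, with a Monte Carlo check of the weaker case (seed and names are mine):

```python
import math
import numpy as np

def p_same_sign(c):
    # P(sign(X) = sign(Y)) for a bivariate normal with correlation c.
    return math.acos(-c) / math.pi

print(round(p_same_sign(0.6), 3))                  # 0.705, i.e. "70%"
print(round(p_same_sign(1 / math.sqrt(1.01)), 3))  # 0.968

# Monte Carlo check of the c = 0.6 case:
rng = np.random.default_rng(1)
x = rng.normal(size=500_000)
y = x + rng.normal(scale=4.0 / 3.0, size=500_000)
emp = np.mean(np.sign(x) == np.sign(y))
```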

Simplicio and Salviati show their results to Prof. Sagredo. Simplicio can't figure out how Salviati did so much better without taking measurements on hundreds of samples. Salviati seemed to just think about the problem and come up with a contraption out of nowhere that did the job, without doing a single statistical test. "But at least," says Simplicio, "you can't throw away my 0.3219, it all adds up!" Sagredo points out that it literally does not add up. The information gained about X from Y and Z together is not 0.3219+3.3291 = 3.6510 bits. The correct result is found from the standard deviation of X conditional on both Y and Z, which is sqrt(1/(1 + 1/d1^2 + 1/d2^2)). The information gained is then lg(sqrt(1 + 1/d1^2 + 1/d2^2)) = 0.5*lg(101.5625) = 3.3331. The extra information over knowing just Z is only 0.0040 = 1/250 of a bit, because nearly all of Simplicio's information is already included in Salviati's.
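The non-additivity is easy to reproduce: for independent Gaussian observations, precisions (reciprocal variances) add, but bits do not (a sketch with my own variable names):

```python
import math

d1, d2 = 4.0 / 3.0, 0.1

# Precisions of the two observations add to the prior precision of X,
# which is 1 (since sd(X) = 1).
prec_z     = 1.0 + 1.0 / d2**2                # Z alone: 101
prec_joint = 1.0 + 1.0/d1**2 + 1.0/d2**2      # Y and Z: 101.5625

info_z     = 0.5 * math.log2(prec_z)          # 3.3291 bits
info_joint = 0.5 * math.log2(prec_joint)      # 3.3331 bits

print(round(info_joint - info_z, 4))          # 0.004 -- about 1/250 bit
```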

Sagredo tells Simplicio to go away and come up with some real data.

···

--
Richard Kennaway, jrk@cmp.uea.ac.uk, http://www.cmp.uea.ac.uk/~jrk/
School of Computing Sciences,
University of East Anglia, Norwich NR4 7TJ, U.K.

[From Rick Marken (2009.09.07.0945)]

I recently posted the following story to the LessWrong blog. It might be of interest here, although it will be mostly preaching to the converted.

I presume this is supposed to show that the correlations I presented tell us nothing about the relationship between taxation and recession, which is fine with me. Then we're back to asking why economists keep confidently asserting that increasing taxes is recessionary. Have they got a rubber ruler or something? ;-)

Your final point seems to be that data that yield low correlations are not real data. I think Bill would agree with you on this. But it's interesting that when I presented some low correlation data some time ago (apparently in about 2002), Bill thought they were sufficiently interesting that I should present them to economists. I don't like to appeal to authority to try to justify my misguided efforts at exploring the economic data, but I do think it is interesting (if I am remembering this correctly; maybe not) that Bill, who believes, along with you (and me, to some extent), that small correlations are useless, seemed to find the following data of interest.

The data are correlations between quarterly measures of private investment (Ip) and Growth (dGDP/dt) for the period 1949-2002 at different lags:

Lead/Lag   r(Ip, Growth)
  -5          -0.21
  -4          -0.27
  -3          -0.28
  -2          -0.19
  -1          -0.08
   0           0.24
   1           0.34
   2           0.36
   3           0.28

The lead (-) or lag (+) is the number of quarters by which private investment (Ip) leads or lags Growth. The correlations (second column) are based on data series of about 200 data points. These are pretty low correlations but they do have a seductive pattern; they show that increases in investment seem to follow rather than lead growth.
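One way to compute such a lagged-correlation table (a sketch of my own, not the code actually used; the sign convention for lead/lag is assumed):

```python
import numpy as np

def lagged_corrs(ip, growth, lags):
    # Negative lag: Ip leads Growth (pair Ip(t) with Growth(t + |lag|));
    # positive lag: Ip lags Growth (pair Ip(t) with Growth(t - lag)).
    n = len(ip)
    out = {}
    for k in lags:
        if k < 0:
            a, b = ip[:n + k], growth[-k:]
        elif k > 0:
            a, b = ip[k:], growth[:n - k]
        else:
            a, b = ip, growth
        out[k] = float(np.corrcoef(a, b)[0, 1])
    return out

# Synthetic check: make investment follow growth by two quarters;
# the correlation should then peak at lag +2.
rng = np.random.default_rng(0)
g = rng.normal(size=200)
ip = np.concatenate([np.zeros(2), g[:-2]]) + 0.5 * rng.normal(size=200)
r = lagged_corrs(ip, g, range(-5, 4))
print(max(r, key=r.get))  # 2
```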

Just for fun I looked at the same data (Ip versus Growth) from several different countries and found exactly the same pattern of correlations; present investment is negatively associated with subsequent growth and future investment is positively related to prior growth. While the pattern of correlations was the same for all countries, it occurred on different time scales, faster for countries with smaller GDP, slower for countries with larger ones. Of course, all this is useless since the correlations are small. Nevertheless, I would like to know what you think of these results. If you consider them completely useless (as I imagine you would) then I wonder if you have any idea how we would go about testing models of the macro economy. Is data irrelevant?

Best

Simplicio

···

On Mon, Sep 7, 2009 at 6:16 AM, Richard Kennaway jrk@cmp.uea.ac.uk wrote:



Richard S. Marken PhD
rsmarken@gmail.com
www.mindreadings.com

[From Richard Kennaway (20090907.1831 BST)]

[From Rick Marken (2009.09.07.0945)]

I recently posted the following story to the LessWrong blog. It might be of interest here, although it will be mostly preaching to the converted.

I presume this is supposed to show that the correlations I presented tell us nothing about the relationship between taxation and recession, which is fine with me.

Actually, I wasn't specifically addressing that. But since you bring it up...

The data are correlations between quarterly measures of private
investment (Ip) and Growth (dGDP/dt) for the period 1949-2002 at
different lags:

Lead/Lag   r(Ip, Growth)
  -5          -0.21
  -4          -0.27
  -3          -0.28
  -2          -0.19
  -1          -0.08
   0           0.24
   1           0.34
   2           0.36
   3           0.28

The lead (-) or lag (+) is the number of quarters by which private investment (Ip) leads or lags Growth. The correlations (second column) are based on data series of about 200 data points. These are pretty low correlations but they do have a seductive pattern; they show that increases in investment seem to follow rather than lead growth.

Seductive is the word. The best of these is 0.36. I don't have your original data, but attached is a scatterplot of 208 points (54 years * 4 quarters/year) with a correlation of 0.36. Data like this are not a sound basis for policy decisions, or for understanding what is actually happening.
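The scatterplot is easy to recreate in spirit (a sketch of my own, not the attached plot's actual data):

```python
import numpy as np

# 208 standardized points with a true correlation of 0.36.
rng = np.random.default_rng(42)
c = 0.36
x = rng.normal(size=208)
y = c * x + np.sqrt(1.0 - c**2) * rng.normal(size=208)

r = float(np.corrcoef(x, y)[0, 1])
# With only 208 points the sample correlation wanders noticeably
# around 0.36 -- which is itself part of the point.
```

Plotting x against y (e.g. with matplotlib's scatter) shows a cloud with only the faintest visible trend.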

Imagine if mathematical statistics had been developed in the Middle Ages, and scientists had applied it to the four-element theory of matter: Fire, Air, Water, and Earth. They could have come up with all sorts of measures of the fieriness, airiness, wateriness, and earthiness of different substances, and measured correlations and goodness knows what. They could even have extended their theory to a hypothetical fifth element, Ether, and made complicated statistical analyses to determine whether the ether hypothesis could explain any of the variance in their data. But it would all have been useless, because none of these things exist as fundamental elements.

I posted a few months back about the lack of connection, in both directions, between causation and correlation. Anyone can guess that growth and investment will both affect each other.

Just for fun I looked at the same data (Ip versus Growth) from several different countries and found exactly the same pattern of correlations; present investment is negatively associated with subsequent growth and future investment is positively related to prior growth. While the pattern of correlations was the same for all countries, it occurred on different time scales, faster for countries with smaller GDP, slower for countries with larger ones. Of course, all this is useless since the correlations are small. Nevertheless, I would like to know what you think of these results. If you consider them completely useless (as I imagine you would) then I wonder if you have any idea how we would go about testing models of the macro economy. Is data irrelevant?

The data need to be good enough, which these aren't, and there needs to be a model whose components correspond to measurable things. What do you conclude from these data? That the stimulus of increasing production causes the response of more investment six months later, the result of which is to reduce growth a year after that?

···


--
Richard Kennaway, jrk@cmp.uea.ac.uk, http://www.cmp.uea.ac.uk/~jrk/
School of Computing Sciences,
University of East Anglia, Norwich NR4 7TJ, U.K.

[From Bill Powers (2009.09.07.1157 MDT)]

Richard Kennaway (20090907.1831 BST) --

RM:
Lead/Lag   r(Ip, Growth)
  -5          -0.21
  -4          -0.27
  -3          -0.28
  -2          -0.19
  -1          -0.08
   0           0.24
   1           0.34
   2           0.36
   3           0.28

The lead (-) or lag (+) is the number of quarters by which private investment (Ip) leads or lags Growth. The correlations (second column) are based on data series of about 200 data points. These are pretty low correlations but they do have a seductive pattern; they show that increases in investment seem to follow rather than lead growth.

RK: Seductive is the word. The best of these is 0.36. I don't have your original data, but attached is a scatterplot of 208 points (54 years * 4 quarters/year) with a correlation of 0.36. Data like this are not a sound basis for policy decisions, or for understanding what is actually happening.

BP: Thanks for that scatter plot, Richard. Rick is right in saying I fell for the nice progression of correlations and approved it, but obviously I shouldn't have done so. It didn't occur to me then that if there are different correlations at different lags, then showing the correlations using the same data set for intermediate lags will simply interpolate the intermediate values of correlation. I think we need more of those scatter plots to look at -- just realizing how many different straight lines could be drawn through selected points shows what a vast range of different results could be represented by that correlation. That's what you were talking about with your ellipses in that paper which ought to be famous.

RM: Just for fun I looked at the same data (Ip versus Growth) from several different countries and found exactly the same pattern of correlations; present investment is negatively associated with subsequent growth and future investment is positively related to prior growth. While the pattern of correlations was the same for all countries, it occurred on different time scales, faster for countries with smaller GDP, slower for countries with larger ones. Of course, all this is useless since the correlations are small. Nevertheless, I would like to know what you think of these results. If you consider them completely useless (as I imagine you would) then I wonder if you have any idea how we would go about testing models of the macro economy. Is data irrelevant?

RK: The data need to be good enough, which these aren't, and there needs to be a model whose components correspond to measurable things.

BP: Actually, wouldn't using lagged data from different countries (one lag for each country, so each lag uses a different data set) improve matters here?

What do you conclude from these data? That the stimulus of increasing production causes the response of more investment six months later, the result of which is to reduce growth a year after that?

Ah, yes. I didn't think of that one, either. Rick is right in saying all this is useless. Statistics of this kind are really a declaration that we don't know what we're talking about.

Best,

Bill P.

[From Rick Marken (2009.09.07.1540)]

Richard Kennaway (20090907.1831 BST) --

Imagine if mathematical statistics had been developed in the Middle Ages, and scientists had applied it to the four-element theory of matter: Fire, Air, Water, and Earth. They could have come up with all sorts of measures of the fieriness, airiness, wateriness, and earthiness of different substances, and measured correlations and goodness knows what. They could even have extended their theory to a hypothetical fifth element, Ether, and made complicated statistical analyses to determine whether the ether hypothesis could explain any of the variance in their data. But it would all have been useless, because none of these things exist as fundamental elements.

I think this analogy is somewhat weak, because the variables I am looking at (unlike earthiness, etc.) are considered "good" measures of economic performance by virtually all governments in the world. The correlations I am presenting are simply descriptions of relationships between these economic measures; I am not using them to make inferences about what is "really" going on. If correlations had been around at the time of Tycho Brahe, they might have provided some hints about the nature of the solar system, based on his "good" planetary data. Of course, it would still take a model to account for the actual data, but perhaps knowing the relationships between changes in the ephemerides of the planets might have provided some hints. Maybe not.

The data need to be good enough, which these aren’t, and there needs to be a model whose components correspond to measurable things. What do you conclude from these data? That the stimulus of increasing production causes the response of more investment six months later, the result of which is to reduce growth a year after that?

I conclude only that the observed phase relationship between temporal variations in investment and growth is not what is predicted by economists. But the results are not completely inexplicable. The fact that growth seems to precede investment is consistent with the idea that producers don't invest (and thus increase production) until they see an increase in the market (demand) for their product (GDP is a measure of demand in the sense that it is proportional to the amount paid to the aggregate consumer by the aggregate producer). So the phase relationship between investment and growth is actually consistent with a control theory model of the aggregate producer.

Best

Rick

···


Richard S. Marken PhD
rsmarken@gmail.com
www.mindreadings.com