Another thought about correlations

The question occured to me: if A and B have a correlation of rAB, and B and C have a correlation of rBC, what bounds can be placed on the correlation between A and C?

If rAB and rBC are both 1, then obviously rAC must also be 1. If rAB = rBC = sqrt(1/2), it turns out that rAC can be anywhere in the range 0..1, but cannot be negative.

I worked out a general formula: rAC must be in the range

  rAB rBC +/- sqrt( (1 - rAB^2) (1 - rBC^2) )

When rAB = rBC = r this simplifies to 2r^2 - 1 <= rAC <= 1.

Then I thought: suppose you have a chain of variables A, B, C, D... such that the correlation between every consecutive pair is r. How long does the chain have to be before the correlation between the first variable and the last is not guaranteed to be positive?

If r=1 then you can make an arbitrarily long chain and the correlation from beginning to end will always be 1. This corresponds to the notion of logical inference, where everything is simply true or false with certainty. r < 1 is a sort of fuzzy inference, and the more steps you take, the fuzzier things get, until all connection is lost with the starting point.

Suppose you want to be able to make n steps in such a chain of fuzzy inference. By n=1 I mean seeing positive correlations between A and B, and between B and C, and concluding a positive correlation between A and C. n=4 would mean getting from A to F.

Here's a table of the minimum correlation you need to perform a chain of length n.

          n r

          1 0.707104
          2 0.866024
          3 0.923878
          4 0.951054
          5 0.965923
         10 0.989819
         15 0.995182
         20 0.997204
         50 0.999523

An approximate formula for n >= 4 is r = 1 - 1.22/n(n+1).

So, unless r is at least 0.707, you can't string even two correlations together.

I can't resist a somewhat frivolous comparison with computers. A typical modern desktop or laptop has around a billion transistors being clocked a billion times a second. It can easily run continuously for a million seconds (about 12 days) without going wrong. That amounts to 10^24 individual transistor operations, every one of which works correctly. (You might not think it when you're dealing with the software, but the basic semiconductor transistor is the most reliable device ever invented. It has to be, or computers would be impossible.) Plugging n=10^24 into the above, r = 1 to 48 decimal places.

···

--
Richard Kennaway, jrk@cmp.uea.ac.uk, http://www.cmp.uea.ac.uk/~jrk/
School of Computing Sciences,
University of East Anglia, Norwich NR4 7TJ, U.K.

[Martin Taylor 2010.03.19.17.12]

The question occured to me: if A and B have a correlation of rAB, and B and C have a correlation of rBC, what bounds can be placed on the correlation between A and C?

There's a very easy way to determine this. Consider OA, OB, and OC as vectors from the origin in 3-space (3 because there are three of them). The correlation between the vectors is the cosine of the angle between them. So rAB = cos(angle AOB) and rBC = cos(angle BOC). the maximum angle between vectors OA and OC is (angle AOB + angle BOC), which means the minimum correlation between A and C is cos(cos^-1(rAB) + cos^-1(rBC). The maximum correlation between A and C is cos(cos^-1(rAB) - cos^-1(rBC)).

Then I thought: suppose you have a chain of variables A, B, C, D... such that the correlation between every consecutive pair is r. How long does the chain have to be before the correlation between the first variable and the last is not guaranteed to be positive?

Sum the angles of the successive correlations (cos rXY) until they get to 90 degrees.

If r=1 then you can make an arbitrarily long chain and the correlation from beginning to end will always be 1. This corresponds to the notion of logical inference, where everything is simply true or false with certainty. r < 1 is a sort of fuzzy inference, and the more steps you take, the fuzzier things get, until all connection is lost with the starting point.

Suppose you want to be able to make n steps in such a chain of fuzzy inference. By n=1 I mean seeing positive correlations between A and B, and between B and C, and concluding a positive correlation between A and C. n=4 would mean getting from A to F.

Here's a table of the minimum correlation you need to perform a chain of length n.

         n r

         1 0.707104
         2 0.866024
         3 0.923878
         4 0.951054
         5 0.965923
        10 0.989819
        15 0.995182
        20 0.997204
        50 0.999523

An approximate formula for n >= 4 is r = 1 - 1.22/n(n+1).

So, unless r is at least 0.707, you can't string even two correlations together.

Right. Cos^-1(0.707) = 45 degrees. For a chain of N to have a guaranteed positive correlation, the angle must average no more than 90/N degrees. If all the correlations in the chain are equal, as in Richard's table, the required correlation is cos(90/N) (or more conventionally, cos (pi/2N)).

Martin