About correlation and regression

[From Bill Powers (2009.05.22.0856 MDT)]

David Goldstein (2009.05.22.10:09 EDT) –

DG: Under these conditions, the textbook states:

r measures the slope of the regression lines and is thus the tangent of the angle between the X-axis and the regression line for the regression of y on x; and is the tangent of the angle between the Y-axis and the regression line for x on y.
BP: OK, that completes the circle of confusion. In fact, I have to wonder
if this textbook is correct. If r measures the slope of the regression
line, then all data sets showing the same correlation will also yield the
same slope of the regression line. That can’t be true.
I think that is true for only one value of correlation: 1.0. The slope of
the regression line for that correlation (normalized by the ratio of the
x and y standard deviations, as you point out) must also be 1.0. However,
it can’t be true of any correlation less than 1.0. I don’t have an
analytical proof, but the way I constructed that set of scatter plots
long ago would support my claim. I plotted y = ax + n(0.5 - random) +
b. “Random” is a Pascal function which returns a random real
number between 0 and 1, so the expression 0.5 - random generates values
between -0.5 and +0.5 with a mean of zero. Therefore no matter what value
of n is used, the regression line for a set of x,y values will have a
slope of a and an intercept of b. As n increases from zero,
however, the x-y correlation will fall from 1 toward zero. So by
demonstration, the correlation does not determine the slope of the
regression line and indeed is independent of it.
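Bill's construction is easy to reproduce. Here is a minimal sketch in Python (numpy's uniform generator stands in for the Pascal random function; the particular values of a, b, and n are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 1.0                     # true slope and intercept
x = np.linspace(0, 10, 1000)

for n in (0.0, 5.0, 20.0):
    # y = ax + n*(0.5 - random) + b: noise uniform on [-n/2, +n/2], mean zero
    y = a * x + n * (0.5 - rng.random(x.size)) + b
    slope, intercept = np.polyfit(x, y, 1)      # least-squares line
    r = np.corrcoef(x, y)[0, 1]
    print(f"n={n:5.1f}  slope={slope:5.2f}  intercept={intercept:5.2f}  r={r:6.3f}")
```

As n grows, the fitted slope stays near a and the intercept near b while r falls away from 1, which is exactly the demonstration described above.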

Since I did not take a statistics course in kindergarten, or any time
after that, I did not know that the correlation coefficient could be
interpreted as the cosine of an angle in hyperspace (though of course I
eventually learned that cosines range from -1 to 1, so I would have
realized, had the occasion arisen, that any number in that range could be
mistaken for the cosine or sine of an angle). Neither did I ever take a
course in vector or matrix algebra, or any course which might have
informed me about the possibility of interpreting any random collection
of numbers as a vector in hyperspace, or the use of the dot-product
between two such vectors to compute an angle in hyperspace between the
vectors. I know such things now, from a distance, through reading and
picking up what I could, but can’t claim any intuitive or deep
familiarity with the subject.

I still don’t know the meaning of “angle” when the term is used
to refer to the arccos of the dot product of two vectors divided by the
product of their lengths. Obviously I can visualize this in two or three
dimensions (where the angle is always in a plane), but going to higher
dimensions doesn’t work for me – I find myself still visualizing
three-dimensional relationships, probably incorrectly. More to the point,
what does this hyperangle look like when projected into the space of x
and y, in which the original data set exists? It’s that space that I’m
interested in, because it’s in that space that the slope of the
regression line exists.

All of which still leaves me wondering how the slope of the regression
line can be the correlation times the ratio of the standard deviations.
I’ve tried to work it out by going to the underlying computations in
terms of X, Y, XY, N and so on, but the algebra quickly gets out of hand
and I don’t know the tricks for simplifying it. I imagine that the
equations will eventually reduce to an extremely simple form, but I’m not
the one to accomplish that. Not so far, anyway.

Martin Taylor is terminally peeved with us kindergartners, which shows
mainly that he doesn’t know how to communicate with people who know less
than he does. “Why can’t you people,” he sings to a tune from My Fair Lady, “Be more like ME?” That’s all right, of
course, since he knows a lot and can probably come up with many useful
ideas. And sometimes he does descend to lucidity, when he admits that the
people to whom he’s teaching something actually need to have the details
filled in and explained. He does that when telling newcomers about PCT.
But I think he tires of teaching kindergartners and wishes for the
company of those to whom he doesn’t have to explain everything, and gets
annoyed because we are not that kind of people, and eventually is driven
to blaming his students for his discomfort. Understandable, but not
helpful from the students’ point of view. Not that he’s under any
particular obligation to care about that, but neither are the students
particularly obliged to defer to his wrath. If we just wait quietly for a
while, he’ll probably get over it.

Best,

Bill P.

[From Bill Powers (2009.05.22.1831 MDT)]

Rick Marken (2009.05.22.1440) –

David Goldstein (2009.05.22.10:09 EDT) –

DG: Under these conditions, the textbook states:

r measures the slope of the regression lines and is thus the tangent of the angle between the X-axis and the regression line for the regression of y on x; and is the tangent of the angle between the Y-axis and the regression line for x on y.

BP: OK, that completes the circle of confusion. In fact, I have to
wonder if this textbook is correct. If r measures the slope of the
regression line, then all data sets showing the same correlation will
also yield the same slope of the regression line. That can’t be
true.

RM: The correlation coefficient, r, is the slope of the regression line that relates standardized values of the covariates, x and y. That is, it is the slope of the regression line relating x to y in z units, where the z value of x is

z.x = (x - mean.x)/standard deviation of x

and the z value of y is

z.y = (y - mean.y)/standard deviation of y

This means r is perfectly useless as a measure of the slope relating the
variables x and y as measured in their original units. Basically, r is a
dimensionless measure of slope; so it’s really kind of misleading to even
call it a slope.

But all the z units do is remove means and change the proportions of x
and y in the plot. All this is making me very sleepy. I guess that means
that my brain isn’t quite up to the systematic algebra we need now. I
suspect that the faint gnawing sound in the background is Richard
Kennaway sharpening up his instruments, and if we could hear the
muttering from East Anglia more clearly, it would be saying “Well,
now, it can’t be that hard, let’s just see about this …”

I hope.

Best,

Bill P.

[From Bill Powers (2009.05.22.1256 MDT)]

See my answer to Rick.

[From David Goldstein (2009.05.22.13:35 EDT)]

[About Bill Powers (2009.05.22.0856 MDT)]

I am enclosing a scan of the two pages in the undergraduate textbook which may make things clearer.

Best,

Bill P.

[David Goldstein (2009.05.22.10:09 EDT)]

My undergraduate statistics textbook (Introduction to Applied Statistics by John G. Peatman) contains the same formula that Bill posted, namely:

regression coefficient for y on x = Pearson correlation for yx times the ratio of the standard deviation of y to the standard deviation of x

and

regression coefficient for x on y = Pearson correlation for xy times the ratio of the standard deviation of x to the standard deviation of y.

If the x and y variables are standardized, the means become 0 and the standard deviations become 1,

and the regression coefficient for yx = Pearson correlation for yx

and the regression coefficient for xy = Pearson correlation for xy.

Under these conditions, the textbook states:

r measures the slope of the regression lines and is thus the tangent of the angle between the X-axis and the regression line for the regression of y on x; and is the tangent of the angle between the Y-axis and the regression line for x on y.
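Both of the textbook's claims can be checked numerically. A minimal sketch, assuming numpy and an invented noisy linear data set:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 3 * x + rng.normal(size=500)                 # invented noisy linear relation

r = np.corrcoef(x, y)[0, 1]
b_yx = np.polyfit(x, y, 1)[0]                    # regression coefficient for y on x
print(np.isclose(b_yx, r * y.std() / x.std()))   # True: b_yx = r * sd(y)/sd(x)

zx = (x - x.mean()) / x.std()                    # standardized: mean 0, sd 1
zy = (y - y.mean()) / y.std()
print(np.isclose(np.polyfit(zx, zy, 1)[0], r))   # True: standardized slope = r
```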

[From David Goldstein (2009.05.22.13:35 EDT)]

[About Bill Powers (2009.05.22.0856 MDT)]

I am enclosing a scan of the two pages in the undergraduate textbook which may make things clearer.

p108 rotated.pdf (57.4 KB)

p109 rotated.pdf (43.7 KB)

···

[From Rick Marken (2009.05.22.1440)]

Bill Powers (2009.05.22.0856 MDT) –

David Goldstein (2009.05.22.10:09 EDT) –

DG: Under these conditions, the textbook states:

r measures the slope of the regression lines and is thus the tangent of the angle between the X-axis and the regression line for the regression of y on x; and is the tangent of the angle between the Y-axis and the regression line for x on y.

BP: OK, that completes the circle of confusion. In fact, I have to wonder
if this textbook is correct. If r measures the slope of the regression
line, then all data sets showing the same correlation will also yield the
same slope of the regression line. That can’t be true.

The correlation coefficient, r, is the slope of the regression line that relates standardized values of the covariates, x and y. That is, it is the slope of the regression line relating x to y in z units, where the z value of x is

z.x = (x - mean.x)/standard deviation of x

and the z value of y is

z.y = (y - mean.y)/standard deviation of y

This means r is perfectly useless as a measure of the slope relating the variables x and y as measured in their original units. Basically, r is a dimensionless measure of slope; so it’s really kind of misleading to even call it a slope.
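A small sketch of the dimensionless point (numpy, made-up numbers): rescaling x, say from volts to millivolts, rescales the fitted slope by the same factor but leaves r unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 0.5 * x + rng.normal(size=200)

slope1, r1 = np.polyfit(x, y, 1)[0], np.corrcoef(x, y)[0, 1]
slope2, r2 = np.polyfit(1000 * x, y, 1)[0], np.corrcoef(1000 * x, y)[0, 1]
print(f"slope {slope1:.4f} -> {slope2:.7f}, r {r1:.3f} -> {r2:.3f}")
```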

I’m still working on that Multiple Regression version of the test. Things are starting to look promising.

Best

Rick

···



Richard S. Marken PhD
rsmarken@gmail.com

[Martin Taylor 2009.05.22.15.45]

[From Bill Powers (2009.05.22.0856 MDT)]

David Goldstein (2009.05.22.10:09 EDT) –

DG: Under these conditions, the textbook states:

r measures the slope of the regression lines and is thus the tangent of the angle between the X-axis and the regression line for the regression of y on x; and is the tangent of the angle between the Y-axis and the regression line for x on y.

BP: OK, that completes the circle of confusion. In fact, I have to wonder if this textbook is correct. If r measures the slope of the regression line, then all data sets showing the same correlation will also yield the same slope of the regression line. That can’t be true.

You are right. It isn’t. The pages David scanned refer to z scores,
which equate the standard deviations of the two variables. In the
example below it is as though you had divided y by 3 and then compared
it with x. Actually, the regression slope is sometimes taken as the
slope after this kind of scale normalization, but to my mind it isn’t
very useful in that context, since if one is interested in the
regression, it is normally because one wants to know how much y is
likely to be affected by a certain change in x.

I think that is true for only one value of correlation: 1.0. The slope of the regression line for that correlation (normalized by the ratio of the x and y standard deviations, as you point out) must also be 1.0.

Not true. If y = 3x, the correlation between x and y is 1.0, and the
slope of the regression line is 3. If y = 3x + noise, the correlation
can be anything, but the slope of the regression line will be near 3.

Since I did not take a statistics course in kindergarten, or any time
after that, I did not know that the correlation coefficient could be
interpreted as the cosine of an angle in hyperspace (though of course I
eventually learned that cosines range from -1 to 1, so I would have
realized, had the occasion arisen, that any number in that range could
be
mistaken for the cosine or sine of an angle).

That was not what would have been in a first statistics course. What
would have been in a first statistics course is that the regression
slope is independent of the correlation. The geometric interpretation
of correlation is far from “being mistaken for” the cosine of an angle.
It IS the cosine of the angle between the vectors representing the data
points. I show a derivation of that fact below.

Neither did I ever take a
course in vector or matrix algebra, or any course which might have
informed me about the possibility of interpreting any random collection
of numbers as a vector in hyperspace,

You don’t need a course for that. You need only ordinary (visual)
common sense, of which you have a great deal. What you don’t need is to
assert with some assurance that it isn’t so, as you have done in the
past. A vector is just an ordered set of N numbers, and any ordered set
of N numbers can be treated as a vector in N-space (the ordering
corresponds to an otherwise arbitrary labelling of the N axes). Whether
it is useful to treat an ordered set of N numbers as a vector in
N-space in any particular case depends on the situation and on whether
the person doing it likes using that technique.

I still don’t know the meaning of “angle” when the term is used
to refer to the arccos of the dot product of two vectors divided by the
product of their lengths. Obviously I can visualize this in two or
three
dimensions (where the angle is always in a plane), but going to higher
dimensions doesn’t work for me – I find myself still visualizing
three-dimensional relationships, probably incorrectly.

No. The angle in question is always in the 2-space (the plane) defined
by the three points representing the origin and the two datasets. The
dimensionality of the space in which the points are described is
irrelevant. There’s nothing unusual about the concept of an angle in
hyperspace. It’s not as if you were progressing from angle to
solid-angle to some kind of hyper-angle, it’s just a simple angle in a
plane. You can have simple lines in any space of at least one
dimension, planes in any space of at least two dimensions, cubes and
other normal solids in any space of at least three dimensions. They
don’t change character just because the space is of higher
dimensionality. I think you are expecting complication where there is
none.

More to the point,
what does this hyperangle look like when projected into the space of x
and y, in which the original data set exists? It’s that space that I’m
interested in, because it’s in that space that the slope of the
regression line exists.

It doesn’t have to be projected into that space, because it’s there
initially. That’s where it is defined. The preceding has nothing
whatever to do with your next point, though…

All of which still leaves me wondering how the slope of the regression
line can be the correlation times the ratio of the standard deviations.

Try visualizing it, thinking initially of a correlation of unity. Say you have y = 3x, and you have observed the following pairs {0,0}, {1,3}, {3,9}, {1.5,4.5}, {4,12}. If you plot these numbers, x varies over a range of only 4, while y varies over a much wider range, 12. The standard deviations are also different by a factor of 3 (1.6 and 4.8, approximately). So the formula works in this case.
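Martin's figures are easy to confirm (a sketch using numpy; ddof=1 gives the sample standard deviations he quotes):

```python
import numpy as np

x = np.array([0, 1, 3, 1.5, 4])
y = 3 * x                          # the pairs {0,0}, {1,3}, {3,9}, {1.5,4.5}, {4,12}
sx, sy = x.std(ddof=1), y.std(ddof=1)
r = np.corrcoef(x, y)[0, 1]
print(sx, sy, r, r * sy / sx)      # about 1.60, 4.79, 1.0, 3.0: slope = r*sd(y)/sd(x)
```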

Now imagine adding noise equally to both variables. The range (or
standard deviation) of x is increased relatively more than is the range
(or standard deviation) of y, so the slope (and the correlation) are
decreased. Qualitatively, this is what the formula says. Visually, you
can’t prove the formula, but just by imagining adding this equal
increment of noise to both variables you can see that it does more or
less what you should expect. To prove, or to understand, the formula, try the Wikipedia article on linear regression (linked at the end of this post).
Getting back to the unrelated issue of the relation between correlation
and the angle between two vectors, think of these same data as two
points in a 5-D hyperspace (there are 5 data pairs). The data set X is
represented by the point {0, 1, 3, 1.5, 4} (the order is important),
and Y is represented by the point {0, 3, 9, 4.5, 12}. The origin is, of
course, {0, 0, 0, 0, 0}. What is the angle between the vectors that
connect the origin to the two points?

If the sides of a triangle are a, b, c, and the opposite angles are alpha, beta, gamma, then cos(alpha) = ((b^2 + c^2) - a^2)/(2*b*c). In this case, alpha is the angle between the two vectors at the origin. Their lengths are “b” and “c”, and the length of the difference vector is “a”. So,

b = sqrt(sum(xi^2)) = sqrt(0^2 + 1^2 + 3^2 + 1.5^2 + 4^2) = sqrt(28.25)
c = sqrt(sum(yi^2)) = sqrt(0^2 + 3^2 + 9^2 + 4.5^2 + 12^2) = sqrt(254.25)
a = sqrt(sum((xi-yi)^2)) = sqrt((0-0)^2 + (1-3)^2 + (3-9)^2 + (1.5-4.5)^2 + (4-12)^2) = sqrt(113)

So in this case, cos(alpha) = (sum(xi^2) + sum(yi^2) - sum((xi-yi)^2))/(2*sqrt(sum(xi^2)*sum(yi^2))) = (28.25 + 254.25 - 113)/(2*sqrt(28.25*254.25)) = 169.5/169.5 = 1.0. (The notation in straight ASCII is visually a bit cumbersome, but I hope you understand what it means.)

The formula can be simplified, since sum((xi-yi)^2) = sum(xi^2) + sum(yi^2) - 2*sum(xi*yi), which makes the numerator become simply 2*sum(xi*yi). The formula is then

cos(alpha) = sum(xi*yi)/sqrt(sum(xi^2)*sum(yi^2))
I’m not using dot products or vector multiplications here. It’s only the formula for a normal planar triangle for which one knows the lengths of the sides.

Now the question is whether this formula reduces to the usual formula for correlation of two datasets. Actually, it doesn’t, because we are usually interested in the correlation not with respect to the arbitrary origin used in the example, but with respect to an origin at the mean of the data values of each vector. There’s a good reason for this. By subtracting the mean value from every datum in a vector, the vector length from the origin to the data point (the square root of the sum of squares of its values) is minimized. The origin that does this is unique, and it is the only non-arbitrary way of locating the vector in the hyperspace. We usually don’t want the length to be arbitrarily dependent on where the origin of the space is located, though there are occasions when that is exactly what we want.

So now, where we had xi, let’s substitute xi-X (where X is sum(xi)/N).
The two points representing the above data now become x = {-1.9, -0.9, 1.1, -0.4, 2.1}, y = {-5.7, -2.7, 3.3, -1.2, 6.3}.

Using the simplified form of the formula, we now have

cos(alpha) = sum((xi-X)*(yi-Y))/sqrt(sum((xi-X)^2)*sum((yi-Y)^2))
           = sum(xi*yi - xi*Y - X*yi + X*Y)/(sqrt(sum((xi-X)^2))*sqrt(sum((yi-Y)^2)))

The numerator can be simplified by noting that sum(k*xi) = k*sum(xi) and sum(xi) = N*X. The numerator then becomes sum(xi*yi) - N*X*Y - N*X*Y + N*X*Y, which is just sum(xi*yi) - N*X*Y.

We can do the same trick with the denominator, since the “x” part of the denominator can be written as sqrt(sum(xi^2 - 2*xi*X + X^2)) = sqrt(sum(xi^2) - 2*N*X^2 + N*X^2) = sqrt(sum(xi^2) - N*X^2), and similarly for the “y” part of the denominator.

With these simplifications, we have

cos(alpha) = (sum(xi*yi) - N*X*Y)/(sqrt(sum(xi^2) - N*X^2)*sqrt(sum(yi^2) - N*Y^2))

Dividing numerator and denominator by N, and using E to represent “expected value” or “average”, we have

cos(alpha) = (E(xy) - E(x)*E(y))/(sqrt(E(x^2) - E^2(x))*sqrt(E(y^2) - E^2(y)))

(I hope I matched the brackets properly), whereas in Wikipedia one formula for correlation is given by

correl(x,y) = (E(xy) - E(x)*E(y))/(sqrt(E(x^2) - E^2(x))*sqrt(E(y^2) - E^2(y))).

So, if we set the origin so as to minimize the lengths of the two vectors from the origin to the two points representing the x and y data values, the formulae for correlation and for the cosine of the angle between the two vectors from the origin to the data points are identical.
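The whole derivation can be verified in a few lines. A sketch, assuming numpy; the perturbed y2 is invented data, added so the identity is also tested at a correlation below 1.0:

```python
import numpy as np

x = np.array([0, 1, 3, 1.5, 4])
y = np.array([0, 3, 9, 4.5, 12])

def cos_angle(u, v):
    # law-of-cosines form: ((b^2 + c^2) - a^2)/(2*b*c), with a = length of u - v
    b2, c2, a2 = (u ** 2).sum(), (v ** 2).sum(), ((u - v) ** 2).sum()
    return (b2 + c2 - a2) / (2 * np.sqrt(b2 * c2))

print(cos_angle(x, y))                          # 1.0 about the raw origin, as above
y2 = y + np.array([0.5, -1.0, 0.3, 0.8, -0.6])  # invented perturbation
xc, y2c = x - x.mean(), y2 - y2.mean()          # shift the origin to the means
print(cos_angle(xc, y2c), np.corrcoef(x, y2)[0, 1])  # equal: cosine = Pearson r
```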
I’m sorry that this derivation is visually so cumbersome. It would be
nice to be able to enter nicely typeset formulae, but I’m not sure I
can without simply turning the whole derivation into a jpeg picture,
which I’m not sure I know how to do. However, I do hope it explains why
the correlation between two datasets is the cosine of the angle between
the vectors representing them in a space of N dimensions, and that this
angle is a perfectly ordinary angle in the plane through the origin and
the two data points.
Martin

···

http://en.wikipedia.org/wiki/Linear_regression

[From Rick Marken (2009.05.22.2300)]

Bill Powers (2009.05.22.1831 MDT)]

RM: The correlation coefficient, r, is the slope of the regression line
that relates standardized values of the covariates, x and y.

But all the z units do is remove means and change the proportions of x
and y in the plot.

Hey, I’m on your side;-) I’m trying to explain why there might be confusion about r being a slope. But first I have to say that your comment about z units is not correct. z units do not “remove the mean” and they certainly don’t “change the proportions of x and y in the plot” (whatever that means). Representing a variable in z units represents the variable in terms of standard deviations from the mean. So if the average voltage used in an experiment was 10 v and the standard deviation of the voltage values used in the experiment was 5 v, then a voltage of 10 has a z value of 0; a voltage of 15 has a z value of 1; a voltage of 8.5 has a z value of -.3 (it’s .3 standard deviation units below the mean of 10 v). This is a “standard” way of representing variables because it represents variable values in terms of the number of standard deviations they are away from the mean of the values of that variable. It is also standard because a voltage that is 1 standard deviation above its mean has the same “standard” value (1.0) as an amperage that is one standard deviation above its mean (1.0), even if the voltage value that corresponds to 1 z (standard) value is 20 and the amperage value that corresponds to the same z value (1.0) is .001.

The r value is not a slope in the sense that it represents the slope of the regression line relating the variables, x and y. But it is the slope of the line when x and y are represented in standard deviation units. For example, if x is voltage in volts and y is current in amps, then variations in x will lead to variations in y that are a linear function of x; indeed, the slope of the regression line is 1/resistance measured in ohms. The correlation between x and y would be close to 1.0 but the slope would depend on what the resistance was. If the resistance was 10 ohms then the regression slope would be .1 when x is measured in volts and y in amps: y = .1x. If we represent x and y in terms of their standard deviations from their average values (averaged over the range of voltages and currents measured in the experiment) then the slope of that line would be r, close to 1.0 in this case.

There are two “slopes” involved when we talk about correlation and regression. There’s the slope of the regression line fit to the data: y = bx + a, where b is the slope. If, however, the x and y values are measured in terms of standard deviations from their mean, then r is the slope of the relationship between these variables and there is no intercept. That is, z.y = r*z.x. This latter “slope” is pretty meaningless as a slope measure. However, this ambiguity may be responsible for the confusion about r and the slope of the regression line. r is a slope (of the line relating z.x to z.y – I don’t think I would even call that a regression line) but it is not the slope of the regression line relating x to y.
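The volts-and-amps example above can be run directly (a sketch; the resistance and the noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
volts = rng.uniform(0, 20, 100)                        # x: applied voltage
amps = volts / 10 + rng.normal(scale=0.05, size=100)   # y: current, R = 10 ohms

r = np.corrcoef(volts, amps)[0, 1]
b = np.polyfit(volts, amps, 1)[0]
print(f"slope in original units: {b:.3f} (about 1/R = 0.1), r = {r:.3f}")

zv = (volts - volts.mean()) / volts.std()              # both variables in z units
za = (amps - amps.mean()) / amps.std()
print(f"slope in z units: {np.polyfit(zv, za, 1)[0]:.3f} (equals r, no intercept)")
```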

That should clear things up, eh? Or put you right back to sleep.

Love

Rick

···


Richard S. Marken PhD
rsmarken@gmail.com

[From Bill Powers (2009.05.23.0911 MDT)]

Martin Taylor 2009.05.22.15.45

I thank both you and Rick for your attempts to help me understand regression lines. There is definitely a hole somewhere in my head, because so far I am not in the least enlightened. I sense that I already have an image of what the meaning is and that this image is getting in the way. I can't see clearly what it is, but maybe, with your patient help, I can find out.

When I think "regression line" in two dimensions, I'm thinking of a relationship like y = ax + b, where we're varying x (or observing natural variations in x) and measuring the value of y for each value of x. Assuming a linear and noise-free set of measurements, every pair of values x,y will be related by the same formula with the same values for a and b. Actually we would need only two measurements with different values of x to determine a and b, and those values of a and b would then work for all further measurements of x and y.

Now suppose we have a noisy set of measurements. In that case we would make many more measurements of x and y and compute the coefficients a and b of a line that minimizes the squared difference between the measured y values and the fitted line over the whole data set. For simplicity, suppose we make 8 measurements instead of only 2. Now we have 8 values of x and 8 values of y along the two axes.

According to what I am hearing, each set of values now is to be plotted in an eight-dimensional space, not the original 2 dimensions, with the vector length
of x being the square root of the sum of the squares of the individual x measurements, and the vector length of y being computed similarly. We have to try to imagine two vectors pointing in different directions in 8-space.

At this point I see no relationship between the original x-y space and the new 8-space. Each x measurement can be seen as a vector lying in the x-axis with a length equal to the value of x. But there are eight of these measurements, and in 8-space they all point in different directions. There seems to be no way to relate the direction of the vector in 8-space to the direction of the x axis.

Let me stop here and ask whether I have made a mistake already.

Best,

Bill

[From Bill Powers (2009.05.23.11014 MDT)]

Rick Marken (2009.05.22.2300) –

Representing a variable in z
units represents the variable in terms of standard deviations from the
mean. So if the average voltage used in an experiment was 10 v and the
standard deviation of the voltage values used in the experiment was 5v
then a voltage of 10 has a z value of 0; a voltage of 15 has a z value of
1; a voltage of 8.5 has a z value of -.3 (it’s .3 standard deviation
units below the mean of 10v). This is a “standard” way of
representing variables because it represents variable values in terms of
the number standard deviations it is away from the mean of the values
of that variable.

That’s what I meant by “removing the mean.” A z value has had
the mean value of the variable subtracted out so it represents only the
deviation from the mean and doesn’t tell you what the mean was.

I am beginning to suspect that my real problem here is that I have had
the wrong idea of what a correlation is, and that this may stem from a
wrong idea of what a standard deviation is.

My basic assumption has been that a standard deviation tells how noisy a
measurement is, so a measurement with a standard deviation of zero
contains no noise, or uncertainty, at all. It did not occur to me that
repeated measurements of a variable within a range from v1 to v2
would have any standard deviation at all if every measurement were exact.
I took standard deviation to mean the unpredictable variation around a
mean. So it seems that my concept of a standard deviation corresponds to
a z value, sort of, except that I’ve thought of it as the same as the RMS
value: sqrt(sum((x[i] - xbar)^2)/N), where x[i] is a set of measurements of
the SAME VALUE of x. In fact my Math Manual defines a standard deviation
exactly that way: as the RMS value.

You say that z.y = r*z.x. Given that

z.x = (x - xbar)/sigmax and

z.y = (y - ybar)/sigmay,

we get

(y - ybar) = r*(x - xbar)*(sigmay/sigmax), or

y = r*(sigmay/sigmax)*(x - xbar) + ybar

which is exactly the equation for the regression line in my math manual.
So the correlation coefficient, I now have to accept, is

r = z.y/z.x

I’m beginning to see why you like that dot notation. Very handy. The
“z” is like a Pascal “record” and the notation after
the dot is one element of the record.
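The regression-line equation just derived, y = r*(sigmay/sigmax)*(x - xbar) + ybar, can be confirmed numerically (a sketch with numpy and invented data; polyfit supplies the least-squares line):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.5, 300)
y = 0.7 * x + rng.normal(size=300)

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)        # least-squares y = slope*x + intercept

pred_slope = r * y.std() / x.std()            # r * (sigmay/sigmax)
pred_intercept = y.mean() - pred_slope * x.mean()  # line passes through (xbar, ybar)
print(np.allclose([slope, intercept], [pred_slope, pred_intercept]))  # True
```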

In Martin Taylor’s latest post, Martin Taylor 2009.05.23.14.18, he goes
through a nice clear development of the difference between two vectors,
with the two vectors each having as many dimensions as the number of
measurements of x or y in the original space. He doesn’t quite get to the
cosine bit, but next thing to it. The difference between the vectors is a
new vector whose length is computed by an “extended Pythagorean
theorem”, extending from the end of one N-dimensional vector to the
end of the other. It turns out that the difference between the two
vectors is just SQRT(SUM((x[i] - y[i])^2)). However, that quantity is
relevant in a different discussion concerning tracking errors and
prediction errors, and takes us away from correlations and regression
lines.

It seems to me that the angle whose cosine is r is not observable. To observe it we would have to be able to perceive a plane surface in
N dimensions and see a triangle drawn in that plane, which I can’t
imagine. I wouldn’t know how to orient that plane relative to the
original x-y plane in which the two variables to be correlated are
plotted, so I wouldn’t know how to project it into the x-y plane. As far
as I can see, that angle is a name given to an intermediate result in the
calculation, and is never itself used for anything. But I await further
enlightenment; to say I don’t understand something is not to say it’s
wrong.

Now I return to my exploration of my misunderstandings.

I said that Rick says

r = z.y/z.x

That can’t be right because z.x, z.y is only a single point. The
correlation can’t be determined from a single point. Actually, what Rick
said was that z.y = r*z.x. In that relationship, r is a number that
depends on ALL of the data pairs, not just one of them. So why does the
algebra seem to say that you can determine the correlation just by
knowing the z-values of two points?

Maybe the answer is in the expansion:

r = z.y/z.x = ((y - ybar)/(x - xbar)) * (sigmax/sigmay)

That doesn’t do it, either. We still have just one x value and one y
value and that can’t be enough to determine r.

It’s beginning to look as if the original statements have something wrong
with them. It’s possible to derive the equation for the regression line
from an equation equivalent to the one just above, but in the regression
equation x and y lie on a single least-squares line, and don’t represent
the actual individual data points. In the regression equation, given x
you can compute y, and every x-y point so computed will lie on the same
line. However, in the z-statistic representation, x and y are the
individual data points which deviate from the least-squares line. And
those are the values of x and y carried into the derived regression
equation. Has a summation sign been left out somewhere?

I think it’s almost time to yell for help.
Richard???

Best,

Bill P.

[Martin Taylor 2009.05.23.23.56]

[From Bill Powers (2009.05.23.11014 MDT)]

In Martin Taylor’s latest post, Martin Taylor 2009.05.23.14.18, he goes
through a nice clear development of the difference between two vectors,
with the two vectors each having as many dimensions as the number of
measurements of x or y in the original space.

That wording makes me wary of whether you did understand. Let me put
what I said in another way. A vector is always of ONE dimension, no
matter how many dimensions are in the space within which it lies. When
you say “vectors each having as many dimensions …” I am concerned
that you may think that the vector has more than one dimension. The N
components of a vector are the projections of the vector onto the N
axes of the space, but that doesn’t change the dimensionality of the
vector, which is only one.

He doesn’t quite get to the
cosine bit, but next thing to it.

No, I had done that in the previous message [Martin Taylor
2009.05.22.15.45], to which the later one [Martin Taylor
2009.05.23.14.18] may be seen as a stage-setting foreword explaining
what seemed not to have been clear in the earlier message.

The difference between the vectors is a
new vector whose length is computed by an “extended Pythagorean
theorem”, extending from the end of one N-dimensional vector to the
end of the other. It turns out that the difference between the two
vectors is just SQRT(SUM((x[i] - y[i])^2)). However, that quantity is
relevant in a different discussion concerning tracking errors and
prediction errors, and takes us away from correlations and regression
lines.

No, it is critical in computing the correlation.

It seems to me that the angle whose cosine is r is not observable. To observe it we would have to be able to perceive a plane surface
in
N dimensions and see a triangle drawn in that plane, which I can’t
imagine.

I accept that you can’t imagine it, but I can’t imagine why you can’t.
A triangle is just a triangle. It has vertices A, B, C, sides of length
a, b, c, and angles alpha, beta, and gamma. All nine of those
parameters are simple real numbers. I’ll repeat my mantra: you are
trying to see difficulty where simplicity exists. There’s nothing
special about a triangle no matter how many dimensions are used to
specify its vertices. Once you know the lengths of its sides, which you
get from “hyper-Pythagoras”, you no longer need to think of the
hyperspace at all. Same if you know its angles. It’s just an ordinary
simple triangle. No tricks.

I wouldn’t know how to orient that plane relative to the
original x-y plane in which the two variables to be correlated are
plotted, so I wouldn’t know how to project it into the x-y plane.

I’ll repeat what I said in my last: "
Don’t think of the x-axis in this context. The separate points on the
x-axis are now the components of the x-vector. The 8 axes are “1st
measurement, 2nd measurement, 3rd measurement …, 8th measurement”,
and ALL of the x measurements are represented together in a single
point, or if you like, by a vector from the origin to that single point
(as in the diagrams above). In the diagrams above, the point in 1-space
is {3.5}, the point in 2-space is {1.2, 0.8}, and so forth. The point
in 8-space is {x1, x2, x3, …, x8}. It’s just one point."

The x-y plane you are thinking of is in a conceptually different space.
In your x-y plane, there are two axes and 8 points. Those points each
have a projection on the x axis and a projection on the y axis. Each
point represents two measurements, one of x and one of y.

When you use the hyperspace representation, that x-y plane does not
exist. It’s not that it has some strange orientation in the 8-D space.
To ask that question makes as much sense as to ask what is the colour
of a duck’s quack. The concepts are separate. In the 8-D space there is
one point that incorporates ALL the x values and only x values, and
another point that incorporates ALL the y values and only y values.
There is no point that includes both an X value and a Y value (though,
to complicate matters, one could define a 16-D space in which one point
incorporate all the values for both x and y).

As far
as I can see, that angle is a name given to an intermediate result in
the
calculation, and is never itself used for anything. But I await further
enlightenment; to say I don’t understand something is not to say it’s
wrong.

Actually, the angle can be used for quite a lot beyond simply having
its cosine represent a correlation. I don’t want to go into them until
what has been said so far has become clear, but they are certainly
useful in partitioning correlations, as we did in our exploration of
the relations among the correlation triangles in the tracking task,
which was useful in getting a notion of how the model-fit error might
be partitioned between noise and model deficiency, even for highly
accurate models.

Now I return to my exploration of my misunderstandings.

I said that Rick says

r = z.y/z.x

That can’t be right because z.x, z.y is only a single point.

As I understood Rick, it means that for a particular x value, you can
find the corresponding (expected) y value from the equation z.y = r *
z.x. It’s not intended to get r from a single x or y value.

The
correlation can’t be determined from a single point. Actually, what
Rick
said was that z.y = r*z.x. In that relationship, r is a number that
depends on ALL of the data pairs, not just one of them. So why does the
algebra seem to say that you can determine the correlation just by
knowing the z-values of two points?

It doesn’t. It says if you know r from all the data points, you can use
it to get an expected y value from an x value. Rick was just using it
to show how r could be construed as a slope if you scale the x and y
variables so as to equate their standard deviations.
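That reading can be made concrete with a simulation (a sketch; the bivariate data are invented). No individual point satisfies z.y = r*z.x, but the average z.y among points whose z.x is near a given value does:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(scale=0.8, size=100_000)  # r works out to about 0.6
r = np.corrcoef(x, y)[0, 1]

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
band = np.abs(zx - 1.0) < 0.05              # points with z.x close to 1
print(zy[band].mean(), r)                   # mean z.y is near r*1; single points scatter
```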

I think it’s almost time to yell for help.
Richard???

Does this help?

Martin

[From Bill Powers (2009.05.24.0330 MDT)]

Martin Taylor 2009.05.23.23.56 –

At the end, you ask, “does this help?” The answer is,
“Were you trying to help, or just to scold me for not seeing what is
obvious to you?”

[From Bill Powers (2009.05.23.11014 MDT)]

In Martin Taylor’s latest post, Martin Taylor 2009.05.23.14.18, he goes
through a nice clear development of the difference between two vectors,
with the two vectors each having as many dimensions as the number of
measurements of x or y in the original space.

That wording makes me wary of whether you did understand. Let me put what
I said in another way. A vector is always of ONE dimension, no matter how
many dimensions are in the space within which it lies. When you say
“vectors each having as many dimensions …” I am concerned
that you may think that the vector has more than one dimension. The N
components of a vector are the projections of the vector onto the N axes
of the space, but that doesn’t change the dimensionality of the vector,
which is only one.

What I meant was that each of the components of the vector is in its own
dimension orthogonal to all the others. The vector is the resultant of
all these individual vectors; it is inclined to each of the axes of this
space by some angle. Since all the values of the component vector in this
space are collinear in the original x-y space (aligned along one axis),
introducing this hyperspatial rendition seems irrelevant. You didn’t
explain its relevance.

He doesn’t quite get to the
cosine bit, but next thing to it.

No, I had done that in the previous message [Martin Taylor
2009.05.22.15.45], to which the later one [Martin Taylor
2009.05.23.14.18] may be seen as a stage-setting foreword explaining what
seemed not to have been clear in the earlier message.

You didn’t say that in the post. You may have thought it, but you didn’t
say it.

The difference between the
vectors is a new vector whose length is computed by an “extended
Pythagorean theorem”, extending from the end of one N-dimensional
vector to the end of the other. It turns out that the difference between
the two vectors is just SQRT(SUM((x[i] - y[i])^2)). However, that
quantity is relevant in a different discussion concerning tracking errors
and prediction errors, and takes us away from correlations and regression
lines.

No, it is critical in computing the correlation.

What, the difference between the vectors is critical? I thought the
correlation was obtained from the dot product of the two
vectors:

···

=============================================================================
WIKI:
The dot product of two vectors a = [a1, a2, … , an] and b = [b1, b2, … , bn] is defined as:

\mathbf{a}\cdot\mathbf{b} = \sum_{i=1}^n a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n
=============================================================================

Dividing that by the product of the vector lengths gives the correlation.
I don’t see the difference between the two vectors in that.

So your insisting that the difference vector is critical to computing the
correlation is not helpful, because as far as I can see, it isn’t
critical. What am I not understanding?

It seems to me that the angle
whose cosine is r is not observable. To observe it we would have to be
able to perceive a plane surface in N dimensions and see a triangle drawn
in that plane, which I can’t imagine.

I accept that you can’t imagine it, but I can’t imagine why you can’t.

That is not helpful, either. I can’t see that triangle tilted relative to
the multiple axes of this space. I can imagine seeing a triangle face-on,
but from what direction in the multiple-dimensioned space would I
have to be looking to see it that way? You claim you can see it, and if
that’s true I congratulate you, but unless you can tell me how to do
that, I don’t get much advantage from your ability to do so.

A triangle is just a triangle.
It has vertices A, B, C, sides of length a, b, c, and angles alpha, beta,
and gamma. All nine of those parameters are simple real numbers. I’ll
repeat my mantra: you are trying to see difficulty where simplicity
exists.

But a triangle also has orientations – eight axes of rotation in an
8-dimensional space, for example – as well as a location, another eight
measures. Why are you ignoring those aspects of the triangle? If I
understood that, I would be happy to ignore them, too – but I
don’t.

There’s nothing special about a
triangle no matter how many dimensions are used to specify its vertices.
Once you know the lengths of its sides, which you get from
“hyper-Pythagoras”, you no longer need to think of the
hyperspace at all. Same if you know its angles. It’s just an ordinary
simple triangle. No tricks.

But it’s not a simple plane triangle in our ordinary 3-space; it is in
some other space with more dimensions. How do you decide what is
important about all its various attributes? You say I should think about
it as a simple triangle, but you don’t say why I should do that. Or how
to do it.

I wouldn’t know how to orient
that plane relative to the original x-y plane in which the two variables
to be correlated are plotted, so I wouldn’t know how to project it into
the x-y plane.

I’ll repeat what I said in my last: " Don’t think of the x-axis in
this context. The separate points on the x-axis are now the components of
the x-vector. The 8 axes are “1st measurement, 2nd measurement, 3rd
measurement …, 8th measurement”, and ALL of the x measurements
are represented together in a single point, or if you like, by a vector
from the origin to that single point (as in the diagrams above). In the
diagrams above, the point in 1-space is {3.5}, the point in 2-space is
{1.2, 0.8}, and so forth. The point in 8-space is {x1, x2, x3, …, x8}.
It’s just one point."

But the components aren’t just one point. If one component changes
magnitude, the vector changes direction, not just length. How can a
one-dimensional object change in more than one way at a time? You tell me
not to think of the x-axis in this context, but you don’t explain why
not. You tell me to think of 8 axes, but then tell me not to consider a
vector with a length, an origin, and a direction in this space as
multi-dimensional. You say a point is “just a point,” but a
point-object in 8-space has eight degrees of freedom, not just one.
Whatever you’re trying to say, I’m not understanding you. Repeating what
you said before is not going to suddenly make me understand it this time,
or the next five times, either.

I wonder if the problem isn’t that you are using a visual analog here but
are simply ignoring properties that don’t happen to matter to you in this
case. When you say a vector has only one dimension, you must be referring
to its length and ignoring its direction and origin. That makes the
analog into a mnemonic device rather than a quantitative representation,
and it becomes unique to you, and incommunicable. I am noticing things
about this image that I’m not supposed to be noticing, but how can I keep
from doing that?

The x-y plane you are thinking
of is in a conceptually different space. In your x-y plane, there are two
axes and 8 points. Those points each have a projection on the x axis and
a projection on the y axis. Each point represents two measurements, one
of x and one of y.

When you use the hyperspace representation, that x-y plane does not
exist. It’s not that it has some strange orientation in the 8-D space. To
ask that question makes as much sense as to ask what is the colour of a
duck’s quack.

So I am being helped by being told that I have asked a nonsensical
question. Your pedagogical technique doesn’t seem to take individual
differences into account. Perhaps the truth is that I don’t have the IQ
necessary to understand you. If so, you should either learn to speak as
if to a mentally impaired person, or give up. Well, I guess you do
already speak that way, but I mean sympathetically and helpfully, and
without emphasizing my disability.

The concepts are separate.
In the 8-D space there is one point that incorporates ALL the x values
and only x values, and another point that incorporates ALL the y values
and only y values. There is no point that includes both an X value and a
Y value (though, to complicate matters, one could define a 16-D space in
which one point incorporate all the values for both x and
y).

OK, I guess what you’re telling me is that this 8-D space is like a piece
of scratch paper on which you are doing some calculations, with only the
result of the calculation being transferred back to the sheet on which
the main development is being written. Otherwise it has nothing to do
with the main development, and any properties of this space other than
those you focus on have no influence on the results.

As far as I can see, that angle
is a name given to an intermediate result in the calculation, and is
never itself used for anything. But I await further enlightenment; to say
I don’t understand something is not to say it’s
wrong.

Actually, the angle can be used
for quite a lot beyond simply having its cosine represent a correlation.
I don’t want to go into them until what has been said so far has become
clear, but they are certainly useful in partitioning correlations, as we
did in our exploration of the relations among the correlation triangles
in the tracking task, which was useful in getting a notion of how the
model-fit error might be partitioned between noise and model deficiency,
even for highly accurate models.

I will leave that to you.

Now I return to my exploration
of my misunderstandings.

I said that Rick says

r = z.y/z.x

That can’t be right because z.x, z.y is only a single point.

As I understood Rick, it means that for a particular x value, you can
find the corresponding (expected) y value from the equation z.y = r *
z.x. It’s not intended to get r from a single x or y
value.

But you can solve that algebraic expression for r:

r = z.y/z.x

That can’t possibly give the correct value of r, since z.y/z.x is not the
same for every data point.

The correlation can’t be
determined from a single point. Actually, what Rick said was that z.y =
r*z.x. In that relationship, r is a number that depends on ALL of the
data pairs, not just one of them. So why does the algebra seem to say
that you can determine the correlation just by knowing the z-values of
two points?

It doesn’t. It says if you know r from all the data points, you can use
it to get an expected y value from an x value. Rick was just using it to
show how r could be construed as a slope if you scale the x and y
variables so as to equate their standard deviations.

But there’s nothing in there about expected values. The z.x statistic is

(x - xbar)/sigmax

where x is simply one of the original data points. That point combined
with z.y is not guaranteed to lie on the regression line. Or if it is,
that constraint was left out of the definition of z.x and everybody but
me knows about it.

Best,

Bill P.

[Martin Taylor 2009.05.24.18.03]

[From Bill Powers (2009.05.24.0330 MDT)]

Martin Taylor 2009.05.23.23.56 –

At the end, you ask, “does this help?” The answer is,
“Were you trying to help, or just to scold me for not seeing what is
obvious to you?”

When I end a message, as I often do, with “Does this help” it is a
straightforward question. I want to know whether the message helped to
resolve an issue with which the person to whom I addressed the message
was struggling. The problem is that I can only guess the true nature of
the issue, and I hope that what I say does address it. I’m asking for
relevant feedback. I don’t believe I have ever used that phrase in any
other way, and nor do I intend to use it in any other way in the
future.

You were having problems understanding some questions, and I tried to
deduce how best to put the answers in a way that would help you
understand. I myself have a problem of understanding, in that I cannot
guess what I might have said that would lead to the annoyed tone of
your response.

[From Bill Powers (2009.05.23.11014 MDT)]

In Martin Taylor’s latest post, Martin Taylor 2009.05.23.14.18, he goes
through a nice clear development of the difference between two vectors,
with the two vectors each having as many dimensions as the number of
measurements of x or y in the original space.

That wording makes me wary of whether you did understand. Let me put what I said in another way. A vector is always of ONE dimension, no matter how many dimensions are in the space within which it lies. When you say “vectors each having as many dimensions …” I am concerned that you may think that the vector has more than one dimension. The N components of a vector are the projections of the vector onto the N axes of the space, but that doesn’t change the dimensionality of the vector, which is only one.

What I meant was that each of the components of the vector is in its own dimension orthogonal to all the others. The vector is the resultant of all these individual vectors; it is inclined to each of the axes of this space by some angle. Since all the values of the component vector in this space are collinear in the original x-y space (aligned along one axis), introducing this hyperspatial rendition seems irrelevant. You didn’t explain its relevance.

He doesn’t quite get to the cosine bit, but next thing to it.

No, I had done that in the previous message [Martin Taylor 2009.05.22.15.45], to which the later one [Martin Taylor 2009.05.23.14.18] may be seen as a stage-setting foreword explaining what seemed not to have been clear in the earlier message.

You didn’t say that in the post. You may have thought it, but you didn’t say it.

I knew you had very recently read the earlier post on the cosine
question, and you asked about the point in the cosine message that
seemed to you to be problematic. That seemed to me to be sufficient
reason to expect that you would see the later message as explaining
issues that to you were problematic in the cosine message. I’m sorry if
it wasn’t so evident to you.

The difference between the vectors is a new vector whose length is computed by an “extended Pythagorean theorem”, extending from the end of one N-dimensional vector to the end of the other. It turns out that the difference between the two vectors is just SQRT(SUM((x[i] - y[i])^2)). However, that quantity is relevant in a different discussion concerning tracking errors and prediction errors, and takes us away from correlations and regression lines.

No, it is critical in computing the correlation.

What, the difference between the vectors is critical? I thought the correlation was obtained from the dot product of the two vectors:

=============================================================================
WIKI:
The dot product of two vectors a = [a1, a2, … , an] and b = [b1, b2, … , bn] is defined as:

\mathbf{a}\cdot\mathbf{b} = \sum_{i=1}^n a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n
=============================================================================

Dividing that by the product of the vector lengths gives the correlation. I don’t see the difference between the two vectors in that.

So your insisting that the difference vector is critical to computing the correlation is not helpful, because as far as I can see, it isn’t critical. What am I not understanding?

That when the triangle has three sides, and you are concerned with one
of the angles, the length of the opposite side is very important.
Here’s the original post [Martin Taylor 2009.05.22.15.45] again, or at
least the relevant bit of it, with the difference vector part
highlighted to show how it appears in the dot product. I hope the
boldface gets through the various mail systems.
Getting back to the unrelated issue of the relation between correlation
and the angle between two vectors, think of these same data as two
points in a 5-D hyperspace (there are 5 data pairs). The data set X is
represented by the point {0, 1, 3, 1.5, 4} (the order is important),
and Y is represented by the point {0, 3, 9, 4.5, 12}. The origin is, of
course, {0, 0, 0, 0, 0}. What is the angle between the vectors that
connect the origin to the two points?
If the sides of a triangle are a, b, c, and the opposite angles are
alpha, beta, gamma, then cos(alpha) = ((b^2 + c^2) - a^2)/(2bc). In this
case, alpha is the angle between the two vectors at the origin. Their
lengths are “b” and “c”, and the length of the difference vector is “a”.
So,
b = sqrt(sum(xi^2)) = sqrt(0^2 + 1^2 + 3^2 + 1.5^2 + 4^2) = sqrt(28.25)
c = sqrt(sum(yi^2)) = sqrt(0^2 + 3^2 + 9^2 + 4.5^2 + 12^2) = sqrt(254.25)
a = sqrt(sum((xi - yi)^2)) = sqrt((0-0)^2 + (1-3)^2 + (3-9)^2 +
(1.5-4.5)^2 + (4-12)^2) = sqrt(113)
So in this case, cos(alpha) = (sum(xi^2) + sum(yi^2) - sum((xi-yi)^2)) /
(2*sqrt(sum(xi^2)*sum(yi^2))) = (28.25 + 254.25 - 113)/(2*sqrt(28.25*254.25))
= 169.5/169.5 = 1.0. (The notation in straight ASCII is visually a bit
cumbersome, but I hope you understand what it means.)
The formula can be simplified, since __sum((xi-yi)^2) = sum(xi^2) +
sum(yi^2) - 2*sum(xi*yi)__, which makes the numerator become simply
__2*sum(xi*yi)__. The formula is then
cos(alpha) = __sum(xi*yi)__/sqrt(sum(xi^2)*sum(yi^2))
That is exactly the Wikipedia formula. Sum((xi-yi)^2) is the square of
the length of the difference vector.
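A minimal Python sketch of the same computation may make the check
easier to reproduce (the data and formulas are exactly those above; the
variable names are mine):

    import math

    x = [0, 1, 3, 1.5, 4]    # the X data from the example above
    y = [0, 3, 9, 4.5, 12]   # the Y data (exactly 3 times x)

    b = math.sqrt(sum(xi ** 2 for xi in x))                     # length of x-vector
    c = math.sqrt(sum(yi ** 2 for yi in y))                     # length of y-vector
    a = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))  # difference vector

    # Law of cosines: cos(alpha) = (b^2 + c^2 - a^2) / (2bc)
    cos_law = (b ** 2 + c ** 2 - a ** 2) / (2 * b * c)

    # Simplified form: cos(alpha) = sum(xi*yi) / (|x| * |y|)
    cos_dot = sum(xi * yi for xi, yi in zip(x, y)) / (b * c)

    print(cos_law, cos_dot)  # both print 1.0 for these data

Both forms agree because, as shown above, the law-of-cosines numerator
collapses to twice the dot product.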

It seems to me that the angle whose cosine is r is not observable. To
observe it we would have to be able to perceive a plane surface in N
dimensions and see a triangle drawn in that plane, which I can’t
imagine.

I accept that you can’t imagine it, but I can’t imagine why you can’t.

That is not helpful, either. I can’t see that triangle tilted relative
to the multiple axes of this space. I can imagine seeing a triangle
face-on, but from what direction in the multiple-dimensioned space would
I have to be looking to see it that way? You claim you can see it, and
if that’s true I congratulate you, but unless you can tell me how to do
that, I don’t get much advantage from your ability to do so.

I don’t suppose it will be helpful to repeat my mantra that a triangle
is just a triangle, and once you have established the side lengths, you
really aren’t concerned any more with the hyperspace, at least not when
you are considering the relations among the variables, because the
hyperspace representation is of the full ordered data set and you don’t
care about that any more. I can’t imagine seeing the triangle in (for
the tracking triangulations) 3600-D space, but I can imagine seeing the
shape of the triangle, whether it is needle-pointed, near equilateral,
right-angled, or whatever. That’s all you want, once you have used the
N-dimensional locations of the vertices to establish the side lengths.

A triangle is just a triangle. It has vertices A, B, C, sides of length
a, b, c, and angles alpha, beta, and gamma. All nine of those parameters
are simple real numbers. I’ll repeat my mantra: you are trying to see
difficulty where simplicity exists.

But a triangle also has orientations – eight axes of rotation in an
8-dimensional space, for example – as well as a location, another eight
measures. Why are you ignoring those aspects of the triangle? If I
understood that, I would be happy to ignore them, too – but I don’t.

Ah, maybe I begin to see your difficulty at last. (But maybe not. I
hope I do).

Let’s get at the answer by stages. We look first at a different issue.
What are those 8 axes? What does a triangle vertex actually represent?

To answer those questions, let’s get at it a little obliquely, and
suppose that the 8 measures of x and y are 8 samples in a time-sequence
of samples, {x1, y1} taken at t1, and so forth. The eight values of x
are {x1, x2, x3, …x8}. I’ll describe four different ways of
representing those data.

  1. You could draw the x values as a sampled waveform on a graph with a
    time axis for the abscissa and x1, x2, x3… x8 as the successive
    ordinate values. On this same graph, you could draw another waveform
    for {y1, y2, y3,…y8}. You would have two curves on one graph, and to
    show the relations between them, you might look to see whether they
    both go up and down at the same times. The graph represents the
    complete data set, including its ordering.

  2. There’s another way of looking at these same data, which is to have
    a 3-D graph, on which left-right is a time axis, up-down is x value,
    and front-back is y-value. In this 3-D plot, there is one point at time
    t1, with the up-down coordinate x1 and the front-back coordinate y1.
    There’s another point for time t2, another for t3, … t8. Now we can
    take those points and draw a curve through them in 3-D space. This
    simple curve has all the same information as did the two curves in the
    2-D graph. You see how well x and y are related by looking to see
    whether the curve tends to lie near a diagonal plane, as it will if,
    say, y = ax + b + noise.

  3. Here’s a third way of viewing the same data, the one you find most
    congenial. Collapse representation 2 along the time axis, leaving only
    the up-down and front-back axes to define a graph on which all the
    corresponding x, y pairs are plotted, each y against its corresponding
    x. We call that a scatter-plot, and it loses the time information,
    since there’s nothing in the graph to show whether THIS point or THAT
    one represents data taken at time t1, t6, or t8. You could, of course,
    label them or even connect the dots with a loopy curve in the order the
    data were taken, but one usually doesn’t do this. The scatter is
    sufficient for the purpose of estimating correlation visually. This
    representation keeps the relative values of xn and yn measured at tn,
    but loses the ordering of the data pairs, which you don’t need in order
    to compute correlation.

  4. Now we introduce a fourth way of showing the data – well, not
    actually “showing”, since one can’t represent a high-dimensional space
    on paper. We keep the time information as we did in the first two
    cases, but instead of laying the time out on a line, we use it to label
    the axes of a space. Think first of a sequence of only three samples of
    x and y, taken at t1, t2, and t3. Plot the three x values so that the
    coordinate of the t1 value is left-right, the t2 value is up-down, and
    the t3 value is front-back. Now there is a point at {x1, x2, x3}.
    There’s only one point, but it represents the whole ordered time series
    of x values. The location of the point describes a time series of three
    values, just as in representation 2 each point represented not an
    x-value or a y-value, but an x-y pair of values.

4 (cont). In the 3-D space representing a time series of three samples,
we can also place a point that represents the waveform of y, y1 being
its coordinate on the left-right axis, y2 its coordinate on the up-down
axis, and y3 its coordinate on the front-back axis. Now comes the
interesting part. How does correlation between the x and y waveforms
show up in this representation, which incorporates all the temporal
information in the same way as did method 2? High correlation means
that when x is high, so is y, proportionately, and when x is low, so is
y, proportionately. Accordingly, if x and y are perfectly correlated,
and any constant difference between them has been eliminated, the
points for the origin, for x, and for y will be collinear. If there is
noise in the measurements, or if there isn’t a real linear functional
relation between them, the point that represents the y waveform will be
in a different direction from the origin than the point that represents
the x waveform. There will be a non-zero angle between the vectors from
the origin to the x-point and the y-point. In fact, as the calculations
in other messages have shown, the cosine of the angle between them is
their correlation.

4 (cont). Now we are getting near the answer to your question. Let’s
augment the time sequence so that we have 8 x-y pairs taken at times
t1, t2, …, t8. We can make a representation that extends the 3-D
space to 8-D. In this space we have one point that represents the
entire x waveform. Its coordinates are {x1, x2, x3,…x8}. We have
another point that represents the y waveform, remembering that axis 1
still represents the value at t1, axis 2 represents the value at t2,
and so forth. One can define a vector from the origin to each of these
two points, the one representing the x time series and the one
representing the y time series. One can’t visualize as a picture in
one’s mind how a vector is oriented in the 8-D space, but since
notionally it’s just a straight line, that may not matter (and it turns
out that it does not matter). Just as in the 3-D space of 3
time-samples, if the variation in the y waveform is simply proportional
to the variation in the x waveform, the vector for y will be in the
same direction from the origin as the vector for x.

4 (cont). Now think back to the relation between representations 2 and
3 above. Representation 3, the scatter plot that loses time sequence
information, is enough to show you the correlation between x and y, as
also is representation 2, the 3-D time-series graph. In fact, because
representation 3 eliminates the time variable, the correlation is
easier to see than it is in representation 2. What does the orientation
of the vector from the origin to the x-point represent? It is the time
sequence of the data. If you rotate the x-vector in the space, you
change the x-values at the different sample times. If you relabel the 8
axes, you change the ordering of the samples. Relabelling doesn’t
present a problem, because the same reordering will happen for x and
for y, so their covariation will not change. It’s less easy to show
that rotating the space doesn’t affect the relationship between x and
y, because rotating actually changes the sample values represented.
However, your question, which now is nearly answered, is why this
doesn’t matter.

4 (cont). What we are interested in is the correlation between x and y.
Previously, it has been shown that this correlation is exactly the
cosine of the angle between the x and y vectors in the space. Many
different series of values lead to the same correlation. Rotating the
two vectors in the space while keeping their relative angle constant
changes the locations of the points representing x and y. These
different locations represent quite different time series, but since we
are keeping the angle constant between the x-vector and the y-vector
when doing our rotations, the two new time series will have the same
correlation as the original x and y time-series. Since they have the
same correlation, and we are not trying to recover the actual time
series from the calculation of the angle, we can orient the triangle
how we choose. We can lay it down on the 1-2 plane if we want, or we
can imagine it where we will. It makes no difference to the
correlation, provided we maintain the shape of the triangle.

The end result of all this is that you have to worry about the triangle
orientation only if you need to keep track of the actual data samples
and their sequence, which is not the case when you are computing a
correlation. When computing the correlation, you need only to be
concerned with the lengths of the three sides of the triangle, which
suffice to define each angle of the triangle.
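As a check on the rotation argument, here is a small numerical sketch
(mine, not from the original exchange, and assuming numpy is available)
that rotates an 8-sample x and y pair rigidly and confirms that the
cosine of the angle between them is unchanged:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=8)           # an arbitrary 8-sample x waveform
    y = 3 * x + rng.normal(size=8)   # a noisy, roughly proportional y waveform

    def cos_angle(u, v):
        # cosine of the angle at the origin between vectors u and v
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # A rigid rotation of the 8-D space: any orthogonal matrix will do,
    # e.g. the Q factor of a random matrix.
    Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
    x_rot, y_rot = Q @ x, Q @ y      # two quite different time series

    print(cos_angle(x, y))           # some value
    print(cos_angle(x_rot, y_rot))   # the same value: rotation preserves the angle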

I hope that all makes sense, and that it covers the issues raised in
the following parts of your message.


Now I return to my exploration of my misunderstandings.

I said that Rick says

r = z.y/z.x

That can’t be right because z.x, z.y is only a single point.

As I understood Rick, it means that for a particular x value, you can
find the corresponding (expected) y value from the equation z.y = r *
z.x. It’s not intended to get r from a single x or y value.

But you can solve that algebraic expression for r:

r = z.y/z.x

That can’t possibly give the correct value of r, since z.y/z.x is not
the same for every data point.

Correct. If I understood Rick correctly, r is given. You aren’t
supposed to be deriving it from the values of one point, but rather,
you are supposed to be estimating an as yet unmeasured y, knowing its
corresponding x. How well that will actually represent y when it is
later measured depends on the correlation (though of course, you could
get lucky with any value of correlation, even zero!).
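A short sketch (mine, assuming numpy) of that reading: with both
variables in z-score form, the least-squares slope of z.y on z.x is r,
so r * z.x is the expected z.y, not a formula for extracting r from one
point:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 2 * x + rng.normal(size=200)   # correlated but noisy

    zx = (x - x.mean()) / x.std()      # z-score form of x
    zy = (y - y.mean()) / y.std()      # z-score form of y

    r = np.corrcoef(x, y)[0, 1]
    slope = np.polyfit(zx, zy, 1)[0]   # least-squares slope of zy on zx

    print(r, slope)   # equal up to floating-point rounding, so the
                      # expected zy for a given zx is r * zx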

I ask seriously: did this help?

Martin

[From Bill Powers (2009.05.25.1007 MDT)]

Martin Taylor 2009.05.25.10.14 –

MT: …there are more considerations than just what you mention,
considerations that ethicists pose as a generic problem with things like
vaccination. They are exemplified by the thought problem of the person
who can see that a runaway tram will kill 5 people on the track unless
he pushes one person onto the track to stop the tram (of course he could
selflessly lie on the track himself, but that’s never proposed as an
alternative).

BP: I was thinking of a simpler case not involving a conflict: it’s what
Richard Kennaway calculated and called “screening.” If you have
a predictor of performance in school, for example, but it has a low
correlation with a screening test used to determine admittance, how low
does the correlation have to be before you stop using that test? If the
purpose of the test is to protect the school’s reputation, there would
appear to be no lower limit, except that the correlation should be
positive. On the other hand, if the odds of the assessment being correct
are 55:45 but the cost of flunking out is half the benefit of
succeeding in school, from the standpoint of the person taking the test
the minimum correlation is … what?

Maybe what I’m asking is “at what correlation should the prospective
student not apply to a school using that test?”

David Goldstein sent me a little chart showing the probabilities of a
correct prediction and I carefully saved it where I can’t find it. My own
favorite table is from the chemical rubber handbook of about 50 years
ago. The probabilities are given in terms of the odds against a
deviation of more than n standard deviations. In the following, being
RIGHT means that assuming the deviation is real turns out to be correct:

Deviation more than    Odds against     Chance of
(standard dev's)       being WRONG      being RIGHT

     0.67                1.00:1.00         50%
     0.8                 1.36:1.00         57%
     0.9                 1.72:1.00         63%
     1.0                 2.15:1.00         68%
     1.5                 6.48:1.00         87%
     2.0                20.98:1.00         95%
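Those entries are the two-tailed probabilities of a standard normal
deviate; a few lines of Python (my reconstruction, not from the
handbook) regenerate the table to within rounding:

    import math

    # P(|Z| < n) for a standard normal deviate, via the error function
    def p_within(n):
        return math.erf(n / math.sqrt(2))

    for n in [0.67, 0.8, 0.9, 1.0, 1.5, 2.0]:
        p_right = p_within(n)      # chance of being RIGHT
        p_wrong = 1 - p_right      # chance a larger deviation arises by chance
        odds = p_right / p_wrong   # odds against being WRONG
        # the 0.67 row of the table is really the exact quartile 0.6745,
        # which gives odds of precisely 1.00:1.00
        print(f"{n:4}  {odds:6.2f}:1.00  {100 * p_right:3.0f}%")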

I’ll leave the actual calculations to the game theorists. Maybe the
standard deviation is all we have to care about. The above, of course,
are for an N of 1. The concern is with judgments that can be made only
once.

Best,

Bill P.


[From Bill Powers (2009.05.25.0540 MDT)]

[Martin Taylor 2009.05.24.18.03]

MT: I ask seriously: did this help?

BP: I thank you for a sincere attempt which must have cost you a lot of
time to construct. I follow the steps you go through, but still don’t see
how that route could look simpler to you than just calculating the
correlation. However, I lack a complete set of mathematical bumps, and am
not the person to question the way another reaches
understanding.

I think my basic question has been answered, the answer being that the
cosine of the angle between the vectors representing x and y values has
nothing directly to do with the angle representing the slope of the
regression line relative to the x axis. The correlation may be the slope
of the regression line when the data are converted to z form, but since
that introduces the ratio of the standard deviations, something like the
correlation is hidden in there somehow.

My original confusion as to just what angle is related to the correlation
is also cleared up, I think. It’s not the angle of the slope indicating
the average proportionality factor between the two variables being
correlated; that slope is independent of the correlation r, which is the
main point I was trying to get straight.

And way, way, behind all those questions still lurks the fundamental
reason I have any interest at all in data with low correlations. I have
been asking for a long time: what are the chances that a judgment about
the next person you meet will be correct, if it is based on a statistical
relationship between the judgment and the observation on which you base
the judgment? I think this is a very important fact to know, because if
you consider the payoffs and penalties for making judgments that turn out
to be right or wrong, it becomes clear that there are some judgments that
should not be made. If I see any uses in statistics, that would be the
most important one: enabling us to know when the use of statistics-based
judgment is contraindicated. That is how we decide whether to call a
correlation “low.” If the risk to the person about whom the
judgment is being made outweighs the potential benefits of making the
judgment, the correlation is too low to use. Of course the alternative to
that is to decide that we don’t care what the risk to the other person
is, and are concerned only with our own track record. Either way, we have
declared our position concerning the morality of judging others.

Best,

Bill P.

[Martin Taylor 2009.05.23.14.18]

I’ll try to help. For me the problem is that I see you as looking for
complications, where really they don’t exist. If a vector is
represented (as every vector is) by an ordered set of numbers, it is no
less a simple directed line for being representable as a line in a
high-dimensional space. It isn’t the least bit exotic.

Having said that as preamble, let’s proceed.

[From Bill Powers (2009.05.23.0911 MDT)]

Martin Taylor 2009.05.22.15.45

When I think “regression line” in two dimensions, I’m thinking of a
relationship like y = ax + b, where we’re varying x (or observing
natural variations in x) and measuring the value of y for each value of
x. Assuming a linear and noise-free set of measurements, every pair of
values x,y will be related by the same formula with the same values for
a and b. Actually we would need only two measurements with different
values of x to determine a and b, and those values of a and b would
then work for all further measurements of x and y.

Fine so far.

Now suppose we have a noisy set of measurements. In that case we would
make many more measurements of x and y and compute the coefficients a
and b of a line that minimizes the squared difference between the
measured y values and the y values the line predicts, over the whole
data set. For simplicity, suppose we make 8 measurements instead of only
2. Now we have 8 values of x and 8 values of y along the two axes.

Still fine.
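A quick sketch of that least-squares step (mine; numpy.polyfit is one
standard way to get the minimizing a and b from the 8 noisy pairs):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 7, 8)                  # 8 measurement points
    y = 1.5 * x + 2 + rng.normal(0, 0.5, 8)   # true a = 1.5, b = 2, plus noise

    a_hat, b_hat = np.polyfit(x, y, 1)        # least-squares line through the 8 pairs
    print(a_hat, b_hat)                       # close to 1.5 and 2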

Before continuing, I would like to emphasise your last sentence: “Now
we have 8 values of x and 8 values of y along the two axes.” Yes, we
do, and if you want to use computational analytic methods, that’s as
far as you go. It’s only if you would like to use the geometric
approach that you need to start worrying about 8-dimensional
hyperspace. Let’s suppose that you do want to use the geometric
approach. I do, because my way of approaching problems is essentially
visual, and if I can see how something fits together, I’m much more
assured about it than I am if I just did the algebraic or arithmetic
analysis. If the two approaches give different results, I know I’ve
made a mistake in one of them. For me, that mistake is rather more
probable in the analytic domain (inverting a sign, missing a bracket,
or something like that).

According to what I am hearing, each set of values now is to be plotted
in an eight-dimensional space, not the original 2 dimensions, with the
vector length of x being the square root of the sum of the squares of
the individual x measurements, and the vector length of y being computed
similarly. We have to try to imagine two vectors pointing in different
directions in 8-space.

Yes, in a way you do. But that’s a way station to getting back into a
2-D space. If you refer back to my preamble above, each of these
vectors can be seen as a simple ordinary line. Rather than thinking of
it in 8-space immediately, think of a sequence of increasingly
high-dimensional spaces.

[Figure: vectors in N-space]

At this point I see no relationship between the original x-y space and
the new 8-space. Each x measurement can be seen as a vector lying in
the x-axis with a length equal to the value of x. But there are eight
of these measurements, and in 8-space they all point in different
directions. There seems to be no way to relate the direction of the
vector in 8-space to the direction of the x axis.

Don’t think of the x-axis in this context. The separate points on the
x-axis are now the components of the x-vector. The 8 axes are “1st
measurement, 2nd measurement, 3rd measurement …, 8th measurement”,
and ALL of the x measurements are represented together in a single
point, or if you like, by a vector from the origin to that single point
(as in the diagrams above). In the diagrams above, the point in 1-space
is {3.5}, the point in 2-space is {1.2, 0.8}, and so forth. The point
in 8-space is {x1, x2, x3, …, x8}. It’s just one point.

There’s another point in the same space, which has the axes “1st
measurement, 2nd measurement, … 8th measurement” that represents your
8 measurements of y, and there is a corresponding y-vector connecting
the origin to that point.

Now we have to be concerned about vector lengths or “magnitudes” (the
||x|| symbols in the expression). In 2-D space, there is Pythagoras’s
Theorem, “The square on the hypotenuse is the sum of the squares on the
sides forming the right angle”. In the case of the 2-D vector, those
sides are its component values on the two axes. Its length is
sqrt(X^2 + Y^2). The same holds in N dimensions. It’s just a fact of
Euclidean spaces. In your 8-D space, the length of the x-vector is
sqrt(sum(xi^2)).

The interesting length when we are talking about correlation is that of
the vector connecting the x-point to the y-point. That vector has
components that are the differences between the corresponding
components of the x and y vectors. Let’s do this again in spaces of
increasing dimensionality. If you have a 1-D x-vector as in the diagram
(3.5) and a y-vector in the SAME space with its single component being
(2), then the distance between them is 1.5. That’s obvious, but to fit
it into the sequence of increasingly high-dimensional spaces I will
write it as sqrt((3.5 - 2)^2). For the 2-D space, let’s add a y-vector
(0.8, 1.6). The length of the vector from the y-point to the x-point is
sqrt((1.2-0.8)^2 + (0.8-1.6)^2). The same applies to all the other
spaces. In the 8-D space, the length of the difference vector is
sqrt((x1-y1)^2 + (x2-y2)^2 … +(x8-y8)^2)

All of these vectors are just simple lines, regardless of how many
dimensions there are in the space in which they are described. When we were
talking triangulations in order to judge how much of the disparity
between the model and real track was due to noise and how much to model
failure, the dimensionality of the space was 3600, but the triangles
were still simple 2-D triangles in a plane.
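In code, the whole construction is two one-line functions; this sketch
(mine) uses the 2-D example values from the paragraph above, and the
same functions apply unchanged in 8 or 3600 dimensions:

    import math

    def length(v):
        # N-dimensional Pythagoras: square root of the sum of squared components
        return math.sqrt(sum(c * c for c in v))

    def difference_length(u, v):
        # length of the vector from the v-point to the u-point
        return length([a - b for a, b in zip(u, v)])

    x2 = [1.2, 0.8]                   # the 2-D x-vector from the text
    y2 = [0.8, 1.6]                   # the 2-D y-vector from the text
    print(length(x2))                 # sqrt(1.2^2 + 0.8^2)
    print(difference_length(x2, y2))  # sqrt((1.2-0.8)^2 + (0.8-1.6)^2)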

Let me stop here and ask whether I have made a mistake already.

I don’t think so, except in thinking about the x-axis as though there
were 8 different x-axes rather than one x-point in the 8-D space.

I hope this helps a bit, and lets you go on from there with your
analysis of the geometric way of looking at time series.

Martin

(Attachment vectors_in_N-space.jpg is missing)

[Martin Taylor 2009.05.25.10.14]

[From Bill Powers (2009.05.25.0540 MDT)]

[Martin Taylor 2009.05.24.18.03]

MT: I ask seriously: did this help?

BP: I thank you for a sincere attempt which must have cost you a lot of
time to construct. I follow the steps you go through, but still don’t
see
how that route could look simpler to you than just calculating the
correlation.

In itself, it certainly is not simpler than just calculating the
correlation, because you have to make exactly the same calculations,
plus a little extra. Where (to me, with a visual mind) the advantage
comes is when one is thinking about the relationships among
correlations or the implications of correlations. I would not have been
able to demonstrate the relationship between control ratio and the
perception-disturbance correlation without using the correlation angle.
When we were discussing the possibilities of detecting small misfit
elements in a model that fits tracking data very well, putting together
more than one “correlation triangle” helped me to see in my mind’s eye
how one factor might be affecting another.

Everything that can be done with the correlation angles can be done by
calculating the formulae and by using vector and matrix analysis, and
if one needs to be exact, one must do that. But I find that many things
are more easily seen in the visual representation than in the plethora
of symbols needed to represent the same things as formulae. For
example, if A and B have a known correlation, and so do B and C, then
the maximum and minimum correlations between A and C can be determined
by thinking of one triangle OAB hinged to another, OBC, along the OB
line, and rotating them into the two possible coplanar positions, thus
adding or subtracting the BOA and BOC angles. You don’t get the exact
number from the visual representation, but you do get an idea of the
relative magnitudes, and that gives you a sanity check if you need to
do the calculations.
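The hinged-triangle picture translates directly into code; this sketch
(mine, as an illustration of the construction just described) gives the
extreme possible correlations between A and C from r(A,B) and r(B,C):

    import math

    def corr_bounds(r_ab, r_bc):
        # Treat each correlation as cos(theta). Hinging triangle OAB to OBC
        # along OB and flattening into one plane gives theta_ab + theta_bc
        # (minimum r_ac) and theta_ab - theta_bc (maximum r_ac).
        t_ab = math.acos(r_ab)
        t_bc = math.acos(r_bc)
        return math.cos(t_ab + t_bc), math.cos(t_ab - t_bc)

    print(corr_bounds(0.9, 0.9))   # r_ac can lie anywhere from 0.62 to 1.0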

My original confusion as to just what angle is related to the
correlation is also cleared up, I think. It’s not the angle of the slope
indicating the average proportionality factor between the two variables
being correlated; that slope is independent of the correlation r, which
is the main point I was trying to get straight.

That’s what I thought you had known all along, which was obviously a
source of confusion in the discussion.

And way, way, behind all those questions still lurks the fundamental
reason I have any interest at all in data with low correlations. I have
been asking for a long time: what are the chances that a judgment about
the next person you meet will be correct, if it is based on a
statistical relationship between the judgment and the observation on
which you base the judgment? … If the risk to the person about whom the
judgment is being made outweighs the potential benefits of making the
judgment, the correlation is too low to use. Of course the alternative
to that is to decide that we don’t care what the risk to the other
person is, and are concerned only with our own track record. Either way,
we have declared our position concerning the morality of judging others.

I haven’t commented on this, but I do have a couple of comments. My
first comment is that there are more considerations than just what you
mention, considerations that ethicists pose as a generic problem with
things like vaccination. They are exemplified by the thought problem of
the person who can see that a runaway tram will kill 5 people on the
track unless he pushes one person onto the track to stop the tram (of
course he could selflessly lie on the track himself, but that’s never
proposed as an alternative). If you have a drug that kills or does
serious damage with some small probability, but cures an otherwise
frequently fatal disease with some reasonably high probability and
almost certainly prevents its onward transmission, what, ethically, is
the right thing to do? Do you offer the drug, telling the patient that
there is x% probability of dying from the disease and a high
probability of in the process passing the disease on to several other
people who would also have an x% probability of dying from the disease,
y% (larger than x) of dying from the drug, z% (moderately high)
probability of being cured if the drug is taken, and the near certainty
of not passing on the disease if the drug is taken? That’s informed
consent, but is it socially responsible to allow people deliberately to
contribute to the development of an epidemic or pandemic? What about
immunization that simply prevents disease transmission, when with low
probability the immunization itself will damage the person, but a
failure of a moderate proportion of the public to be immunized will
leave the whole population vulnerable to a potentially fatal disease? I
offer
no opinion on that in general, but I suspect that if I were required to
make a decision in a particular case, it would depend on x, y, and z.
The problem is ethical and political, not technical.

On the more technical side, your issue really depends on the relation
between the variables being linear and extending past zero from
possible benefit to possible risk. Quite often, the question isn’t
whether the treatment has any risk attached, but whether there is any
benefit. Everything we do carries with it some risk, and a potential
treatment is no different. The question is what kind of risk compared
to what kind of benefit, and whether the risk incurred by not using the
treatment outweighs the different risk incurred by taking it.
Furthermore, a low correlation does not in itself suggest that there is
any added risk. It just says that it would be hard to tell whether a
particular patient would get more or less than the average benefit,
which could be substantial for all.

Here are four examples of possible situations. Only one shows high
correlation between dose and benefit, two have a highly predictable
relation between dose and benefit, three are safe to use and pose no
risk to the patient, and only one has low correlation and poses a risk
to the patient. I could have also shown one with high correlation and
risk to the patient, exemplifying immunization that protects against a
damaging illness if the dose is sufficient, but I think this is enough
to make the point.

[Figure: different relations between correlation and safety]

Martin

(Attachment risk-benefit_correl.jpg is missing)