# Statistical Analysis

[From Kenny Kitzke (2007.07.19)]

<Bill Powers (2007.07.19.1310 MDT)>

<My opinion of statistical studies is still that all they give us is a lot of completely unexplained facts of very low quality. Of course that’s a statistical generalization.>

Before you give up on the value/merit of statistical studies, allow me to observe that I use statistical data and statistical analyses in my continuous performance improvement consulting with quite satisfying results.

I often see uses of statistical data in studies and experiments where conclusions are drawn and future results are claimed that are not justified by the analyses. Among the worst are using correlation as evidence of cause. Another is to apply population statistics as a reliable predictor of individual results.

Properly used, statistical analyses obey sound scientific/mathematical laws, just like F = ma or a² + b² = c² for the hypotenuse of a right triangle. This can be valuable to understanding how systems work.

But, there are conditions for statistical laws to be valid. And, there are severe limitations on even using these valid laws to predict future outcomes. Let me address the first.

All statistical data is historic. Any predictive power depends on the system that produced the data remaining functionally the same. This is true even for predicting future population outcomes, much less any particular individual outcome.

I often use dice rolling to illustrate statistical laws with my clients. Most have done this, or you can do it live in front of them. They readily understand that the mathematical probability of possible outcomes predicts that the most likely next roll will be 7. They also readily agree that the next roll cannot be predicted at all, except that it will range from 2 to 12.

These statistical laws are so certain over time with large trials that casinos can count on making a profit if the system remains the same and is run the same way. However, if you change the system for “rolling” dice (so that outcomes are not equally likely), those laws will not work. For example, if the roller arranges the two dice with sixes up between two fingers and drops them exactly one inch onto a cloth-covered table, you will not necessarily find that 7 is the most frequent result in 1000 rolls.

Even if you can predict the probability of rolling a seven on the next roll very accurately, or of actually rolling more sevens than any other number in 1,000 rolls, the chances of rolling something other than a 7 on the next roll are greater than 80%.
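For readers who want to check the dice arithmetic, here is a short simulation sketch (not part of the original exchange; the roll count and seed are arbitrary). It confirms that 7 is the modal sum of two fair dice, yet any single roll misses 7 most of the time:

```python
import random
from collections import Counter

def roll_sums(n_rolls, seed=0):
    """Simulate n_rolls throws of two fair dice and tally the sums."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))

counts = roll_sums(100_000)
most_common_sum = counts.most_common(1)[0][0]        # 7, the modal sum
p_not_seven = 1 - counts[7] / sum(counts.values())   # near 30/36, i.e. about 0.83
```

The exact probability of missing a 7 is 30/36, a bit over 83%, consistent with the “greater than 80%” figure above.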

Well, what may seem like bad news about the value of statistical analyses and predictions also has an upside: when a system is carefully characterized, you can intentionally perturb it to see how the system change affects the resulting data. In other words, you can experiment to see how changes in a system actually change the results. It seems similar to the test for the controlled variable!

This reveals the only reliable way to improve any health care system. Looking at other systems will only give clues as to what might be a positive change for a desired new result. The same change in different systems can produce dramatically different results. So, if you want to see whether a single-payer method is superior to a multi-payer method in producing better health care outcomes, you have to take the current system, modify it ONLY in that way, and record the results for long enough that you are convinced it is a helpful change.

I think you can see how difficult that is to do in a complex system with many variables that you can’t hold constant while experimenting one at a time. There are sophisticated statistical factorial experiments that would allow several levels of several variables to be evaluated for effect in one experiment. But, for very complex systems and ones that can change in ways that cannot be controlled, even these methods may not give precise answers. And, from my experience, very few organizations, even research organizations, are proficient at understanding and properly applying such advanced methods.
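As a minimal sketch of the factorial idea mentioned above (the runs and response values below are invented purely for illustration), a design with two factors at two coded levels lets both main effects be estimated from just four runs:

```python
# Hypothetical 2x2 factorial: factors coded -1 (low) / +1 (high),
# response values invented for illustration only.
runs = [
    # (factor_a, factor_b, response)
    (-1, -1, 40.0),
    (+1, -1, 48.0),
    (-1, +1, 44.0),
    (+1, +1, 52.0),
]

def main_effect(runs, factor_index):
    """Mean response at the high level minus mean response at the low level."""
    high = [r[-1] for r in runs if r[factor_index] == +1]
    low = [r[-1] for r in runs if r[factor_index] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

effect_a = main_effect(runs, 0)  # 8.0 with the numbers above
effect_b = main_effect(runs, 1)  # 4.0
```

Each run does double duty in estimating both effects, which is why factorial designs are economical; the hard part, as noted above, is holding everything else constant.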

Lastly, humans are such complex systems. So the mysteries of human nature and behavior remain difficult to understand or predict even for populations and certainly for individuals.

See you soon at the University of Minnesota.

···


[From Bill Powers (2007.07.19.1935 MDT)]

Kenny Kitzke
(2007.07.19)

Kenny, you’re a breath of fresh air. I agree with you that there are
definitely useful and valid applications of statistics, but that they
don’t include trying to predict or characterize the behavior of
individuals. The temptation to apply group statistics to individuals,
however, is apparently too strong to be resisted by many people.

Your casino example is exactly the right one to show how statistics can
be useful. The casino’s profit is highly reliable and
predictable.

I especially agree with your recommendation that any proposed health
insurance plan be tested to see if it will actually be an improvement. If
everybody could agree to that, nobody would have to take a position on
what the best plan would be. We would simply find out.

You have depths I had not suspected.

See you soon.

Best,

Bill P.

[From Rick Marken (2007.07.20.1430)]

Bill Powers (2007.07.19.1935 MDT)--

Kenny Kitzke (2007.07.19)

Kenny, you're a breath of fresh air. I agree with you that there are
definitely useful and valid applications of statistics, but that they don't
include trying to predict or characterize the behavior of individuals. The
temptation to apply group statistics to individuals, however, is apparently
too strong to be resisted by many people.

I'm glad to hear that you agree that there may be some valid
applications of statistics. Is it just the use of regression for
individual prediction that gets you so upset about statistics? If so,
could you tell me what you would do if there were a limited number of
slots available for students in college (as there are, due to funding
limitations) and you wanted to put the most promising students in
those slots; how would you select the students for the slots? This is
the situation administrators are in when they are accepting students
into colleges. Should these administrators ignore the statistical
data, which show that kids with high SAT scores tend to do better in
college, and just let kids in on a first come, first served basis or
something like that?

Your casino example is exactly the right one to show how statistics can be
useful. The casino's profit is highly reliable and predictable.

Yes, I made the same point in earlier posts. Statistical inference
methods, which are based on probability theory, can very accurately
predict group level results. That's why political polling works so
well -- at least, when the sampling is done appropriately (as Kenny
pointed out).

I especially agree with your recommendation that any proposed health
insurance plan be tested to see if it will actually be an improvement. If
everybody could agree to that, nobody would have to take a position on what
the best plan would be. We would simply find out.

What kind of test would you accept? The best test would be a
completely randomized design which would involve dividing the US into
two randomly selected groups of 150,000,000 each (that way we could
avoid sampling and the use of the dreaded statistics). One group
(treatment group) gets a single payer and the other (control group)
stays with the existing system. You would then have to measure the
performance of the two groups (infant mortality, life span, etc) over
some acceptable period of time, like, maybe, 20 years. Then you
compare the groups in terms of the measures you consider relevant. You
wouldn't have to use statistics because you are testing the entire
population; no inferences necessary.

I see a few possible problems with this approach to testing policy
proposals: The study would cost a bazillion dollars, it would take a
long time, and when it was completed it is likely that those with
vested financial interests would challenge the results anyway, to the
extent that they came out in the "wrong" way.

The fact is that the only practical way to test policies is using
"quasi-experiments", the specialty of Donald T. Campbell who was one
of the few prominent conventional psychologists to admire your work. A
quasi-experiment is one where there is no true control group. For
example, a quasi experiment is done when different health care systems
are implemented in different states or countries. These groups are, of
course, not exactly equivalent before the treatment but you do the
best you can to decide the extent to which any observed difference in
the performance of the groups is a result of any of the initial
differences between the groups or a result of the treatment. One way
to do this is to factor out (statistically -- yikes) potential
confounds that are thought to be important, such as income levels or
health levels of all participants. This is called analysis of
covariance. Another way to do this is to look at time changes in the
performance variables -- so called "time series" designs.

I have done some time series "quasi-experiments" with economic data
(which you seemed to like at the time). One nice example was looking
at the US budget balance from 1980 to 2004. The question was whether
the policies of the Clinton administration (the treatment) were
responsible for the budget surplus that existed at the time Bush II
entered the White House. The data show that the budget hovered in a
deficit during the years before Clinton came in. Then the deficit
steadily and nearly linearly moved into surplus through the Clinton
years and then instantly turned back toward deficit when Bush came in.

Since there is no control group, the change in deficit could be the
result of "other factors" that came into existence as soon as Clinton
entered office and went away as soon as he left. But this kind of
coincidence seems rather unlikely (to everyone except Republicans, I
suppose) and that's the logic of the time-series quasi experiment. My
data, for example, constitute what is called a reversal time-series
design. You start by getting a baseline measure of performance before
the treatment (in this case, the yearly measure of the budget balance
before the Clinton presidency). Then you institute the treatment (a
competent and decent human being) and measure the budget balance
during the treatment period. Finally, you "reverse" the treatment
(install Bush) and see what happens. It is assumed that confounding
changes in other variables (like economic growth) are unlikely to be
perfectly timed to correspond to the time of implementation and
reversal of the treatment. And, indeed, I measured growth over the
whole period and showed that this potential confound varied in the
same way before, during and after Clinton. So it was not a confound.
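The logic of a reversal (A-B-A) design can be sketched with invented numbers (these are not the actual budget figures): compare the trend of the outcome within each phase, and look for the flat/up/down signature timed to the onset and removal of the treatment.

```python
# Invented yearly figures for a baseline / treatment / reversal design.
baseline = [-3.0, -2.8, -3.1, -2.9, -3.0]      # flat deficit
treatment = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]  # steady move toward surplus
reversal = [1.0, -0.5, -2.0, -3.5]             # trend flips back down

def trend(series):
    """Average year-over-year change within a phase."""
    return sum(b - a for a, b in zip(series, series[1:])) / (len(series) - 1)

trends = {"baseline": trend(baseline),
          "treatment": trend(treatment),
          "reversal": trend(reversal)}
# Signature the design looks for: near-zero, then positive, then negative.
```

A confound would have to reproduce exactly this timing to mimic the treatment effect, which is the argument made in the paragraph above.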

There may be time series data that could be used to test health care
policy questions, but it's unlikely that there is reversal data
(unless there is a country that went to single payer and then went
back). But there are all kinds of quasi-experimental data that are
available for testing health policy. The vested interests and
ideologues are going to reject that data anyway-- like the data on the
cost/effectiveness of Medicare, the VA, the French system, etc -- but
this kind of data (which you spurn as "statistical") is a lot cheaper
than the completely randomized design (sans sampling so that it is not
statistical) that I mentioned initially as the correct way to test the
effectiveness of health care policies. And, though not perfect, it
seems like a reasonable basis for making informed policy decisions.

I do take issue with one comment in Kenny's otherwise nice discussion
of statistics. Kenny said:

I often see uses of statistical data in studies and experiments
where conclusions are drawn and future results are claimed that
are not justified by the analyses. Among the worst are using
correlation as evidence of cause.

The idea that "correlation does not imply causality" is a
methodological, not a statistical concept. Correlation is an ambiguous
term; it refers to a descriptive statistic (r, the correlation
coefficient) and to the way data is obtained. r is just a measure of
the degree and direction of linear relationship between two variables.
It neither implies that there is, nor that there is not, a causal
relationship between the variables.
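For concreteness, r can be computed straight from its definition (a sketch added for illustration); note that nothing in the calculation refers to which variable, if either, is a cause:

```python
import math

def pearson_r(xs, ys):
    """Degree and direction of the linear relationship between two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect linear relation gives r = 1 whether x drives y, y drives x,
# or something else drives both.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```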

The ability to determine causality (according to standard research
methods dogma) depends on how the data was obtained. If the data was
obtained by correlational methods then r does not imply causality. In
correlational methods the variables to be correlated are collected in
an "uncontrolled" way; we just get measures of each variable from each
individual in the study. So we might measure the height and IQ of 20
different individuals. Any correlation (r) between these two variables
does not imply causality because there was no control for other
variables (like socioeconomic status) that might be responsible for
the relationship. If, however, the data were collected using
experimental methods -- where all variables other than the independent
and dependent variables are controlled (held constant) -- then a
correlation (r) between the IV and DV _does_ imply causality; the
correlation (r) means (according to standard research methods
assumptions) that variations in the IV do cause variations in the DV.
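The height/IQ point can be made concrete with synthetic data (invented for illustration): let a lurking variable drive both measures, and a high r appears even though neither variable influences the other.

```python
import math
import random

rng = random.Random(1)

# Synthetic data: z (say, a socioeconomic-style lurking variable) drives
# both x and y; x and y never influence each other.
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in xs)
                           * sum((b - my) ** 2 for b in ys))

r_xy = pearson_r(x, y)  # roughly 0.8 here, with no x->y or y->x causation
```

Holding z constant experimentally would make this spurious correlation vanish, which is exactly the control-of-other-variables argument above.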

Of course, PCT shows that, when we are dealing with a closed loop
system, even a correlation (r) between an IV and DV that is obtained
under controlled conditions does not necessarily imply causality.
Since IV-DV experiments are typically analyzed using statistics like F
or t, students get the impression that these statistics _do_ imply
causality while the r statistic does not. It turns out that no
statistic implies causality. Causality is a conclusion based on
modeling, not on the type of data observed or the type of statistic
used to describe or make inferences about that data.
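A toy negative-feedback loop (invented parameters, sketched here to illustrate the PCT point) shows why controlled conditions are not enough: the system's output ends up correlated almost perfectly with the disturbance it opposes, so an observer treating the disturbance as IV and the output as DV would "find" a near-perfect causal law while missing the controlled variable entirely.

```python
import math
import random

rng = random.Random(2)

# Minimal control loop: controlled quantity qi = output + disturbance;
# the output integrates the error (reference - qi). Parameters are invented.
ref, gain, dt = 0.0, 50.0, 0.01
output, d = 0.0, 0.0
d_hist, o_hist = [], []
for _ in range(20_000):
    d += rng.gauss(0, 0.02)           # slowly wandering disturbance
    qi = output + d                   # controlled variable, held near ref
    output += gain * (ref - qi) * dt  # negative feedback opposes the error
    d_hist.append(d)
    o_hist.append(output)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in xs)
                           * sum((b - my) ** 2 for b in ys))

r_disturbance_output = pearson_r(d_hist, o_hist)  # close to -1
```

The near-perfect negative correlation reflects the loop's organization, not a stimulus-response law, which is the modeling point being made above.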

Another small nit in Kenny's piece. Kenny said:

Another [mistaken use of statistics] is to apply population statistics
as a reliable predictor of individual results.

There is no such thing as "population statistics". Statistics are
descriptions of sample data that are used as the basis for inferences
about a population parameter. I imagine Kenny must have meant that it
is a mistake to use statistical measures based on group data (like the
least squares regression line) as a predictor of individual results.
And if this is a mistake (which I don't think it is) then you've got a
heck of 'splainin' to do to a lot of college admissions officers, who
use grades as a basis for predicting school performance, and
executives, who use data on past performance as a basis for hiring.

Best

Rick

···

--
Richard S. Marken PhD
Lecturer in Psychology
UCLA
rsmarken@gmail.com

[Martin Taylor 2007.07.20.21.05]

[From Rick Marken (2007.07.20.1430)]

Bill Powers (2007.07.19.1935 MDT)--

Kenny Kitzke (2007.07.19)

Kenny, you're a breath of fresh air. I agree with you that there are
definitely useful and valid applications of statistics, but that they don't
include trying to predict or characterize the behavior of individuals. The
temptation to apply group statistics to individuals, however, is apparently
too strong to be resisted by many people.

I'm glad to hear that you agree that there may be some valid
applications of statistics. ....... (a lot of dots)

Great post, Rick!

Martin

[From Bill Powers (2007.07.20.1715 MDT)]

Rick Marken (2007.07.20.1430) –


I have always agreed, and have said many times, that there are valid uses
for group statistics. I don’t know where you get off implying otherwise.
The only invalid use is the attempt to use them to establish facts about
individuals. If you study groups, the facts you get are about groups. If
you study individuals, the facts you get are about individuals. You have
to show that some statement is true of essentially every member of a
group before you can use it successfully to predict what any individual
in the group will do. That means your group statistics must involve VERY
high correlations. That is almost never the case.

In the first place, the most promising individuals will get educated
anyway; I would give preference to those who need help more. Picking only
the best students who will learn even from a lousy teacher makes the
school and teachers look good, but doesn’t help the students who need
education the most.

In the second place, if the goal is to make the teaching look good, then
group statistics is the way to go: it doesn’t matter which students
benefit, as long as more people with the desired characteristics are
obtained through the screening process than without it. The ones who are
rejected are not your concern. If you pick people from group A (students
with high SAT scores), group B (the graduates from the university) will
do better than graduates who had low SAT scores would do. It doesn’t
matter which individuals are more successful – that is why you can use
group statistics. And it doesn’t matter if quite a few do worse than
expected, as long as the average goes the right way.

I would prefer individual interviews by the people who are going to teach
the students, with desire and ability to learn being weighted more than
past performance. Of course that is not practical when there are 5000
students from all over the country to be interviewed in the space of a
week or two by 10 or 20 faculty members. Anyway, it’s cheaper and easier
just to administer a test and hire a clerk to sort the results by score.
That’s why it’s done that way, not because that way is better for the
students.

The same question arises about every other use of group statistics to deal with individuals.
How many students who will do poorly are accepted, and how many who would
do very well are rejected, by this method? A person who doesn’t care
about any of the individual applicants can safely ignore this question,
and use group statistics, and be sure of showing a good record of
successes, because success is measured as a group average. But if my
question is considered important, one simply has to do the arithmetic and
find out what the answer is. I’m sure the data are there. It’s just that
interest in finding this particular piece of information is very low,
possibly because everyone knows pretty much what the result would be,
though it’s not discussed in polite company.

If a student has any choice, my advice would be not to take any of those
tests, because the chance of being wrongly classified is very high (or so
I claim, until someone shows me otherwise). Unfortunately, if you don’t
take the tests you don’t get in, so you have to take them.

Very simple. Try out the system locally, and ask everyone how it’s
working and what the problems are. Study all the individual cases, and
THEN aggregate the results. Most systems that are put in place don’t even
do that much: whoever yells the loudest or buys the most votes gets his
system put in place with little or no pretesting. Then they defend it
against all criticisms.

We could eventually get that kind of data, but why not try it out on a
much smaller scale first? If a very clear advantage can’t be seen in a
small study, there’s no point in doing a larger one to see if it still
holds up. Large studies are large usually because on a smaller scale no
clear effect can be shown to exist, so (I would say) the approach is not
worth pursuing – unless you’re only interested in slight group average
effects.

Yes, that is why these full-scale expensive and grandiose proposals are
sure to be turned down. If I were against finding out the truth about
health care or education, that is exactly the proposal I would make to a
congressional committee. Then nobody could blame me when the proposal is
turned down.

This is fine if all you want to know is facts about groups. If you want
to apply the results to individuals, and care whether you treat each
person appropriately, you will not do it this way.

I did. It was valid group data about group phenomena. The conclusions
were not used to determine the fate of any individual in
particular.

No problem. Policy effects are group effects. If you examined individual
cases you would find many, many deviations, both small and large, from
the average relationships. The facts you cite are group facts, not facts about individuals.

You were judging the effect of a policy on average effects over the whole
economy. A perfectly valid use of group statistics.

I wish you would stop trying to set me up as an enemy of statistics just
because I object to one egregious misuse of it. I may be dumb but
I’m not that dumb. I’m an enemy of the thoughtless, automatic,
superstitious use of group statistics as a way of finding out something
about an individual. I consider that to be a kind of formalized
prejudice. The point I have been making (and Richard Kennaway produced a
very detailed demonstration that I was more right than I knew) is that
you can use group statistics this way only if the correlations are in the
high 90’s – that is, they are seen in almost every case. If they are in
the 80s or lower, the number of misclassifications is entirely
unacceptable (to me).
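The claim about correlation levels can be checked by simulation (a synthetic sketch, not Kennaway's actual analysis): screen on a predictor correlated r with the outcome, and count how often the screen puts a case on the wrong side of the outcome's own median. For bivariate normal data the theoretical rate is arccos(r)/π, which is still about 20% at r = 0.8.

```python
import math
import random

rng = random.Random(3)

def misclassification_rate(r, n=20_000):
    """Fraction of cases put on the wrong side of the outcome's median
    when screened by a predictor correlated r with it (synthetic data)."""
    wrong = 0
    for _ in range(n):
        x = rng.gauss(0, 1)                                 # predictor
        y = r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1)  # outcome
        if (x > 0) != (y > 0):   # screen and outcome disagree
            wrong += 1
    return wrong / n

rate_at_80 = misclassification_rate(0.80)  # roughly one case in five
rate_at_98 = misclassification_rate(0.98)  # a few percent
```

Only when r climbs into the high 90s does the misclassification rate fall to a few percent, consistent with the argument above.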

Kenny’s point as I see it is that you should never use correlation alone
as evidence of causality, even though people do that all the time. If you
want to show causality, you have to use a different kind of evidence,
which is harder to get since it involves testing specimens instead of
casting nets. With group statistics, you can’t even show that a
particular B is caused by a particular A or set of As-- and what other
kind of causation is there?

Anyway, isn’t it a contradiction in terms to say that A causes B –
sometimes? Doesn’t that mean that A causes B except when it
doesn’t?

True, but you’re using a different kind of data than is required just to
calculate a correlation. You’re looking at individual cases. You’re
analyzing causality for individuals, then deducing the effect in
groups.

On the contrary, they have a lot of explaining to do about why they
insist on using mass measures to screen individuals on the basis of
correlations that are far too low to allow doing this one person at a
time. The only explanation they could possibly offer is that they don’t care what happens to the individuals who are misclassified.

I think your spreadsheet is a sufficient example of a case where using
the linear regression line yields ludicrous predictions (negative infant
mortality rates), and entails large quantitative prediction errors (100%
or more) in over a quarter of the individual cases, not to mention
generating mis-rankings by 20+ places (United States predicted third from
best, actually 25th from best).
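That failure mode is easy to reproduce with made-up numbers: fit a least-squares line to a few plausible-looking points, evaluate it just outside their range, and it cheerfully predicts an impossible negative rate.

```python
# Invented (predictor, mortality-per-1000) points, for illustration only.
data = [(1.0, 40.0), (2.0, 30.0), (3.0, 22.0), (4.0, 12.0)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
# Ordinary least-squares slope and intercept.
slope = (sum(x * y for x, y in data) - n * mean_x * mean_y) / \
        (sum(x * x for x, _ in data) - n * mean_x * mean_x)
intercept = mean_y - slope * mean_x

# Extrapolating slightly past the data yields a negative mortality rate.
predicted = slope * 5.5 + intercept
```

The regression line has no idea that mortality rates cannot go below zero; that constraint lives in a model of the system, not in the statistic.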

I’m willing to pipe down if someone can show me that predicting
individual behavior from group statistics, at correlation levels commonly
accepted as good, does not seriously misrepresent large numbers of
individuals. Or else convince me that misrepresentation doesn’t matter.

Best,

Bill P.

[From Kenny Kitzke (2007.07.20)]

<Rick Marken (2007.07.20.1430)>

···


[From Rick Marken (2007.07.20.1940)]

Martin Taylor (2007.07.20.21.05)--

> Rick Marken (2007.07.20.1430)--

Great post, Rick!

Thanks, Martin! I agree ;-)

Best

Rick

···

--
Richard S. Marken PhD
Lecturer in Psychology
UCLA
rsmarken@gmail.com

[From Kenny Kitzke (2007.07.20)]

<Rick Marken (2007.07.20.1430)>

<I do take issue with one comment in Kenny’s otherwise nice discussion
of statistics. Kenny said:

I often see uses of statistical data in studies and experiments
where conclusions are drawn and future results are claimed that
are not justified by the analyses. Among the worst are using
correlation as evidence of cause.

The idea that “correlation does not imply causality” is a
methodological, not a statistical concept. Correlation is an ambiguous
term; it refers to a descriptive statistic (r, the correlation
coefficient) and to the way data is obtained. r is just a measure of
the degree and direction of linear relationship between two variables.
It doesn’t imply that there is nor does it imply that there is not a
causal relationship between the variables.>

High correlations between two variables are necessary but not sufficient evidence to establish that a selected change in one will cause a known change in the other. The evidence is provided by an experiment. I fail to see what issue you are actually raising with my statement. Does my clarification still cause you difficulty?

<The ability to determine causality (according to standard research
methods dogma) depends on how the data was obtained. If the data was
obtained by correlational methods then r does not imply causality. In
correlational methods the variables to be correlated are collected in
an “uncontrolled” way; we just get measures of each variable from each
individual in the study. So we might measure the height and IQ of 20
different individuals. Any correlation (r) between these two variables
does not imply causality because there was no control for other
variables (like socioeconomic status) that might be responsible for
the relationship. If, however, the data were collected using
experimental methods – where all variables other than the independent
and dependent variables are controlled (held constant) – then a
correlation (r) between the IV and DV does imply causality; the
correlation (r) means (according to standard research methods
assumptions) that variations in the IV do cause variations in the DV.>

Yes, I agree with that special case. However, even in your example, I suspect it would be next to impossible to find 20 people whose only difference is their height and their IQ. And, even if you could, would a high correlation tell you whether tallness causes higher IQ or higher IQ causes tallness? In simple physical systems, we may be able to test how temperature or % of lime are correlated with concrete strength. Indeed, there are designed statistical experiments where both variables can be tested at the same time to find the best combination.

What I sometimes see in medical or health studies/research is something like this: 10 people took a green tea pill for 10 days and compared their blood pressure before and after. Another 10 people took a water pill for 10 days and did the same. There was no significant change for the water-pill folks, but 6 of the 10 taking green tea pills saw a reduction in blood pressure ranging from 2-10%, with a correlation coefficient of 0.6. Therefore, taking green tea pills is likely to cause a reduction in your blood pressure. I assume you might question the test methodology and the conclusion that any one person, or even 6 out of every 10 people, using green tea pills will reduce their blood pressure by up to 10%.
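Kenny's skepticism here is easy to check with a little arithmetic. Under the null hypothesis that the pills do nothing, each subject's blood pressure is as likely to drift down as up between the two readings; the study design and the numbers are Kenny's hypothetical, not real data:

```python
from math import comb

# Null hypothesis: the pill does nothing, so each subject's blood
# pressure is equally likely to drift down as up between readings.
# Chance that 6 or more of 10 subjects show a reduction anyway:
p = sum(comb(10, k) for k in range(6, 11)) / 2 ** 10
print(f"P(>=6 of 10 'respond' under the null) = {p:.3f}")
```

About 38% of such do-nothing studies would still show 6 or more "responders" purely by chance, so the result by itself supports no causal conclusion.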

···


[From Kenny Kitzke (2007.07.20)]

<Rick Marken (2007.07.20.1430)>

<Of course, PCT shows that, when we are dealing with a closed loop
system, even a correlation (r) between an IV and DV that is obtained
under controlled conditions does not necessarily imply causality.
Since IV-DV experiments are typically analyzed using statistics like F
or t, students get the impression that these statistics do imply
causality while the r statistic does not. It turns out that no
statistic implies causality. Causality is a conclusion based on
modeling, not on the type of data observed or the type of statistic
used to describe or make inferences about that data.>

Rick, first, I am sorry I sent a couple of posts before they were complete. It was the way I had screens displayed on my desktop; when I clicked to view another one, I would accidentally click the Send Now button.

But, wow, I really like and agree with your comments concerning human behavior experiments and causality. When I first read about PCT, I was pretty active with statistical analyses, and I saw how differently PCT viewed such studies compared to classic psychology. In no small way, this made me pursue PCT as a superior theory of behavior.

<Another small nit in Kenny’s piece. Kenny said:

Another [mistaken use of statistics] is to apply population statistics
as a reliable predictor of individual results.

There is no such thing as “population statistics”. Statistics are
descriptions of sample data that are used as the basis for inferences
about a population parameter. I imagine Kenny must have meant that it
is a mistake to use statistical measures based on group data (like the
least squares regression line) as a predictor of individual results.
And if this is a mistake (which I don’t think it is) then you’ve got a
heck of ‘splainin’ to do to a lot of college admissions officers and
executives, who use data on past performance as a basis for hiring.>

Statistical data is not only sample data. If we measure the weight of all the people, say 20, attending the CSG Conference, the calculated average and standard deviation are population statistics. This is so common that the Greek letter sigma is used for the population SD, rather than the English letter s, which would apply to, say, a random sample of 5 of the 20 attendees.

And, with such population statistics one can make highly reliable predictions. What is the chance that if 10 ride to their dorm room on the elevator, they will exceed the 2,000 pound elevator weight limit? If we took a random sample of 5 of the 20 attendees, we could use the sample mean and standard deviation to also make such predictions…but the confidence would be substantially lower.
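Kenny's elevator question can be sketched in Python. Only the 20-person population and the 2,000 lb limit come from his example; the individual weights below are invented for illustration:

```python
import random
import statistics

random.seed(1)

# Hypothetical weights (lb) of the 20 attendees -- the whole population,
# so mean and sigma here are exact parameters, not estimates.
weights = [142, 155, 163, 170, 174, 178, 181, 185, 188, 192,
           195, 199, 203, 207, 212, 218, 225, 231, 240, 252]
mu = statistics.mean(weights)
sigma = statistics.pstdev(weights)   # population SD (divides by N)

# Monte Carlo: chance that 10 randomly chosen attendees together
# exceed the 2,000 lb elevator limit.
trials = 100_000
over = sum(sum(random.sample(weights, 10)) > 2000 for _ in range(trials))
print(f"mean={mu:.1f} lb, sigma={sigma:.1f} lb, "
      f"P(overload) ~ {over / trials:.3f}")
```

Because all 20 weights are known, mu and sigma describe the population exactly; the only remaining uncertainty is which 10 people step into the elevator.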

<What kind of test would you accept? The best test would be a
completely randomized design which would involve dividing the US into
two randomly selected groups of 150,000,000 each (that way we could
avoid sampling and the use of the dreaded statistics). One group
(treatment group) gets a single payer and the other (control group)
stays with the existing system. You would then have to measure the
performance of the two groups (infant mortality, life span, etc) over
some acceptable period of time, like, maybe, 20 years. Then you
compare the groups in terms of the measures you consider relevant. You
wouldn't have to use statistics because you are testing the entire
population; no inferences necessary.

I see a few possible problems with this approach to testing policy
proposals: The study would cost a bazillion dollars, it would take a
long time, and when it was completed it is likely that those with
vested financial interests would challenge the results anyway, to the
extent that they came out in the “wrong” way.>

Small samples from populations (if random) can produce surprisingly accurate predictions of the population, as in voting polls. So, to test a change in a health care system (single payer), I think you could take, say, a sample of 2,000 people from Pittsburgh, picked at random, and gather their relevant monthly health care costs and outcomes for, say, the past two years. Then, put them on a single-payer system and begin to collect the same data. If the change were significant, it would not be unusual to see statistical evidence of it in less than a year. But the predicted result in the whole population would not have high confidence. Additional samples could be tried, and if they also confirmed a benefit from this change, a case could be made to command that the entire system be changed.
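The polling accuracy Kenny alludes to follows from the standard-error formula: the 95% margin of error for an estimated proportion shrinks like 1/sqrt(n). A minimal sketch (the 2,000 figure is from his example; the rest is textbook arithmetic, ignoring the finite-population correction):

```python
from math import sqrt

def margin_of_error(p, n):
    # 95% margin of error for a proportion from a simple random sample.
    return 1.96 * sqrt(p * (1 - p) / n)

# Worst case p = 0.5: a 2,000-person sample pins a population
# proportion down to about +/- 2.2 percentage points, while a
# sample of 5 says almost nothing.
print(f"n=2000: +/- {margin_of_error(0.5, 2000):.3f}")
print(f"n=5:    +/- {margin_of_error(0.5, 5):.3f}")
```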

However, as I cautioned before, any prediction of future performance of some statistical variable based on historical data will be reliable ONLY if the entire system remains the same. What time the sun will rise tomorrow in San Francisco will most likely be predictable with great accuracy. How many people will have a heart attack tomorrow in San Francisco may be quite unpredictable if at 6:00 AM there is a 7.0 earthquake.

See you in Minnesota and I’ll bring a check.

Kenny

···


[From Fred Nickols (2007.07.21.0726)]

[From Rick Marken (2007.07.20.1430)]

I'm glad to hear that you agree that there may be some valid
applications of statistics. Is it just the use of regression for
individual prediction that gets you so upset about statistics? If so,
could you tell me what you would do if there were a limited number of
slots available for students in college (as there are, due to funding
limitations) and you wanted to put the most promising students in
those slots; how would you select the students for the slots? This is
the situation administrators are in when they are accepting students
into colleges. Should these administrators ignore the statistical
data, that shows that kids with high SAT scores tend to do better in
college, and just let kids in on a first come, first served basis or something
like that?

As some of you might recall, I worked at Educational Testing Service for
almost 12 years. I'm no statistician nor psychometrician but I did pick up
some relevant knowledge. The statement above - about the SAT - is a bit too
broad. More specifically, as I was told (repeatedly) at ETS, the only claim
made for the SAT is that it is a good predictor of first-year academic
success. It is not as good a predictor of graduation. I was also told that
there are equally good predictors of first-year academic success and much
better predictors of graduation. Chief among these better predictors are
the parents' socio-economic standing.

And, apropos of Bill Powers' subsequent comment:

[From Bill Powers (2007.07.20.1715 MDT)]

In the first place, the most promising individuals will get educated
anyway; I would give preference to those who need help more.

James B Conant, president of Harvard during the depression years, shared
that same notion. He believed testing could be used to identify promising
candidates for university education (Harvard in particular) - regardless of
family background - and he put one of his assistant deans, Henry Chauncey,
to work on the task. There was considerable interest in the work Carl
Brigham, of Princeton, had been doing with something called the Scholastic
Aptitude Test. The College Board, originally a small association of Ivy
League admissions officers, became interested as well. The SAT eventually
became one of the College Board's chief assessment tools. One day in the
latter half of the 1940s, the executive secretary of the College Board
resigned to take a position at Stanford and Henry Chauncey became the
executive secretary. Not long afterward Chauncey founded Educational
Testing Service and left the College Board, taking its tests with him. The
rest is history.

Regards,

Fred Nickols
nickols@att.net

[Martin Taylor 2007.07.21.11.08]

[From Bill Powers (2007.07.20.1715 MDT)]

Rick Marken (2007.07.20.1430) --

I'm glad to hear that you agree that there may be some valid
applications of statistics. Is it just the use of regression for
individual prediction that gets you so upset about statistics?

I have always agreed, and have said many times, that there are valid uses for group statistics. I don't know where you get off implying otherwise. The only invalid use is the attempt to use them to establish facts about individuals.
...
I would prefer individual interviews by the people who are going to teach the students, with desire and ability to learn being weighted more than past performance. Of course that is not practical when there are 5000 students from all over the country to be interviewed in the space of a week or two by 10 or 20 faculty members. Anyway, it's cheaper and easier just to administer a test and hire a clerk to sort the results by score. That's why it's done that way, not because that way is better for the students.

In other words, you would like to get more information about each individual, but would discard the information you are actually able to get, on the grounds that it is obtained from people "like" the person of interest, rather than from that person's self. But would the result of interviews establish the "fact" you want to know about the individual (presumably the degree to which they will benefit from the education you offer, or the success of the medical intervention, or ...)?

No, your interview would not establish the "fact" you want to know. All you would do by conducting the interview is to make it more probable that you would select those who would be most likely to benefit (two levels of chance, here). And so it is with using group statistics when dealing with individuals. As Rick's fine posting pointed out, what the statistical analyses do is to give you a bit more information about people "like" the individual of immediate interest, information that you can apply when dealing with that individual.

If, using group statistics, red-heads are more likely to benefit from your service than are blonds, and you have a choice between offering the service to a blond or a redhead, with no way of getting more information about either, you'd be well advised to choose the redhead, wouldn't you? And that would be true even if you would be wrong for 40% of the pairs of individuals you might encounter, since without knowing their hair colour, you would be wrong 50% of the time. Sixty percent correct may not be much better than 50%, but it IS better. Causality doesn't enter into it; the correlation helps you make the choice that is more likely to turn out well.
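Martin's 60%-versus-50% claim can be checked by simulation. The benefit rates below are his hypothetical numbers, not data:

```python
import random

random.seed(0)

# Hypothetical benefit rates from Martin's example.
P_RED, P_BLOND = 0.6, 0.4
trials = 100_000
red_right = blind_right = 0
for _ in range(trials):
    red_benefits = random.random() < P_RED
    blond_benefits = random.random() < P_BLOND
    # Policy 1: always serve the redhead.
    red_right += red_benefits
    # Policy 2: ignore hair colour and pick one of the pair at random.
    blind_right += blond_benefits if random.random() < 0.5 else red_benefits
print(f"choose redhead: {red_right / trials:.2f}, "
      f"coin flip: {blind_right / trials:.2f}")
```

Always picking the redhead succeeds about 60% of the time against 50% for hair-blind guessing, which is Martin's point: the correlation improves the odds without establishing anything certain about either individual.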

You have to show that some statement is true of essentially every member of a group before you can use it successfully to predict what any individual in the group will do.

This correct statement is what seems to lead you astray. If what you really want is to predict SUCCESSFULLY what ANY individual will do, then it is true that you have to show that it is true of essentially every individual. If, on the other hand, you want to do your best for the individual in front of you at this moment, the situation is different.

Seldom in real life do you have the luxury of knowing exactly the effect your service to the individual will have. What you do have is perceptions you want to control, and actions that might influence those perceptions. As you (Bill P.) often point out, a given action does not always influence a particular perception the same way: multiple means to a single effect. When the effect of an action cannot be immediately observed (the effect of education on later life success), or is irrevocable (dosing with a probably beneficial but possibly lethal medication), then you can't wait until you can show the precise effect that your action will have. You must educate (or not) now, not when the person is dying of old age; you must medicate (or not) now, not when the person has succumbed to the disease (or recovered spontaneously).

Group statistics provide information that helps you to judge the probabilities that your actions will have this or that effect. In the case of education, one-on-one interviews may improve the probabilities in respect of the individuals, but you are always working with probabilities, not "facts".

Sometimes the group statistics are all you have to work with; sometimes you can do better. But you do a disservice to the individuals you want to serve if you arbitrarily discard relevant information on the grounds that probabilities are not certainties.

Martin

[From Bill Powers (2007.07.21.09367 MDT)]

Fred Nickols (2007.07.21.0726)

I’ve attached a snippet about SAT correlations. Getting the full data
would cost me $25, so I’m not ready to do that yet.

Here are the scatter plots for the Math SAT (left) and Verbal SAT
(right), from restricted data sets:

In the left plot, a Math SAT score of 700 plus or minus 50 would predict
a GPA of about 3.3 to 3.8, according to the regression line. However, the
actual scores for that range of SAT values range from about 3.9 down to
2.3, estimating by eye – in other words, anywhere within the range of
observed GPAs. This is with a correlation of 0.47 (the other correlation
is 0.32).

So it seems clear that if this one measure were used for evaluation, its
predictions of individual performance would be very poor – in fact, many
of the students in that range would do worse than the score predicts,
some very much worse.
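Bill's eyeball estimate matches what a bivariate-normal model with r = 0.47 predicts. The sketch below assumes illustrative score scales (SAT mean 500, SD 100; GPA mean 3.0, SD 0.5), not the actual data in the attachment:

```python
import random

random.seed(2)

# Simulate SAT/GPA pairs with correlation r = 0.47 and look at the
# spread of GPAs among students scoring near SAT 700.
r = 0.47
gpas = []
for _ in range(200_000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    sat = 500 + 100 * z1
    gpa = 3.0 + 0.5 * (r * z1 + (1 - r * r) ** 0.5 * z2)
    if 650 <= sat <= 750:
        gpas.append(gpa)
gpas.sort()
pred = 3.0 + 0.5 * r * 2.0          # regression prediction at SAT = 700
lo, hi = gpas[len(gpas) // 20], gpas[-len(gpas) // 20]   # middle 90%
print(f"predicted GPA near SAT 700: {pred:.2f}")
print(f"middle 90% of actual GPAs:  {lo:.2f} to {hi:.2f}")
```

Even at r = 0.47, students near SAT 700 span well over a full grade point, so the single predicted value says little about any one of them.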

While the overall trend is clear and could be used to predict the average
GPA over this population, given the average SAT score, using it to accept
or reject one individual would be very questionable.

Best.

Bill P.

SATCorrelations.doc (15 KB)

[Martin Taylor 2007.07.21.13.01]

[From Bill Powers (2007.07.21.09367 MDT)]

In the left plot, a Math SAT score of 700 plus or minus 50 would predict a GPA of about 3.3 to 3.8, according to the regression line. However, the actual scores for that range of SAT values range from about 3.9 down to 2.3, estimating by eye -- in other words, anywhere within the range of observed GPAs. This is with a correlation of 0.47 (the other correlation is 0.32)

A correlation of 0.47 means that the math SAT accounts for 22% of the variance, leaving 78% still out there, reflecting the effects of other influences (including, perhaps, whether the individual had a good sleep and a nice breakfast before taking the test). That means the standard deviation of the GPA results is still 88% of what it would have been had no test been done. But 88% is still better than 100%, and therefore using it will be fairer to more people than would not using it, even if not by much.

(for 0.32, the figures are 10% of the variance accounted for, and a reduction in standard deviation to 94%).
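Martin's percentages are straightforward to reproduce: r squared is the fraction of variance explained, and sqrt(1 - r^2) is the residual SD as a fraction of the original SD.

```python
from math import sqrt

# The two correlations from Bill's scatter plots.
for r in (0.47, 0.32):
    explained = r ** 2                 # fraction of variance explained
    residual_sd = sqrt(1 - explained)  # remaining SD, as a fraction
    print(f"r={r}: variance explained {explained:.1%}, "
          f"residual SD {residual_sd:.1%} of original")
```

This gives about 22% explained and an 88% residual SD for r = 0.47, and about 10% and 95% for r = 0.32, matching Martin's rounding.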

So it seems clear that if this one measure were used for evaluation, its predictions of individual performance would be very poor -- in fact, many of the students in that range would do worse than the score predicts, some very much worse.

While the overall trend is clear and could be used to predict the average GPA over this population, given the average SAT score, using it to accept or reject one individual would be very questionable.

Perhaps, but less questionable than refusing to use that information when you have it available. The real question is whether it's worth going to all the trouble to collect the data in the first place, if the benefit of using it is so slight.

Martin

[From Bill Powers (2007.07.21.1010 MDT)]

Martin Taylor 2007.07.21.11.08 –

In other words, you would like to get more information about each individual, but would discard the information you are actually able to get, on the grounds that it is obtained from people “like” the person of interest, rather than from that person’s self.

No, not on those grounds: on the grounds that it is more likely to be
false for a given individual than true. I would prefer high-quality facts
about each individual to low-quality facts, where “quality” is
simply the probability that the fact will prove to be true of the
individual being studied. An example would be the height of the
individual. I would prefer actual measures of the individual’s height,
and discard measures of people “like” that individual in
regards other than height. I would prefer to pat this person down, rather
than being told the likelihood that people with the same yearly income
carry guns. Even if black people score slightly lower than white people
on certain tests, I would prefer to know how a particular black person
scored, rather than knowing only how the average black person scored. I
would rather know a person’s actual grades in math courses than the grade
point average of people who had similar SAT scores. And I would be right,
because SAT scores are such poor predictors of grade point averages.
If I want to hire a mathematician, I will give him or her math
tests. When I tried for a job in the tech services department of a
newspaper, where I would have to diagnose and repair computers, the test
I was given consisted of descriptions of computer malfunctions that I had
to diagnose correctly.

But would the result of interviews establish the “fact” you want to know
about the individual (presumably the degree to which they will benefit
from the education you offer, or the success of the medical intervention,
or …)?

It would come a lot closer than an SAT score would. As to medical
interventions, I would prefer an MRI scan to a survey of people of
similar physical traits who show similar symptoms of a brain
tumor.

You’re offering all the standard defenses of statistics that I’ve heard
all my life. They all sound like attempts to defend doing something that
one intends to go on doing no matter what, even if it’s wrong. “Most
people are not going to stop applying group data to individuals, so it
can’t be wrong.” That’s basically what Rick said:

Kenny must have
meant that it is a mistake to use statistical measures based on group
data (like the least squares regression line) as a predictor of
individual results.

And if this is a mistake (which I don’t think it is) then you’ve got a
heck of ‘splainin’ to do to a lot of college admissions officers and
executives, who use data on past performance as a basis for hiring.

Back to Martin:

No, your interview would not establish the “fact” you want to know. All
you would do by conducting the interview is to make it more probable
that you would select those who would be most likely to benefit (two
levels of chance, here).

That would depend on what the interview is about. I would be trying to
see how much the person wants to attend college, and why, and how much
work the person is willing to contract to do, and whether the person’s
word is any good. I would not be much concerned about whether this person
was going to be a standout or a genius or get wonderful grades. Who is
there who doesn’t deserve a chance to try, even if the end result is not
something dazzling? And who is to say that a person who does poorly in
high school will not get his act together in college? Not me, for
certain! And who has the nerve to tell someone he doesn’t deserve an
education just because he isn’t smart? Who decides what a
“benefit” is? Would not a person with a middling to poor
intellectual history benefit more than a person who already does
intellectual gymnastics with ease? Or are you speaking of the teacher’s
or the university’s benefit, as opposed to the student’s?

And so it is with using group statistics when dealing with individuals.
As Rick’s fine posting pointed out, what the statistical analyses do is
to give you a bit more information about people “like” the individual of
immediate interest, information that you can apply when dealing with
that individual.

With a high likelihood of misclassifying the person and wrongly treating
the person. How high? That can be calculated, but nobody wants to touch
that calculation with a 10-foot yardstick, apparently. If you can show me
what the test results have to be to lower the probability of misjudging
an individual to an acceptable level, I will pay attention. Of course we
then have to agree on what is acceptable. Richard Kennaway did that
analysis once, and the results were so shocking that people with a vested
interest in statistics as it is done immediately leaped on him and then
abandoned the subject as quickly as possible. We’re not going to stop
doing that, so it can’t be wrong. What shall we talk about now?
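Bill's challenge ("show me what the test results have to be") can at least be framed. For standardized bivariate-normal variables with correlation r, the regression prediction leaves a residual SD of sqrt(1 - r^2), so the chance of missing an individual by more than a given margin is computable. This sketch is in the spirit of Kennaway's analysis, not a reproduction of it:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1 + erf(z / sqrt(2)))

# Chance that an individual's actual (standardized) Y lands more than
# half a standard deviation away from the regression prediction, for
# several values of the correlation r.
for r in (0.5, 0.9, 0.99):
    p_miss = 2 * (1 - phi(0.5 / sqrt(1 - r * r)))
    print(f"r={r}: P(|error| > 0.5 SD) = {p_miss:.3f}")
```

Even r = 0.9 misses an individual by more than half a standard deviation about a quarter of the time; psychology's typical correlations of 0.3 to 0.5 do far worse.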

If, using group statistics, red-heads are more likely to benefit from
your service than are blonds, and you have a choice between offering the
service to a blond or a redhead, with no way of getting more information
about either, you’d be well advised to choose the redhead, wouldn’t you?

If I were prepared to do an injustice 40 per cent of the time, yes. I
would be much better advised to devote some time to finding a better
criterion having some relationship to the service. There is always a way
to get more information if you really want it. I would offer to sell an
oil change to someone whose car had its last change 3000 miles ago,
rather than to someone whose hair color suggests he does not change his
oil as often as people with a different hair color do. I would turn to
the problem of reducing the number of
mistakes.

And that
would be true even if you would be wrong for 40% of the pairs of
individuals you might encounter, since without knowing their hair colour,
you would be wrong 50% of the time.

Neither 50% nor 40% mistakes would satisfy me, especially if I were
advertising that I could provide a service for money. I would not offer
that service at all if I could be wrong that often, especially if the
outcome mattered much. Of course I wouldn’t offer the service if the
outcome didn’t matter much, either.

Sixty percent
correct may not be much better than 50%, but it IS better. Causality
doesn’t enter into it; the correlation helps you make the choice that is
more likely to turn out well.

I think that scheme is a lot like the one so many amateur gamblers come
up with. If they just double their bets every time they lose, they will
eventually come out ahead. How long can you go on telling your customers
that being wrong about their problem 40% of the time is better than being
wrong half of the time? Being wrong 40% of the time isn’t acceptable,
either.

And anyway, the choice is not between 50% and 60% right for an
individual; it goes the other way for many statistical tests. Look at
that paper I did for Hershberger’s collection, showing that there was an
upward trend in a dataset where the same variables were related in the
direction opposite to the trend, for every individual in the set.
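The effect Bill describes is easy to construct: make every individual's own x-y relation negative while individuals with larger x sit at larger y overall. The dataset below is a toy illustration, not the one from the Hershberger paper:

```python
# Three individuals, each measured at three x values.  Within each
# person, y falls as x rises; but persons with higher x ranges also
# have higher y overall, so the pooled trend reverses.
data = {
    "A": [(1, 5), (2, 4), (3, 3)],
    "B": [(4, 9), (5, 8), (6, 7)],
    "C": [(7, 13), (8, 12), (9, 11)],
}

def slope(points):
    # Ordinary least-squares slope of y on x.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

for name, pts in data.items():
    print(f"{name}: slope {slope(pts):+.2f}")
pooled = [p for pts in data.values() for p in pts]
print(f"pooled: slope {slope(pooled):+.2f}")
```

Every person's own slope is -1.00, yet the pooled slope is +1.10: the group trend is true of no individual in the group.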

You do not have enough capital to survive being wrong 40% of the time. It
wouldn’t help the gambler to know that a certain strategy would increase
his chances of winning from 50% to 60%. Sooner or later he would have an
unlucky streak and be unable to increase his bet enough to try again. And
the doctor who chose his treatments on such a narrow margin of success
would run out of patients. An attrition rate of 40% per treatment would
reach a limit pretty fast.

You have to
show that some statement is true of essentially every member of a group
before you can use it successfully to predict what any individual in the
group will do.

This correct statement is what seems to lead you astray. If what you
really want is to predict SUCCESSFULLY what ANY individual will do, then
it is true that you have to show that it is true of essentially every
individual.

That’s what I want. Or more to the point, I want to find statements that
are true of each individual, even if it’s a different statement for each
one. That’s what you get from testing specimens. Maybe the greatest
delusion in statistics is the idea that there are statements that are
true of everyone in a group. How many points in a scatter diagram are ON
the regression line?

If, on the other
hand, you want to do your best for the individual in front of you at this
moment, the situation is different.

I don’t want credit for doing my best. I want credit for succeeding. If I
don’t succeed, I will know it even if nobody else does. The individual in
front of me isn’t interested in my Brownie points, either. He wants his
car, or his warts, or his marriage fixed. If I can’t do it, he’ll find
someone who can. It’s only in grade school that you get credit for a
partial solution but a wrong answer.

Seldom in real life do you have the luxury of knowing exactly the effect
your service to the individual will have. What you do have is
perceptions you want to control,
and actions that might influence those perceptions. As you (Bill P.)
often point out, a given action does not always influence a particular
perception the same way: multiple means to a single effect. When the
effect of an action cannot be immediately observed (the effect of
education on later life success), or is irrevocable (dosing with a
probably beneficial but possibly lethal medication), then you can’t wait
until you can show the precise effect that your action will have. You
must educate (or not) now, not when the person is dying of old age; you
must medicate (or not), now, not when the person has succumbed to the
disease (or recovered spontaneously).

But if 40% of my patients die from the medication, there’s little solace
in knowing that 50% might have died without it. I would go on looking for
a better medication, rather than stopping to practice medicine before I
knew what I was doing.

Group statistics
provide information that helps you to judge the probabilities that your
actions will have this or that effect. In the case of education,
one-on-one interviews may improve the probabilities in respect of the
individuals, but you are always working with probabilities, not
“facts”.

Of course. But for some facts, sigma is 1.0, while for others it is 0.01
times the mean. And anyway, those probabilities apply to average
outcomes, not individual outcomes.

Sometimes the group
statistics are all you have to work with; sometimes you can do better.

Sometimes group statistics are all you need, as when you’re selling life
insurance or stocking shelves at a grocery store. But for others, you
need methods of testing specimens, which means taking the time to study
individuals in as much detail as you can, rather than relying on
statistical tables.

But you do a
disservice to the individuals you want to serve if you arbitrarily
discard relevant information on the grounds that probabilities are not
certainties.

I have just the opposite feeling: I do a disservice if I use a treatment
based on bad data on the grounds that on the average it provides an
improvement – but knowing that for any individual, it will most probably
be useless or harmful. This is not about my track record; it’s about the
person I hope to help.

Best,

Bill P.

[From Rick Marken (2007.07.21.1050)]

Bill Powers (2007.07.20.1715 MDT) --

I have always agreed, and have said many times, that there are valid uses
for group statistics. I don't know where you get off implying otherwise.

Well, I'll get off implying that you don't like statistics when you
get off implying that I believe a valid use of group level statistics
is to establish facts about individuals, which is what you did when
you pointed out the ridiculous prediction of a negative infant
mortality in the US based on the regression equation using per capita
income as the predictor. Prediction based on the regression equation
is a group level use of statistics. Anyone who thinks they know
something about individuals based on that equation is making a serious
mistake, a point that I always make in my statistics and research
methods courses.

In the second place, if the goal is to make the teaching look good, then
group statistics is the way to go: it doesn't matter which students benefit,
as long as more people with the desired characteristics are obtained through
the screening process than without it. The ones who are rejected are not
your concern. If you pick people from group A (students with high SAT
scores), group B (the graduates from the university) will do better than
graduates who had low SAT scores would do. It doesn't matter which
individuals are more successful -- that is why you can use group statistics.
And it doesn't matter if quite a few do worse than expected, as long as the
average goes the right way.

Right; they select in order to improve _group_ level performance. The
regression equation helps you do exactly what you say above; pick a
group of students who will do better (in some sense) than a group
selected randomly. Certainly that will involve selecting some
individuals who will actually do poorly (False Alarm) and rejecting
some individuals who would have done well (Miss). But the regression
analysis allows you to have fewer False Alarms and Misses than you
would have had had you simply selected students at random. But, again,
this is a group level use of statistics. No one should imagine that
the group relationship between SAT and college performance (such as it
is) says anything about any particular individual.
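Rick's False Alarm/Miss argument can be simulated. Assume (illustratively) a predictor correlating r = 0.5 with later performance, admit everyone above the predictor's mean, and call above-average performance "success":

```python
import random

random.seed(3)

# Predictor score correlates r = 0.5 with later performance.
r, n = 0.5, 200_000
admitted_succeed = random_succeed = admitted = 0
for _ in range(n):
    score = random.gauss(0, 1)
    perf = r * score + (1 - r * r) ** 0.5 * random.gauss(0, 1)
    if score > 0:                       # selected by the test
        admitted += 1
        admitted_succeed += perf > 0
    random_succeed += perf > 0          # random-admission baseline
print(f"success rate, test-selected: {admitted_succeed / admitted:.2f}")
print(f"success rate, random:        {random_succeed / n:.2f}")
```

Test-based selection succeeds about two-thirds of the time versus half for random admission: better at the group level, while still misclassifying roughly a third of the individuals, which is both Rick's and Bill's point.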

I would prefer individual interviews by the people who are going to teach
the students, with desire and ability to learn being weighted more than past
performance.

You just want different predictors than SAT. That's fine. But you are
still going to be making False Alarms and Misses (at the group level)
using this criterion. A regression analysis would tell you whether
your False Alarm/Miss rate is better with the interview than the SAT
data AT THE GROUP LEVEL.

A person who doesn't care about any of
the individual applicants can safely ignore this question, and use group
statistics, and be sure of showing a good record of successes, because
success is measured as a group average.

Right. Though I think it's a little rude to think that a person who
wants to improve things at the group level doesn't care about
individuals. I think the occasional conflict between group and
individual benefit is one of the dilemmas of living in a society. For
example, I think the evidence is overwhelming that single payer will
be beneficial at the group level. But I know that that doesn't mean
it's going to be better for any particular individual. Some people
will end up getting worse care than they would have gotten if they had
just come into an emergency room without insurance. It's like the
seatbelt laws; some fraction of people who wear seat belts as required
by group policy would have been better off if they had not worn them.

If a student has any choice, my advice would be not to take any of those
tests, because the chance of being wrongly classified is very high (or so I
claim, until someone shows me otherwise). Unfortunately, if you don't take
the tests you don't get in, so you have to take them.

I think this Ruby Ridge attitude is a little uncooperative.
Classification based on the tests is surely going to be wrong quite
often at the individual level, but it is going to produce better group
results than just guessing. I personally don't mind the tests; I took
them because I knew it was right for the group and I didn't feel
wronged when my grades and SAT scores didn't get me into Harvard,
though I knew I actually belonged there;-) Just the liberal in me, I
guess.

We could eventually get that kind of data, but why not try it out on a much
smaller scale first?

Of course I would do a sampling study. And the results would apply to
groups. Statistics, as an inferential, decision making discipline,
applies at the group level only.

I wish you would stop trying to set me up as an enemy of statistics just
because I object to one egregious misuse of it.

I think you see a misuse where there is not one. Using some measure of
performance as a basis for selection so as to improve group level
results is not a misuse of statistics. The egregious misuse of
statistics in psychology is in experiments where groups of subjects
are tested in order to learn something about individual behavioral
organization. This is bread and butter psychological research and it
is a serious ongoing misuse of statistics.

I'm an enemy of the thoughtless, automatic, superstitious use of
group statistics as a way of finding out something about an individual. I
consider that to be a kind of formalized prejudice.

So am I. And the use of selection criteria to optimize group
characteristics is _not_ that kind of prejudice.

I think your spreadsheet is a sufficient example of a case where using the
linear regression line yields ludicrous predictions (negative infant
mortality rates), and entails large quantitative prediction errors (100% or
more) in over a quarter of the individual cases, not to mention generating
mis-rankings by 20+ places (United States predicted third from best,
actually 25th from best).

I would say this is a misuse of the data in that spreadsheet. The
ludicrous predictions at the individual level are not the point. The
point is that _at the group level_ the average error in predicting
infant mortality is smaller when using Y' than when using the average
Y or a random guess, YRnd. That is, the average deviation of Y from Y'
(actual from linearly predicted infant mortality) is smaller than the
average deviation of Y from Ybar (actual from average infant
mortality) or Y from YRnd (actual from randomly selected infant
mortality). In fact average Y-Y' for that data is 18 and average
Y-Ybar is 32. That's the group level improvement in prediction.
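
The comparison Rick describes, mean |Y - Y'| versus mean |Y - Ybar|, can be worked through on synthetic data. The numbers below are invented stand-ins for the spreadsheet, chosen only so the scatter is large:

```python
import random
import statistics

random.seed(2)

# Hypothetical data standing in for the spreadsheet: a predictor X
# (think log income) and an outcome Y (think infant mortality),
# negatively related with plenty of scatter.
xs = [random.uniform(2.5, 5.0) for _ in range(100)]
ys = [200 - 40 * x + random.gauss(0, 25) for x in xs]

# Ordinary least-squares fit by hand.
xbar = statistics.mean(xs)
ybar = statistics.mean(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
intercept = ybar - slope * xbar
y_pred = [intercept + slope * x for x in xs]

# Group-level comparison: average absolute error of the regression
# prediction Y' versus the constant prediction Ybar.
mad_regression = statistics.mean(abs(y - yp) for y, yp in zip(ys, y_pred))
mad_mean = statistics.mean(abs(y - ybar) for y in ys)
print(f"mean |Y - Y'|   = {mad_regression:.1f}")
print(f"mean |Y - Ybar| = {mad_mean:.1f}")
```

Whenever the correlation is nonzero, the regression prediction beats the constant prediction on average, which is exactly the group-level claim; it says nothing about any single case.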

I'm willing to pipe down if someone can show me that predicting individual
behavior from group statistics, at correlation levels commonly accepted as
good, does not seriously misrepresent large numbers of individuals.

No one disagrees with that; we know that the group level statistics
misrepresent large numbers of individuals. What Martin and I are
saying (I think) is that correlation/regression can be used to improve
group outcomes. We are not advocating the use of statistics to make
statements about individuals; we know that that's a mistake. We are
saying that using the prediction equation (when r2 is greater than 0)
can improve your group level predictions over what you would do by
predicting randomly or predicting based on how you feel about each
individual after interviewing them.

Best

Rick

···

--
Richard S. Marken PhD
Lecturer in Psychology
UCLA
rsmarken@gmail.com

While the overall trend is clear
and could be used to predict the average GPA over this population, given
the average SAT score, using it to accept or reject one individual would
be very questionable.

Perhaps, but less questionable than refusing to use that information when
you have it available.
[From Bill Powers (2007.07.21.1145 MDT)]

Martin Taylor 2007.07.21.13.01–

But if I used it, I would overestimate the GPA expected for most of the
people with the same range (650 to 750) of SAT scores, by a rather large
amount, and thus admit a lot of people into a system where they were
bound to disappoint. Look at the distribution: at the high end of the SAT
scores, there are lots of scores all the way to the left margin of the
data. And in the low half, there are many more GPA scores to the right of
the trend line than to the left. The test underestimates people who would
have done at least average work, and overestimates people who would not have.

The real
question is whether it’s worth going to all the trouble to collect the
data in the first place, if the benefit of using it is so
slight.

There we are definitely in agreement. Where we disagree is on the
propriety of using facts that are more likely to be false than true of an
individual. I’m sure that this is not how you would put it. The real
question is, at what level of statistical certainty do facts change from
being probably true of an individual to being probably false? I have
shown that hypotheses can be true of a group yet false for every
individual in the group. EVERY individual. Just to remind you:

Increasing rewards increases the amount of work done by the group, on the
average, when for every individual, increasing the reward decreases the
amount of work done very sharply. The hidden factor, which you can find
only by testing each individual, is the amount of reward wanted. People
with higher reference levels for reward will both work harder and receive
more rewards than people with lower reference levels. But for each
person, as the amount of reward increases toward the reference level, the
amount of work done rapidly decreases, going to zero when the amount of
reward reaches the amount wanted.

Best,

Bill P.

[From Rick Marken (2007.07.21.1115)]

Kenny Kitzke (2007.07.20) --

High correlations between two variables are necessary but not sufficient
evidence to establish that a selected change in one will cause a known
change in the other. The evidence is provided by an experiment. I fail to
see what issue you are actually raising with my statement? Does my
clarification still cause you difficulty?

Just a little. Even a low correlation would be counted as indicating
causality in an experiment, from a conventional point of view, so a
high correlation is not even a necessary condition for causality.

My point is that it's not the correlation coefficient but the way the
data was collected that implies or doesn't imply causality, from the
conventional point of view. It's just a little pet peeve of mine. The
statistics text that I use has the "correlation does not imply
causality" mantra in the chapter on correlation and I just think it's
misleading, even from the conventional point of view. It suggests that
data analyzed using a correlation coefficient does not imply causality
and that's not true. The statistic implies nothing; it's the
methodology that does the implying. The mantra could lead a student
to believe that a statistic, like r, could somehow imply something
about causality. I think the mantra implicitly suggests, incorrectly,
that a statistic, like F, _does_ imply causality.

Best

Rick


[From Rick Marken (2007.07.21.1140)]

Kenny Kitzke (2007.07.20)--

Statistical data is not only sample data. If we measure the weight of all
the people, say 20, attending the CSG Conference, the calculated average and
standard deviation are population statistics.

Only if the 20 people are considered the population.

This is so common that the
Greek letter sigma is used for the SD rather than the English s which would
apply to say a random sample of 5 of the 20 attendees.

Yes, the Greek letters symbolize the population parameters that are
estimated by Latin letter sample statistics. Mu is the population mean
that is estimated, with determinable accuracy, by a sample mean, M,
for example. Mu is a population parameter that is estimated by the
sample statistic, M.

And, with such population statistics one can make highly reliable
predictions. What is the chance that if 10 ride to their dorm room on the
elevator, they will exceed the 2,000 pound elevator weight limit? If we
took a random sample of 5 of the 20 attendees, we could use the sample mean
and standard deviation to also make such predictions...but the confidence
would be substantially lower.

Of course. We don't need to use statistics if we can measure the
entire population. Statistical theory was developed to deal with
problems of estimating the characteristics of populations that are
considerably larger than 20.
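
Kenny's elevator question can be worked through with a normal approximation. The mean and SD below are assumed for illustration, and with only 20 attendees and sampling without replacement a finite-population correction would strictly apply; this sketch ignores it:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers: suppose the attendees' weights have a population
# mean of 170 lb and SD of 30 lb. If 10 people ride together, what is
# the chance their total exceeds the 2,000-lb limit? Treating the total
# of 10 weights as approximately normal:
mu, sigma, n, limit = 170.0, 30.0, 10, 2000.0
total = NormalDist(mu=n * mu, sigma=sigma * sqrt(n))
p_over = 1 - total.cdf(limit)
print(f"P(total weight > {limit:.0f} lb) = {p_over:.5f}")
```

With these assumed parameters the total has mean 1,700 lb and SD about 95 lb, so exceeding 2,000 lb is a three-sigma event, which is the sense in which such population-level predictions can be "highly reliable."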

Best

Rick


[Martin Taylor 2007.07.21.15.36]

[From Bill Powers (2007.07.21.1010 MDT)]

....
With a high likelihood of misclassifying the person and wrongly treating the person. How high? That can be calculated, but nobody wants to touch that calculation with a 10-foot yardstick, apparently.

Of course it can be calculated, and easily, too. Did my detailed tutorial on how to do that [Martin Taylor 2007.07.17.11.08] simply not arrive in your mailbox? It arrived in mine, and I've made reference to it several times in this discussion. And yet you persist in saying that nobody wants to touch that, as if it were some taboo rather than a trivial operation. Rick pointed out that you can get your answer from standard tables or the outputs of standard programs.

I guess you really are controlling for a perception of "statistics" with a value of "useless", and doing so at a rather higher gain than mere skepticism would warrant.

Martin

Well, I’ll get off implying that you don’t like statistics when you get off implying that I believe a valid use of group level statistics is to establish facts about individuals, which is what you did when you pointed out the ridiculous prediction of a negative infant mortality in the US based on the regression equation using per capita income as the predictor. Prediction based on the regression equation is a group level use of statistics. Anyone who thinks they know something about individuals based on that equation is making a serious mistake, a point that I always make in my statistics and research methods courses.

Kenny must have meant that it is a mistake to use statistical measures based on group data (like the least squares regression line) as a predictor of individual results.

And if this is a mistake (which I don’t think it is) then you’ve got a heck of ‘splainin’ to do to a lot of college admissions officers, who use grades as a basis for predicting school performance, and business executives, who use data on past performance as a basis for hiring. This is all I’ve been talking about: they select in order to improve group level performance.
[From Bill Powers (2007.07.21.1205 MDT)]

Rick Marken (2007.07.21.1050) –

Then please explain the quote I put in my next to last post to
Martin:

There you say you don’t think it is a mistake to use the regression line
as a predictor of individual results. You were using the regression line
as a predictor of individual countries’ infant mortality rates, which you
say just above is not a mistake, while two paragraphs above you say you
always teach that it is a mistake. What am I not getting here?

Yes, this is the classical brick wall you run into with bureaucrats:
sorry, that’s the policy.

The regression equation helps you do exactly what you say above; pick a group of students who will do better (in some sense) than a group selected randomly. Certainly that will involve selecting some individuals who will actually do poorly (False Alarm) and rejecting some individuals who would have done well (Miss). But the regression analysis allows you to have fewer False Alarms and Misses than you would otherwise. Again, this is a group level use of statistics. No one should imagine that the group relationship between SAT and college performance (such as it is) says anything about any particular individual.

Right. And therefore if your aim is to serve or work with individuals,
you should not use group statistics to evaluate the individuals. You
should use modeling and experiments, or the method of levels, or any
other way of testing specimens.

A person who doesn’t care about any of the individual applicants can safely ignore this question, and use group statistics, and be sure of showing a good record of successes, because success is measured as a group average.

Right. Though I think it’s a little rude to think that a person who wants to improve things at the group level doesn’t care about individuals.

If the admissions officer says “I’m sorry, we’re accepting only SAT
scores above 700 this year,” and turns an individual away, that
officer cares more for the school’s policies than that individual
student’s education. Is it rude to say that? Maybe. Is it true?
Yes.

I think the occasional conflict between group and individual benefit is one of the dilemmas of living in a society. For example, I think the evidence is overwhelming that single payer will be beneficial at the group level. But I know that that doesn’t mean it’s going to be better for any particular individual. Some people will end up getting worse care than they would have gotten if they had just come into an emergency room without insurance. It’s like the seatbelt laws; some fraction of people who wear seat belts as required by group policy will have been better off if they had not worn them.

I think we have the heart of my complaint right here. I keep asking
“What are the chances of guessing wrong about an individual,”
and the answer I keep getting is, “Well, of course sometimes we will
be wrong about an individual.” I want to know what
“sometimes” means. If “Some people” means 2% of them,
that’s unfortunate but hard to imagine improving. However, if it means
40% of them, that’s an entirely different matter. If it means 75%, we
have a case verging on fraud. So HOW DO WE CALCULATE THE ODDS THAT AN
INDIVIDUAL WILL BE MISCLASSIFIED, MISDIAGNOSED, MISTAKENLY REJECTED AND
SO ON? And after I find out how, I want to see it actually done, not just
talked about. So far most of what is happening is talk. Let’s get down to
the equations. I have a distinct sensation of people trying to deal with
this issue without descending from the level of generalities to actually
working through it.
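
Under the standard assumption of a bivariate normal predictor and outcome, both split at their medians, the odds Bill is asking for have a closed form. This sketch is my own illustration of that calculation, not a quotation of Martin's tutorial:

```python
from math import asin, pi

# For a bivariate normal (X, Y) with correlation r, split at the
# medians: P(X > 0 and Y > 0) = 1/4 + arcsin(r) / (2*pi).
# The chance a selected individual is misclassified is then
# P(Y below median | X above median).
def misclassification_rate(r: float) -> float:
    """P(outcome below median, given predictor above median)."""
    p_both_high = 0.25 + asin(r) / (2 * pi)
    return 1 - p_both_high / 0.5  # condition on X > median (prob 1/2)

for r in (0.3, 0.5, 0.8, 0.95):
    print(f"r = {r:.2f}: P(misclassified) = {misclassification_rate(r):.2f}")
```

At r = 0.5 the rate is exactly one third, and at r = 0.3 it is about 40 percent, so for correlations commonly reported in psychology the answer to Bill's question sits in the "entirely different matter" range rather than the 2 percent range.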

If a student has any choice, my advice would be not to take any of those tests, because the chance of being wrongly classified is very high (or so I claim, until someone shows me otherwise). Unfortunately, if you don’t take the tests you don’t get in, so you have to take them.

I think this Ruby Ridge attitude is a little uncooperative.

I don’t get the allusion to Ruby Ridge. What did you mean? Something
about the FBI shooting a woman holding a baby? Or was that ATF? Maybe you
mean I was talking like a “little Eichmann”. That has a nice ring to it, though it’s been used.

If a student is turned down for entrance to a college, that is very
serious for the student, though the college can always find some other
student who tests better. If the chances of being turned down for invalid
reasons are very low, we can just shrug and say “It happens.”
But if the chances are very large, that’s not an option if we’re trying
to be fair. So it’s important to find out whether those chances are low
or high.

Classification based on the tests is surely going to be wrong quite often at the individual level, but it is going to produce better group results than just guessing.

It is, at the group level. But it’s possible to improve the group
measures while still doing damage to the majority of people in it. I
thought that sacrificing the good of individuals for the improvement of
the group is a bad thing – isn’t that called Fascism? Or do I have it
mixed up with Communism?

I think you see a misuse where there is not one. Using some measure of performance as a basis for selection so as to improve group level results is not a misuse of statistics.

Not unless you look at an individual and say, “Sorry, you’re not the
type of person we hire here.” Then you’re putting the group before
the individual and hiding behind statistics to do it. And you’re probably
shooting yourself in the foot if this person is a real gem.
I will go so far as to assert that you will always do better for
both individuals and groups by examining individual cases rather than by
using group statistics, if you can do it or if you care to take the time.
Group statistics are a poor substitute for careful examination of and
interaction with individuals. It’s cheaper, however, and often more
practical (with untrained or incompetent personnel doing the screening),
just to use some cookbook test with automatic scoring, so all that is
required is to turn the crank. But the fact that we’re forced to use an
inferior method doesn’t make it any better than it is.

The egregious misuse of statistics in psychology is in experiments where groups of subjects are tested in order to learn something about individual behavioral organization. This is bread and butter psychological research and it is a serious ongoing misuse of statistics.

I’m an enemy of the thoughtless, automatic, superstitious use of group statistics as a way of finding out something about an individual. I consider that to be a kind of formalized prejudice.

So am I. And the use of selection criteria to optimize group characteristics is not that kind of prejudice.

Not as long as the application is to predicting and dealing with
whole-group characteristics, such as amount of food consumed per day at a
ball game, or life expectancy, or voter preferences. But that too easily
shades into judgments of individuals, such as who gets into college and
who doesn’t, or who is of the right kind to marry my daughter.

I think your spreadsheet is a sufficient example of a case where using the linear regression line yields ludicrous predictions (negative infant mortality rates), and entails large quantitative prediction errors (100% or more) in over a quarter of the individual cases, not to mention generating mis-rankings by 20+ places (United States predicted third from best, actually 25th from best).

I would say this is a misuse of the data in that spreadsheet. The ludicrous predictions at the individual level are not the point.

You mean, “Pay no attention to that man behind the
curtain”?

The point is that at the group level the average error in predicting infant mortality is smaller when using Y’ than when using the average Y or a random guess, YRnd. That is, the average deviation of Y from Y’ (actual from linearly predicted infant mortality) is smaller than the average deviation of Y from Ybar (actual from average infant mortality) or Y from YRnd (actual from randomly selected infant mortality). In fact average Y-Y’ for that data is 18 and average Y-Ybar is 32. That’s the group level improvement in prediction.

Gaining this apparent advantage for the whole data set entails giving up
accuracy in predicting individual data points. The United States and many
other countries are given far too high a ranking by using the regression
line, and the actual mortality rates predicted by that line are in error
by up to 360%. If you used log income to predict infant mortality rate
according to that regression line, you would be wrong by more than 100%
for over a quarter of the countries on the list, and by more than 50% for
56 of them – almost half of the group. The standard error of 18 units
has to be compared with mortality values that range from 2.3 to 158 –
it’s artificially small at one end of the range and huge at the other
end, and meaningless because it’s a constant over the entire range. So I
don’t see how the regression line has told us anything useful about the
group.

The alternative you offer is to predict the mortality rates from the
average, but of course that would be even worse. The best fit, I should
think, would use a curve like a negative exponential, or the reciprocal
of income. Try using 100000/10^C2 (which is 100000/income) for the income
column. I get a correlation with infant mortality of 0.80, which is
approaching usefulness.

I’m willing to pipe down if someone can show me that predicting individual behavior from group statistics, at correlation levels commonly accepted as good, does not seriously misrepresent large numbers of individuals.

No one disagrees with that; we know that the group level statistics misrepresent large numbers of individuals.

How large a number?

What Martin and I are saying (I think) is that correlation/regression can be used to improve group outcomes. We are not advocating the use of statistics to make statements about individuals; we know that that’s a mistake. We are saying that using the prediction equation (when r2 is greater than 0) can improve your group level predictions over what you would do by predicting randomly or predicting based on how you feel about each individual after interviewing them.

If you can get good predictions of group phenomena this way, by all means do that. That’s not what I’m ranting about.

Best.

Bill P.